Text this: Multimodal one-shot learning of speech and images