Learning Transferable Visual Models From Natural Language Supervision, Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever, 2021. Proceedings of the 38th International Conference on Machine Learning, Vol. 139 (PMLR). DOI: 10.5555/3540306.3540445 - This paper presents CLIP, a model that learns robust shared representations for images and text using contrastive learning, which enables direct cross-modal comparison and shows the utility of shared embedding spaces.
Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, 2013. International Conference on Learning Representations (ICLR) Workshop. DOI: 10.48550/arXiv.1301.3781 - This work introduces Word2Vec, a method for learning dense vector representations of words in which semantic relationships are captured by vector proximity and cosine similarity (see the sketch after this list), illustrating the principles of embedding-based comparison.
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016 (MIT Press) - Chapter 15, 'Representation Learning,' explains how neural networks learn meaningful representations of data, which underpins the feature vectors and shared embedding spaces used for cross-modal comparison.
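As a minimal sketch of the vector-comparison idea these references share: cosine similarity scores how closely two embeddings point in the same direction, and the same measure applies whether the vectors represent words (Word2Vec) or images and captions mapped into one shared space (CLIP). The vectors below are made-up placeholders, not outputs of either model.

```python
# Minimal sketch of similarity-based comparison in an embedding space.
# All vectors are toy values for illustration only.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy word vectors; in practice these would come from a trained model.
king  = np.array([0.80, 0.65, 0.10])
queen = np.array([0.75, 0.70, 0.15])
apple = np.array([0.10, 0.20, 0.90])

print(cosine_similarity(king, queen))  # high: semantically related words
print(cosine_similarity(king, apple))  # low: unrelated words

# The same comparison works across modalities once image and text encoders
# map into one shared space, which is the core idea behind CLIP.
image_embedding = np.array([0.72, 0.68, 0.12])  # placeholder image vector
text_embedding  = np.array([0.78, 0.66, 0.10])  # placeholder caption vector
print(cosine_similarity(image_embedding, text_embedding))
```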