Learning Transferable Visual Models From Natural Language Supervision, Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever, 2021, arXiv. DOI: 10.48550/arXiv.2103.00020 - Introduces CLIP, a model for learning joint embeddings of images and text, highly relevant for cross-modal similarity.
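A minimal sketch of the cross-modal similarity use case this entry refers to, assuming the Hugging Face `transformers` implementation of CLIP and the `openai/clip-vit-base-patch32` checkpoint (neither is part of the original entry, and the image path is a placeholder): text and image are encoded into a shared embedding space, and their scaled cosine similarities are read off as image-text logits.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained CLIP model and its paired preprocessor (illustrative checkpoint).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image and candidate captions; replace with your own data.
image = Image.open("example.jpg")
texts = ["a photo of a cat", "a photo of a dog", "a diagram of a neural network"]

# Tokenize the texts and preprocess the image into model-ready tensors.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

# Forward pass: logits_per_image holds the scaled image-text cosine similarities.
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)

# Rank the captions by how well they match the image (zero-shot classification).
for text, p in zip(texts, probs[0].tolist()):
    print(f"{p:.3f}  {text}")
```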