Learning Transferable Visual Models From Natural Language Supervision, Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever, 2021. Proceedings of the 38th International Conference on Machine Learning (ICML). DOI: 10.48550/arXiv.2103.00020 - This paper introduces CLIP, a model that learns shared representations for images and text through contrastive pre-training on a large dataset. It provides an example of how shared representations are learned and applied, particularly for cross-modal retrieval.
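To make the contrastive objective behind CLIP concrete, here is a minimal NumPy sketch of a symmetric InfoNCE-style loss over a batch of paired image/text embeddings. This is an illustrative toy, not CLIP's implementation: the function name, the fixed temperature value, and the use of NumPy are assumptions for the example.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss: matched image/text
    pairs (the diagonal of the similarity matrix) should score higher
    than every mismatched pair, in both retrieval directions."""
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    labels = np.arange(len(logits))     # pair i matches pair i

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image->text and text->image losses
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

With correctly paired embeddings the loss is lower than with shuffled pairings, which is the signal that drives the shared embedding space.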
FaceNet: A Unified Embedding for Face Recognition and Clustering, Florian Schroff, Dmitry Kalenichenko, and James Philbin, 2015. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. DOI: 10.1109/CVPR.2015.7298682 - This paper introduces the triplet loss function, a core technique for learning embeddings in a shared space where similar samples are pulled closer together and dissimilar samples are pushed further apart. It is relevant to the learning mechanism described.
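The pull-closer/push-apart mechanism of the triplet loss can be sketched in a few lines. This is a hedged illustration of the general hinge-style formulation (squared Euclidean distances with a margin, as in FaceNet); the function name, the margin value, and the NumPy framing are assumptions for the example, not the paper's code.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: the anchor-positive distance should be
    smaller than the anchor-negative distance by at least `margin`.
    Returns 0 when the triplet already satisfies the margin."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared L2 to the similar sample
    d_neg = np.sum((anchor - negative) ** 2)  # squared L2 to the dissimilar sample
    return max(d_pos - d_neg + margin, 0.0)
```

When the negative is already far enough away the loss vanishes, so training gradients come only from triplets that violate the margin.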