Learning Transferable Visual Models From Natural Language Supervision. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. arXiv preprint arXiv:2103.00020. DOI: 10.48550/arXiv.2103.00020 - Introduces CLIP, a neural network trained on a wide variety of image-text pairs that efficiently learns visual concepts from natural language supervision, making its text encoder highly effective for text-to-image conditioning.
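CLIP's contrastive objective scores every image against every caption in a batch via scaled cosine similarity. A minimal sketch of that scoring step, assuming pre-computed embeddings (function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def clip_similarity(image_embs, text_embs, temperature=0.07):
    # L2-normalize each embedding, then take pairwise dot products,
    # giving cosine-similarity logits scaled by a temperature,
    # as in CLIP's contrastive loss.
    img = image_embs / np.linalg.norm(image_embs, axis=-1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    return img @ txt.T / temperature

# Toy batch: 2 images, 3 captions, 8-dim embeddings.
rng = np.random.default_rng(0)
logits = clip_similarity(rng.normal(size=(2, 8)), rng.normal(size=(3, 8)))
print(logits.shape)  # one logit per image-text pair: (2, 3)
```

During training, a symmetric cross-entropy loss over these logits pulls matching pairs together and pushes mismatched pairs apart.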
High-Resolution Image Synthesis with Latent Diffusion Models. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. DOI: 10.48550/arXiv.2112.10752 - Presents Latent Diffusion Models, which significantly reduce the computational cost of high-resolution image synthesis by running diffusion in a learned latent space. It details conditioning via cross-attention over text embeddings (such as CLIP's) and integration with classifier-free guidance, forming the basis for models like Stable Diffusion.
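The classifier-free guidance step used at sampling time combines the model's unconditional and text-conditional noise predictions. A minimal sketch, assuming both predictions are already computed (function name is illustrative, not from the paper):

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    # eps = eps_uncond + s * (eps_cond - eps_uncond)
    # s = 1 recovers the plain conditional prediction;
    # s > 1 amplifies the direction the text conditioning pushes toward.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy noise predictions (real ones come from two U-Net forward passes).
eps_u = np.zeros(4)
eps_c = np.ones(4)
print(classifier_free_guidance(eps_u, eps_c, 7.5))
```

In practice this runs at every denoising step; the guidance scale trades prompt fidelity against sample diversity.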