Learning Transferable Visual Models From Natural Language Supervision, Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever, 2021. Proceedings of the 38th International Conference on Machine Learning, Vol. 139 (PMLR). DOI: 10.48550/arXiv.2103.00020 - Introduces Contrastive Language-Image Pre-training (CLIP), which learns a shared embedding space for text and images, serving as a strong text encoder and guidance mechanism for text-to-image synthesis.
High-Resolution Image Synthesis with Latent Diffusion Models, Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer, 2022. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE). DOI: 10.1109/CVPR52688.2022.01047 - Describes Latent Diffusion Models, the architecture behind Stable Diffusion, which uses a compressed latent space for efficient high-resolution image generation with text conditioning via cross-attention and classifier-free guidance.