Learning Transferable Visual Models From Natural Language Supervision, Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever, 2021 (arXiv preprint arXiv:2103.00020, DOI: 10.48550/arXiv.2103.00020) - Describes CLIP, a foundational model that aligns images and text, enabling zero-shot recognition and serving as a basis for many modern vision-language models.
Vision (GPT-4o) - OpenAI API Reference, OpenAI, 2024 (OpenAI) - Official guide on using OpenAI's vision capabilities, including details on image formatting (Base64), prompt construction, and supported models like GPT-4o.
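As a companion to the API reference above, here is a minimal sketch of the Base64 image workflow that guide describes. It assumes the Python openai SDK (v1.x), an OPENAI_API_KEY set in the environment, and a placeholder local file example.jpg; the model name and prompt are illustrative choices, not prescribed by the reference.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder path: any local JPEG/PNG works.
with open("example.jpg", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

# Send the image as a Base64 data URL alongside a text prompt.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```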
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, Aurélien Géron, 2022 (O'Reilly Media) - A practical guide covering deep learning concepts, including image processing and neural network architectures; useful for building the foundations before working with advanced multimodal models.