Coordinated Representations: Mapping Between Modalities
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang, 2018. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society. DOI: 10.1109/CVPR.2018.00636 - Presents a seminal attention-based model for image captioning, illustrating direct mapping from visual features to text sequences.
Learning Transferable Visual Models From Natural Language Supervision, Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever, 2021. arXiv preprint arXiv:2103.00020. DOI: 10.48550/arXiv.2103.00020 - Introduces a highly influential model that learns robust cross-modal representations through contrastive pre-training, enabling effective correlation and retrieval tasks.
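The contrastive objective behind the second reference can be sketched in a few lines of NumPy: paired image and text embeddings are L2-normalized, their pairwise cosine similarities (scaled by a temperature) form a logit matrix, and a symmetric cross-entropy pulls each matched pair onto the diagonal. The sketch below uses hand-built toy embeddings, not features from the actual model, and the `temperature` value is illustrative.

```python
import numpy as np

def contrastive_logits(image_emb, text_emb, temperature=0.07):
    """Cosine-similarity logits between L2-normalized image and text embeddings."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return img @ txt.T / temperature

def symmetric_infonce_loss(logits):
    """Cross-entropy in both directions; matched pairs lie on the diagonal."""
    n = logits.shape[0]
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()
    return 0.5 * (xent(logits) + xent(logits.T))

# Toy batch of 4 paired examples: each text embedding points mostly at
# its own image, with a small component toward a neighboring one.
image_emb = np.eye(4)
text_emb = 0.9 * np.eye(4) + 0.1 * np.roll(np.eye(4), 1, axis=1)

logits = contrastive_logits(image_emb, text_emb)
# Retrieval: the nearest text for each image is its own pair (the diagonal).
print(np.argmax(logits, axis=1))  # -> [0 1 2 3]
print(symmetric_infonce_loss(logits))
```

Once trained, the same similarity matrix that defines the loss doubles as the retrieval score: ranking texts by a row of `logits` performs zero-shot image-to-text search, which is what makes this coordinated-representation approach effective for cross-modal correlation.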