Multimodal Machine Learning: A Survey and Taxonomy, Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency, 2018. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 41 (IEEE). DOI: 10.1109/TPAMI.2018.2798607 - Provides a comprehensive overview of multimodal machine learning, categorizing different fusion strategies and applications relevant to understanding diverse inputs and outputs.
VQA: Visual Question Answering, Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh, 2015. International Conference on Computer Vision (ICCV) (IEEE). DOI: 10.1109/ICCV.2015.279 - Introduced the task and dataset for Visual Question Answering, demonstrating how systems process both image and text inputs to produce a textual answer.