An Illustrative Multimodal Task: Generating Image Descriptions
Show and Tell: A Neural Image Caption Generator, Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, 2015. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE). DOI: 10.1109/CVPR.2015.7298935 - This paper introduces a seminal end-to-end neural network model for image captioning, combining a convolutional neural network (CNN) for visual feature extraction with a recurrent neural network (RNN) for sentence generation.
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang, 2018. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE). DOI: 10.1109/CVPR.2018.00696 - This work advanced image captioning by combining a bottom-up mechanism, which proposes salient object-level image regions, with a top-down mechanism that uses task context to weight those regions, leading to more accurate and detailed descriptions.