In this article, the authors identify the correlative and complementary relations among multiple modalities. They then propose a multimodal ensemble fusion model that captures both relations between two modalities, images and text, and explain why this ensemble fusion works. Word sense disambiguation and information retrieval serve as the two use cases for demonstrating the model's effectiveness. Experimental results on the University of Illinois at Urbana-Champaign Image Sense Discrimination (UIUC-ISD) dataset and the Google-MM dataset show that the ensemble fusion model outperforms approaches that use only a single modality for disambiguation and retrieval.
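The ensemble idea can be illustrated with a generic late-fusion sketch: each modality scores the candidate word senses independently, and the scores are combined so that one modality can compensate where the other is weak. The function names, weights, and example scores below are hypothetical placeholders for illustration, not the authors' exact formulation.

```python
from typing import Dict

def fuse_scores(text_scores: Dict[str, float],
                image_scores: Dict[str, float],
                text_weight: float = 0.6) -> Dict[str, float]:
    """Combine per-sense scores from the text and image modalities by a
    weighted sum, so evidence from each modality can complement the other.
    The weighting scheme here is an assumption, not the paper's model."""
    image_weight = 1.0 - text_weight
    senses = set(text_scores) | set(image_scores)
    return {
        sense: text_weight * text_scores.get(sense, 0.0)
               + image_weight * image_scores.get(sense, 0.0)
        for sense in senses
    }

# Example: disambiguating the word "bass" with made-up scores from each modality.
text_scores = {"fish": 0.4, "instrument": 0.6}    # from the surrounding text
image_scores = {"fish": 0.8, "instrument": 0.2}   # from the attached image
fused = fuse_scores(text_scores, image_scores)
predicted_sense = max(fused, key=fused.get)       # image evidence tips it to "fish"
```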
Authors:
Yang Peng, Xiaofeng Zhou, Daisy Zhe Wang, Ishan Patwa, Dihong Gong, Chunsheng Victor Fang
Bibtex:
@article{peng2016multimodal,
  title={Multimodal Ensemble Fusion for Disambiguation and Retrieval},
  author={Peng, Yang and Zhou, Xiaofeng and Wang, Daisy Zhe and Patwa, Ishan and Gong, Dihong and Fang, Chunsheng Victor},
  journal={IEEE MultiMedia},
  volume={23},
  number={2},
  pages={42--52},
  year={2016},
  publisher={IEEE}
}
Download:
[pdf]