A huge amount of multimedia data is generated on the Internet every day. To utilize this data, researchers have developed many multimodal machine learning models that integrate multiple modalities, including text, images, audio, and video. In the multimedia analysis community, multimodal fusion is widely employed for various analysis tasks, such as event detection. Fusion schemes are commonly divided by the level at which fusion occurs: early fusion, late fusion, and hybrid fusion. The most widely used strategy is to fuse information at the feature level, known as early fusion. The second approach fuses multiple modalities in the semantic space at the decision level, known as late fusion. Early (feature-level) fusion can exploit the correlation between features from different modalities at an early stage, while late (ensemble) fusion is more flexible in its choice of feature representations and learning algorithms for each modality, and more scalable in the number of modalities. A third scheme, hybrid fusion, combines both early and late fusion.
In our project on multimodal word sense disambiguation (WSD) and information retrieval (IR), multimodal fusion is examined from a deeper perspective. From our observations on multimodal datasets, we identified two important properties of multimodal data that help explain why multimodal fusion works: there exist correlative and complementary relations among multiple modalities. We use images and text to illustrate these concepts.
- Correlative Relation: At the semantic level, images and textual sentences in the same document tend to contain semantic information describing the same objects or concepts. Because images and text in the same document have this semantic correlative relation, they tend to be correlated in the feature spaces as well. Images and text also display a certain correlation at the decision level. For example, in our WSD experiments, some images and textual sentences are correctly classified to the same senses.
- Complementary Relation: At the semantic level, images and text complement each other by containing different semantic information. For example, in the WSD case, textual sentences carry more useful and relevant information for disambiguation in some documents, while images carry more useful information in others. At the decision level, image processing (disambiguation or retrieval) and text processing (disambiguation or retrieval) are also complementary, for two reasons: first, the semantic information in images and text is complementary; second, text processing usually has higher precision but lower recall, while image processing has lower precision but higher recall.
Previous research on multimodal disambiguation and retrieval mostly focused on early fusion: developing unified representation models of text and images based on the correlation between their features, and then applying classification techniques on top of these unified representations to solve different tasks. In contrast, our ensemble fusion model, a late fusion model, can capture the complementary relation between text and images, which most previous work on multimodal disambiguation and retrieval ignored.
In our ensemble fusion model, images and text are first processed separately to produce decision-level results. These results are then combined using different approaches, including the linear rule, the maximum rule, and logistic regression classification, to generate the final results. Our ensemble fusion model is shown in the figure below. We use "score" to denote the results from text processing and image processing: for disambiguation, the score refers to the confidence scores of senses; for retrieval, it refers to the similarity score of a document to the query document.
The linear rule uses a weight λ to combine the scores from image processing and text processing. The maximum rule selects the highest confidence or similarity score from text processing and image processing. For logistic regression, the confidence or similarity scores are used as features to train a logistic regression classifier.
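The three combination rules can be sketched as follows. This is a minimal illustration, not the original system's code: the function names, the default λ, and the toy score arrays are all assumptions for the sake of the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_rule(text_scores, image_scores, lam=0.5):
    """Weighted combination: lam * text score + (1 - lam) * image score."""
    return lam * np.asarray(text_scores) + (1 - lam) * np.asarray(image_scores)

def maximum_rule(text_scores, image_scores):
    """Element-wise maximum of the two modalities' scores."""
    return np.maximum(text_scores, image_scores)

def train_lr_fusion(text_scores, image_scores, labels):
    """Treat the per-modality scores as a 2-D feature vector and
    train a logistic regression classifier on labeled examples."""
    X = np.column_stack([text_scores, image_scores])
    clf = LogisticRegression()
    clf.fit(X, labels)
    return clf

# Toy usage (illustrative scores only):
fused = linear_rule([0.8, 0.2], [0.4, 0.6], lam=0.5)   # -> [0.6, 0.4]
picked = maximum_rule([0.8, 0.2], [0.4, 0.6])          # -> [0.8, 0.6]
clf = train_lr_fusion([0.9, 0.8, 0.1, 0.2],
                      [0.7, 0.9, 0.2, 0.1],
                      [1, 1, 0, 0])
```

The linear and maximum rules need no training data, while the logistic regression fusion learns how much to trust each modality from labeled examples.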
Our ensemble model is simple but powerful. Experimental results on two multimodal datasets demonstrate that the ensemble fusion model achieves higher performance than image-only and text-only approaches. In addition, the model can be viewed as a general framework for multimodal fusion: new fusion approaches for combining the results of text and image processing, or new text and image processing methods themselves, can be plugged in. It can also be extended beyond images and text to more modalities, such as audio and video.
The multimodal word sense disambiguation part of this project was published in IEEE ISM 2015. The comprehensive work was published in IEEE MultiMedia (April/June 2016 issue). We also published a separate paper on large-scale image retrieval with multimodal fusion in FLAIRS 2016.