Yang Peng Huge amount of multimedia data is generated on the Internet everyday. In order to utilize multimodal data, researchers have developed a lot of multimodal machine learning models to integrate data of multiple modailities, including text, images, audios and videos. In the multimedia analysis community, multimodal fusion is greatly employed for various multimedia analysis