Towards expressive, robust and generalisable multimodal learning
Abstract
Multimodal learning is a fundamental subfield of deep learning. It aims to process and understand multiple types of real-world signals (known as modalities) concurrently, such as vision, language, audio and remote sensing. By combining signals from different channels, a system can compensate for noise or ambiguity in any single channel and thereby greatly reduce errors. A broad range of tasks benefits from multimodal learning, such as visual/video question answering, multimodal retrieval, and product or content recommendation. However, because such systems usually process several channels simultaneously, their computational cost is extremely high and needs to be reduced. More recently, with the emergence of numerous large language models (LLMs), how to effectively integrate multimodal information into LLMs to exploit their strong knowledge has arisen as another research trend.
In this thesis, we aim to provide practical solutions to each of these fundamental issues. Specifically, starting from a typical multimodal task, multimodal sentiment analysis (MSA), we propose an effective multimodal fusion algorithm in the paper “Improving Multimodal Fusion with Hierarchical Mutual Information Maximisation for Multimodal Sentiment Analysis”. We further explore a more complex setting in which part of the input signals is missing, and design an efficient algorithm that quickly learns the alignment dynamics between two channels and imputes the missing signals with acceptable error. Next, we move to another representative multimodal task, video question answering (ViQA), and optimise the system’s efficiency, as current systems often carry substantial redundancy. This work combines compression of the input signals with system-level acceleration, and can be applied to a broad family of existing vision-language models (VLMs).
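To give a flavour of the fusion objective mentioned above: mutual-information-maximisation approaches typically optimise a contrastive lower bound (e.g. InfoNCE) on the mutual information between paired embeddings from two modalities. The sketch below is a minimal, illustrative version of such a bound using only NumPy; the function name, temperature value, and single-level formulation are assumptions for illustration, not the paper's actual hierarchical objective.

```python
import numpy as np

def infonce_lower_bound(x, y, temperature=0.1):
    """Illustrative InfoNCE-style lower bound on I(X; Y) for paired
    embeddings x, y of shape [batch, dim]. Matched pairs share a row
    index; a higher value indicates stronger cross-modal alignment."""
    # L2-normalise so the dot product becomes cosine similarity
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    logits = x @ y.T / temperature  # pairwise similarity matrix [batch, batch]
    n = logits.shape[0]
    # Row-wise log-softmax; the positive (matched) pair sits on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nce = log_probs[np.arange(n), np.arange(n)].mean()
    # InfoNCE bound: I(X; Y) >= E[log p(positive)] + log N
    return nce + np.log(n)
```

In training, the negative of this bound would serve as a loss term alongside the task loss, encouraging embeddings of the same sample from different modalities to agree.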
Speaker’s profile

Wei Han is a fifth-year PhD student in ISTD, SUTD, advised by Professor Soujanya Poria. His primary research interests include multimodal learning and several advanced topics in large language models, such as context modelling, supervised fine-tuning, and RL-based algorithms. Prior to joining SUTD, he obtained his Master of Philosophy degree from the Hong Kong University of Science and Technology (HKUST) in 2020.