Towards trustworthy and explainable AI for multimodal hate content moderation
The proliferation of hateful multimodal content, particularly hateful memes, poses significant threats to online safety and social cohesion. Although deep learning systems, especially vision-language models, are essential to automated multimodal content moderation, they operate as black boxes, offering limited insight into their decision-making processes. As a result, they cannot generate human-understandable explanations for flagged content, which are crucial for informed moderation decisions. In addition, these models are prone to inheriting biases from training data and struggle to generalise to new and adversarial inputs, raising concerns about fairness, accountability, and robustness in real-world deployments.
This dissertation contributes to the development of trustworthy and explainable AI for multimodal content moderation by addressing three key challenges: (1) detecting and mitigating biases in vision-language models, (2) generating human-readable explanations for model decisions, and (3) improving generalisation to unseen or out-of-distribution hateful content. The first part examines the causes of misclassification in hateful meme detection and finds that errors often stem from spurious cross-modal correlations and biased lexical cues. To address this, we introduce InstructMemeCL, a contrastive instruction fine-tuning framework that clusters semantically similar samples and separates dissimilar ones in the embedding space, thereby reducing misclassifications. The second part focuses on explainability through natural language explanations. We introduce HatReD, the first multimodal dataset annotated with human rationales for hateful memes, and IntMeme, a large multimodal model with a dual-encoder architecture that generates explanations prior to classification. The final part addresses generalisation in low-resource and distribution-shifted settings. We propose a few-shot in-context learning framework that transfers knowledge from text-based hate speech detection to multimodal hate speech classification without requiring fine-tuning.
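To make the contrastive idea concrete, the minimal PyTorch sketch below implements a generic supervised contrastive objective of the kind such a framework builds on: embeddings of same-label memes are pulled together while different-label embeddings are pushed apart. The function name, tensor shapes, and temperature are illustrative assumptions rather than the dissertation's actual implementation.

```python
import torch
import torch.nn.functional as F


def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Pull same-label samples together and push different-label samples apart.

    embeddings: (batch, dim) meme representations from a vision-language encoder
    labels:     (batch,) binary hateful / non-hateful labels
    """
    z = F.normalize(embeddings, dim=1)   # unit-normalise so dot products are cosine similarities
    sim = z @ z.T / temperature          # pairwise similarity matrix, scaled by temperature

    # Exclude each sample's similarity with itself.
    self_mask = torch.eye(labels.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))

    # Positives are the other in-batch samples sharing the same label.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    # Log-softmax over each row, then average the log-probability of the positives.
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)   # guard against rows with no positives
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_count
    return loss.mean()


# Toy usage: four meme embeddings, two labelled hateful (1) and two non-hateful (0).
emb = torch.randn(4, 256)
lbl = torch.tensor([1, 1, 0, 0])
print(supervised_contrastive_loss(emb, lbl))
```

In a full training pipeline such a term would typically be combined with the model's instruction-tuning loss; the exact formulation used in InstructMemeCL is detailed in the dissertation.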
Together, these contributions advance AI systems that are not only accurate but also explainable and socially responsible, laying a foundation for more inclusive and trustworthy content moderation in online environments.
Speaker’s profile

Ming Shan is a fifth-year PhD candidate at the Singapore University of Technology and Design (SUTD), supervised by Prof Roy Ka-Wei Lee. He is a recipient of the SUTD President’s Graduate Fellowship and the Singapore Data Science Consortium (SDSC) Dissertation Research Fellowship 2024. His research has been published in top conferences such as WWW, ACM Multimedia, EMNLP, and ICLR. He also reviews for top-tier conferences and co-organised SSNLP 2024 and 2025. His research interests include Large Multimodal Models, Model Reasoning, Explainability, Interpretability, and Efficient Learning.