Beyond benchmarks: measuring and strengthening generalisable reasoning in large language models
Abstract
Reasoning—the ability to draw inferences, integrate evidence, and solve problems—is a cornerstone of human intelligence and remains a fundamental challenge in natural language processing (NLP). This thesis addresses critical questions surrounding the evaluation and enhancement of reasoning robustness, generalisability, and comprehensiveness in modern language models, particularly under realistic conditions involving noise, ambiguity, domain shifts, and multimodal inputs.
Our investigation is guided by two central research questions: (1) How can we design evaluation frameworks that transcend superficial correctness, diagnose deeper reasoning failures, and assess the robustness, generalisability, and breadth of reasoning skills in contemporary language models? and (2) How can we develop models capable of robust reasoning across diverse domains, resilient to spurious correlations, and effective over extended reasoning sequences?
To tackle these questions, we propose a suite of novel methodologies and evaluations spanning a variety of NLP and multimodal reasoning scenarios. We first introduce a comprehensive evaluation benchmark tailored specifically to instruction-tuned large language models (LLMs). This benchmark systematically examines a broad array of reasoning capabilities, from logical and commonsense reasoning to compositional and long-horizon inference tasks. Subsequently, we develop robust training techniques and counterfactual data augmentation methods explicitly designed to mitigate spurious correlations, thereby enhancing model resilience under distributional shifts and improving models' capacity for sustained, long-horizon reasoning.
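As a toy illustration of the general idea of counterfactual data augmentation (a minimal sketch, not the procedure developed in this thesis), the snippet below pairs each training example with a minimally edited counterfactual whose label flips, so that a model can no longer succeed by latching onto surface cues that stay constant across the pair. The cue words, label set, and helper functions shown here are hypothetical placeholders.

```python
# Illustrative sketch only: toy counterfactual augmentation for a
# sentiment-style task in which a cue word is spuriously correlated
# with the label. All names below are hypothetical examples.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    text: str
    label: str  # e.g. "positive" / "negative"


# Hypothetical minimal edits that flip the label-relevant content
# while leaving other surface features (length, topic) unchanged.
FLIP_MAP = {"excellent": "terrible", "terrible": "excellent"}
FLIP_LABEL = {"positive": "negative", "negative": "positive"}


def make_counterfactual(ex: Example) -> Optional[Example]:
    """Return a minimally edited counterfactual, or None if no edit applies."""
    tokens = ex.text.split()
    edited = [FLIP_MAP.get(t, t) for t in tokens]
    if edited == tokens:
        return None  # no label-relevant cue found to flip
    return Example(text=" ".join(edited), label=FLIP_LABEL[ex.label])


def augment(dataset: List[Example]) -> List[Example]:
    """Pair each original with its counterfactual so spurious cues that stay
    constant while the label changes no longer predict the label."""
    augmented = list(dataset)
    for ex in dataset:
        cf = make_counterfactual(ex)
        if cf is not None:
            augmented.append(cf)
    return augmented


if __name__ == "__main__":
    data = [Example("the plot was excellent", "positive")]
    for ex in augment(data):
        print(ex)
```

Training on such original-counterfactual pairs encourages the model to base its predictions on the edited, label-relevant content rather than on features shared by both members of each pair.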
This thesis makes three principal contributions: it provides unified methodologies for systematically evaluating reasoning in LLMs, proposes practical approaches for strengthening reasoning robustness across diverse domains, and expands the scope of reasoning research to complex, real-world applications involving multimodal and embodied interactions. Collectively, these advances deepen our understanding of how to construct language models that reason not only fluently, but also reliably and coherently across varied and challenging contexts.
Speaker’s profile

Hong Pengfei is a PhD candidate in the Information Systems Technology and Design (ISTD) pillar at the Singapore University of Technology and Design (SUTD), where he is a member of the DeCLaRe Lab. His research focuses on multimodal reasoning and the evaluation of large language models (LLMs). He began his doctoral studies in 2021, following a Bachelor’s degree in ISTD from SUTD.