AI metrics beyond performance: safety and trustworthiness of AI systems
Abstract
AI systems often face misalignment issues due to misspecified objectives and a lack of comprehensiveness in training datasets, leading to non-ideal behaviours such as unsafe command execution and biases related to race, gender, politics, and religion. These shortcomings, often overlooked in favour of task-specific validation metrics, compromise the safety and trustworthiness of AI systems, limiting their suitability for public and critical applications such as finance, healthcare, and recruitment. This thesis investigates critical non-idealities in AI systems, focusing on safety behaviour after training and alignment. We define the properties of ideal AI systems and establish safety metrics, introducing novel evaluation methods such as RED-EVAL (a prompt-based probe), Unalignment (a parametric probe), and Ruby-Teaming (a dynamic probe). To improve safety, we propose RED-INSTRUCT, which leverages RED-EVAL for enhanced alignment, and RESTA, which mitigates safety risks arising from downstream fine-tuning. In the context of trust, we explore gender bias in BERT by identifying gender directions that influence its predictions, and introduce a layer-wise principal component removal technique to obtain unbiased outputs. For explainable predictions, we propose a novel loss function based on Wasserstein Optimal Transport. Lastly, we analyse the identifiability of attention weights in Transformers, highlighting issues in the standard architecture that undermine explainability. To address these, we propose encoder-layer variants that enhance identifiability by decoupling the relationship between key and value vectors.
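To make the debiasing idea concrete, below is a minimal sketch of layer-wise principal component removal: estimate dominant gender directions from gendered difference vectors (e.g. hidden states of "he" minus "she" across many sentence templates) via PCA, then project those directions out of a layer's hidden states. The function names and the use of plain NumPy arrays in place of actual BERT hidden states are illustrative assumptions, not the thesis implementation.

    import numpy as np

    def gender_directions(diff_vectors, n_components=1):
        # Estimate dominant gender direction(s) as the top principal
        # components of the centred he/she difference vectors.
        centred = diff_vectors - diff_vectors.mean(axis=0, keepdims=True)
        _, _, vt = np.linalg.svd(centred, full_matrices=False)
        return vt[:n_components]  # shape: (n_components, hidden_dim)

    def remove_components(hidden, directions):
        # Project each hidden state onto the orthogonal complement
        # of the identified bias directions.
        h = hidden.copy()
        for d in directions:
            d = d / np.linalg.norm(d)
            h = h - np.outer(h @ d, d)  # subtract the projection onto d
        return h

    # Toy usage: debias one layer's hidden states (num_tokens, hidden_dim);
    # in the thesis setting this would be repeated at each encoder layer.
    rng = np.random.default_rng(0)
    diffs = rng.normal(size=(200, 768))   # stand-in for he/she difference vectors
    states = rng.normal(size=(16, 768))   # stand-in for a layer's hidden states
    debiased = remove_components(states, gender_directions(diffs))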
Speaker’s Profile
Rishabh is a PhD candidate at the Singapore University of Technology and Design (SUTD). He received his B.E. from the Birla Institute of Technology and Science, Pilani (India) in 2018. His research focuses on AI safety, trust, and efficient training.
