AI metrics beyond performance: safety and trustworthiness of AI systems
Abstract
AI systems often face misalignment issues due to misspecified objectives and a lack of comprehensiveness in training datasets, leading to non-ideal behaviours such as unsafe command execution and biases related to race, gender, politics, and religion. These shortcomings, often overlooked in favour of task-specific validation metrics, compromise the safety and trustworthiness of AI systems, limiting their suitability for public and critical applications such as finance, healthcare, and recruitment. This thesis investigates critical non-idealities in AI systems, focusing on safety behaviour after training and alignment. We define the properties of ideal AI systems and establish safety metrics, introducing novel evaluation methods such as RED-EVAL (a prompt-based probe), Unalignment (a parametric probe), and Ruby-Teaming (a dynamic probe). To improve safety, we propose RED-INSTRUCT, which leverages RED-EVAL for enhanced alignment, and RESTA, which mitigates safety risks arising from downstream fine-tuning. In the context of trust, we explore gender bias in BERT by identifying gender directions that influence its predictions, and introduce a layer-wise principal component removal technique to obtain unbiased outputs. For explainable predictions, we propose a novel loss function based on Wasserstein Optimal Transport. Lastly, we analyse the identifiability of attention weights in Transformers, highlighting issues in the standard architecture that undermine explainability. To address these, we propose encoder-layer variants that enhance identifiability by decoupling the relationship between key and value vectors.
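To make the debiasing idea concrete, below is a minimal sketch of layer-wise principal component removal: estimate dominant gender directions from gendered difference vectors (e.g. hidden states of "he" minus "she" across many sentence templates) via PCA, then project those directions out of a layer's hidden states. The function names and the use of plain NumPy arrays in place of actual BERT hidden states are illustrative assumptions, not the thesis implementation.

    import numpy as np

    def gender_directions(diff_vectors, n_components=1):
        # Estimate dominant gender direction(s) as the top principal
        # components of the centred he/she difference vectors.
        centred = diff_vectors - diff_vectors.mean(axis=0, keepdims=True)
        _, _, vt = np.linalg.svd(centred, full_matrices=False)
        return vt[:n_components]  # shape: (n_components, hidden_dim)

    def remove_components(hidden, directions):
        # Project each hidden state onto the orthogonal complement
        # of the identified bias directions.
        h = hidden.copy()
        for d in directions:
            d = d / np.linalg.norm(d)
            h = h - np.outer(h @ d, d)  # subtract the projection onto d
        return h

    # Toy usage: debias one layer's hidden states (num_tokens, hidden_dim);
    # in the thesis setting this would be repeated at each encoder layer.
    rng = np.random.default_rng(0)
    diffs = rng.normal(size=(200, 768))   # stand-in for he/she difference vectors
    states = rng.normal(size=(16, 768))   # stand-in for a layer's hidden states
    debiased = remove_components(states, gender_directions(diffs))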
Speaker’s Profile
Rishabh is a PhD candidate at the Singapore University of Technology and Design (SUTD). He received his B.E. from the Birla Institute of Technology and Science, Pilani (India) in 2018. His research focuses on AI safety, trust, and efficient training.
