AI agents are increasingly performing consequential tasks autonomously: writing code, making purchases, and providing advice. But how do we know when to trust them? Current evaluation focuses predominantly on success rates: how often does the agent complete the task? This misses critical questions about how agents behave: Do they give the same answer twice? Do they fail gracefully when conditions change? Can they tell us when they’re likely to be wrong? Drawing on decades of practice from aviation, nuclear power, and other safety-critical domains, we propose a framework that decomposes reliability into four dimensions: consistency, robustness, predictability, and safety. Evaluating 12 frontier AI models, we find a striking result: despite rapid capability improvements over 18 months, reliability has barely budged. Agents that are substantially more accurate remain inconsistent across runs and poorly calibrated about their own uncertainty. The implication is clear: building capable AI is not the same as building dependable AI. As agents take on higher-stakes tasks, we need evaluation practices that ask not just “does it work?” but “can we count on it?”
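To make the abstract's reliability dimensions a bit more concrete, here is a minimal illustrative sketch of how two of them might be quantified: run-to-run consistency (do repeated runs agree?) and calibration of self-reported confidence. The `run_agent` interface, the data format, and the metrics chosen are assumptions for illustration only, not the evaluation protocol used in the talk.

```python
# Hypothetical sketch of two reliability measurements suggested by the abstract:
# run-to-run consistency and calibration of self-reported confidence.
# The agent interface (`run_agent`) and inputs are illustrative assumptions,
# not the speaker's actual evaluation code.
from collections import Counter
from typing import Callable, List, Tuple


def consistency(run_agent: Callable[[str], str], task: str, n_runs: int = 5) -> float:
    """Fraction of repeated runs that agree with the most common answer."""
    answers = [run_agent(task) for _ in range(n_runs)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / n_runs


def expected_calibration_error(results: List[Tuple[float, bool]], n_bins: int = 10) -> float:
    """Average gap between stated confidence and observed accuracy, weighted by bin size.

    `results` pairs the agent's self-reported confidence in [0, 1] with
    whether its answer was actually correct.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, correct in results:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / len(results)) * abs(avg_conf - accuracy)
    return ece
```

In this sketch, a perfectly consistent agent scores 1.0 on consistency, and a perfectly calibrated one scores 0.0 on expected calibration error; an agent can have a high success rate while doing poorly on both, which is the gap the talk highlights.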
Bio:
Stephan Rabanser works on trustworthy machine learning, with a particular focus on uncertainty quantification, selective prediction, and out-of-distribution generalization/robustness. At a high level, his research aims to improve the reliability of machine learning systems under uncertainty and distribution shift. Rabanser develops principled yet practical methods that help models understand what they know—and crucially, when they should abstain—whether by quantifying predictive uncertainty, deferring to expert models, or rejecting unfamiliar inputs. He also studies how models can generalize reliably under distribution shift, with applications ranging from out-of-distribution detection and time series anomaly detection to robustness in federated learning. A recurring theme of his research is designing intelligent systems that remain trustworthy even under imperfect or adversarial conditions, such as privacy constraints, limited data, or non-stationary environments. His current research explores how uncertainty can be designed and leveraged in large generative models to support more reliable decision-making and safer deployment.
Rabanser holds a Ph.D. in computer science from the University of Toronto, an M.Sc. and a B.Sc. in informatics from the Technical University of Munich (TUM), and an Honours Degree in technology management from the Center for Digital Technology and Management (CDTM). In recent years, he has held engineering and research positions at Amazon/AWS AI Labs and Google. Previously, Rabanser was also a research visitor at the Massachusetts Institute of Technology (MIT), Carnegie Mellon University (CMU), and the University of Cambridge.
Rabanser’s Google Scholar webpage
In-person attendance is open to Princeton University faculty, staff and students.
This talk will be livestreamed and recorded. The recording will be posted to the CITP website, the Princeton University Media Central channel and the CITP YouTube channel.
If you need an accommodation for a disability, please contact Jean Butcher at butcher@princeton.edu at least one week prior to the event.
Sponsorship of an event does not constitute institutional endorsement of external speakers or views presented.