Mathematical Research at the University of Cambridge

 

AI systems undergo thorough evaluation before deployment, validating their predictions against a ground truth that is typically assumed to be fixed and certain. In many domains, however, such as medical applications, the ground truth is often curated in the form of differential diagnoses provided by multiple experts. While a single differential diagnosis reflects the uncertainty in one expert's assessment, multiple experts introduce another layer of uncertainty through potential disagreement.
 
In this talk, I will argue that ignoring this uncertainty leads to overly optimistic estimates of model performance, and hence to underestimated risk for particular diagnostic decisions and to unanticipated failure modes. We propose a statistical aggregation approach in which we infer a distribution over the probabilities of the underlying candidate medical conditions themselves, based on the observed annotations. This formulation naturally accounts both for potential disagreement between experts and for the uncertainty within individual differential diagnoses, thereby capturing the full ground truth uncertainty. We conclude that, while assuming a crisp ground truth may be acceptable for many AI applications, a more nuanced evaluation protocol should be used in medical diagnosis. If time permits, I will also cover some work based on conformal methods, which can provide statistical guarantees.
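To give a flavour of the kind of aggregation the abstract describes, here is a minimal sketch, assuming a simple Dirichlet-categorical model in which each expert's differential diagnosis contributes soft pseudo-counts over the candidate conditions; the formulation presented in the talk may differ, and all numbers below are hypothetical.

    import numpy as np

    # Hypothetical setting: 3 candidate conditions, 4 experts.
    # Each row is one expert's differential diagnosis, expressed
    # as a probability vector over the candidates (rows sum to 1).
    annotations = np.array([
        [0.7, 0.3, 0.0],   # expert 1
        [0.6, 0.4, 0.0],   # expert 2
        [0.1, 0.8, 0.1],   # expert 3 disagrees with 1 and 2
        [0.5, 0.4, 0.1],   # expert 4
    ])

    # Treat the differentials as pseudo-counts added to a uniform
    # Dirichlet prior, giving a posterior over the condition
    # probabilities themselves (a distribution on the simplex).
    alpha = np.ones(3) + annotations.sum(axis=0)

    # Sampling plausible ground truths exposes both inter-expert
    # disagreement and within-differential uncertainty, instead of
    # collapsing everything to a single crisp label.
    rng = np.random.default_rng(0)
    plausible_truths = rng.dirichlet(alpha, size=1000)
    print("posterior mean:", alpha / alpha.sum())
    print("posterior std: ", plausible_truths.std(axis=0))

Evaluating a model against such samples of plausible ground truths, rather than against one fixed label, yields the less optimistic performance estimates the abstract argues for. For the closing remark on conformal methods, a standard split conformal construction illustrates the type of statistical guarantee involved; this is a generic sketch on simulated data, not the specific method from the talk.

    import numpy as np

    rng = np.random.default_rng(1)
    n_cal, n_classes, miscoverage = 500, 3, 0.1

    # Simulated calibration data: model probabilities and labels.
    probs_cal = rng.dirichlet(np.ones(n_classes), size=n_cal)
    labels_cal = np.array([rng.choice(n_classes, p=p) for p in probs_cal])

    # Nonconformity score: 1 minus the probability of the true class.
    scores = 1.0 - probs_cal[np.arange(n_cal), labels_cal]

    # Finite-sample-corrected quantile of the calibration scores.
    level = np.ceil((n_cal + 1) * (1 - miscoverage)) / n_cal
    q = np.quantile(scores, level, method="higher")

    # Prediction set for a new case: every condition whose score falls
    # below the threshold; by the split conformal guarantee it contains
    # the true condition with probability at least 1 - miscoverage.
    probs_test = rng.dirichlet(np.ones(n_classes))
    prediction_set = np.where(1.0 - probs_test <= q)[0]
    print("prediction set:", prediction_set)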
 
Based on joint work with David Stutz, Melih Barsbey, Alan Karthikesalingam, Arnaud Doucet, and many others.

Further information

Time:

13 August 2025, 11:50 to 12:25

Venue:

Seminar Room 1, Newton Institute

Speaker:

Taylan Cemgil (Google DeepMind Technologies Limited)

Series:

Isaac Newton Institute Seminar Series