
Mathematical Research at the University of Cambridge


With the deluge of digitized information in the Big Data era, massive datasets are becoming increasingly available for learning predictive models. However, in many situations, poor control of the data-acquisition process may jeopardize the outputs of machine-learning algorithms, and selection-bias issues are now the subject of much attention. The accuracy of facial-recognition algorithms in biometrics applications, for instance, has recently been fiercely debated, with monitoring over time sometimes revealing predictive performance far below what was expected at the end of the training stage. The use of machine-learning methods to design medical diagnosis/prognosis support tools is currently raising the same type of concern.

Making the enthusiasm for, and confidence in, what machine learning can accomplish durable requires revisiting practice and theory at the same time. It is precisely the purpose of this talk to explain, and to illustrate through real examples, how to extend Empirical Risk Minimization, the main paradigm of statistical learning, when the training observations are biased, i.e. drawn from distributions that may differ significantly from that of the data at the test/prediction stage. As expected, there is ‘no free lunch’: practical, theoretically grounded solutions do exist in a variety of contexts (e.g. training examples composed of censored/truncated/survey data), but their implementation crucially depends on the availability of relevant auxiliary information about the data-acquisition process.

One should also bear in mind that ‘bias’ in machine learning, as perceived by the general public, also refers to situations where the predictive error is highly unequal across the population, i.e. where predictive algorithms are much less accurate for certain population segments than for others. If some facial-recognition algorithms make more mistakes for certain ethnic groups, for instance, representativeness issues in the training data should not be blamed alone: the variability in error rates may be due just as much to the intrinsic difficulty of certain recognition problems as to the limitations of state-of-the-art machine-learning technology. As will be discussed in this talk, trade-offs between fairness and predictive accuracy then become unavoidable.
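The abstract does not spell out the correction mechanism, but the classical remedy in this setting is importance-weighted ERM: when the probability that an observation enters the training sample is known (the ‘relevant auxiliary information’ mentioned above), reweighting each training loss by the inverse of that probability yields an unbiased estimate of the test risk. The Python sketch below is only a schematic illustration; the selection mechanism, weights and all variable names are assumptions made for the demonstration, not the speaker's actual construction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Schematic illustration (assumed setup, not the speaker's method):
# importance-weighted ERM when each example is kept in the training
# set with a KNOWN probability pi(x) relative to the test population.

rng = np.random.default_rng(0)

# Test-time population: x ~ N(0, 1), y = 1{x + noise > 0}.
n = 5000
x = rng.normal(size=n)
y = (x + 0.5 * rng.normal(size=n) > 0).astype(int)

# Biased acquisition: examples with large x are over-sampled.
pi = 1.0 / (1.0 + np.exp(-2.0 * x))   # known selection probability
keep = rng.random(n) < pi
x_tr, y_tr, pi_tr = x[keep], y[keep], pi[keep]

# Inverse-probability weights (Horvitz-Thompson style) make the
# weighted empirical risk an unbiased estimate of the test risk.
w = 1.0 / pi_tr

naive = LogisticRegression().fit(x_tr[:, None], y_tr)
weighted = LogisticRegression().fit(x_tr[:, None], y_tr, sample_weight=w)

print("naive    coef/intercept:", naive.coef_[0][0], naive.intercept_[0])
print("weighted coef/intercept:", weighted.coef_[0][0], weighted.intercept_[0])
```

The weights are computable here only because the selection mechanism pi is assumed known; when it must itself be estimated from auxiliary data, the guarantees weaken accordingly, which is precisely the ‘no free lunch’ caveat of the abstract.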

Further information

Time:

8 July 2025
10:30 to 11:30

Venue:

Seminar Room 2, Newton Institute

Speaker:

Stephan Clémençon (Telecom Paris, Institut Polytechnique de Paris)

Series:

Isaac Newton Institute Seminar Series