Features: Faculty Insights

## Richard Samworth, Professor of Statistical Science and Director of the Statistical Laboratory at the Department of Pure Mathematics and Mathematical Statistics, has been elected Fellow of the Royal Society. He is being recognised for his fundamental contributions to the development of modern statistical methodology and theory.

"I was enormously honoured when I found out [I was elected]," says Samworth. "It's very exciting to be a small part of such a respected institution with an incredible history. The Royal Society plays a very important role in advocating for science, and that is something I am hoping to contribute to."

### Brave new Big Data world

Samworth's work as a statistician has direct relevance to many areas of our lives. Most of it involves developing new statistical methods, as well as statistical theory, to tackle modern data challenges. "One of the ways the subject has evolved in recent years is in the size of data sets that are routinely collected, in areas like genetics, medical imaging, particle physics, and many others," he says. "This creates a demand for new techniques, because the traditional ones may be too slow or perform very poorly in this brave new Big Data world."

People have recognised the way in which Big Data can change their lives — from being able to diagnose and treat cancer more effectively to designing algorithms that underpin driverless car technology. Richard Samworth

An example of the type of work Samworth is doing involves one of the most fundamental questions in statistics: what kind of distribution does a given data set come from? To illustrate this with a very simple example, imagine you want to understand the annual income distribution in a country. Ideally, you would like to draw a curve which tells you, for each possible value of what someone can earn in a year, what proportion of people earn that amount. If you can describe this curve by a neat mathematical formula, then all the better.

To approach this problem you collect data on people's annual income and aim to find the distribution that best fits the data. One way of doing this is to assume that the corresponding curve has a particular pre-determined shape. For example, you might assume the data follows the famous bell shape of the normal distribution, which is described by a nice mathematical formula. The exact location and width of the curve can then be estimated from your data. This is what is called a parametric method because what you are doing is estimating the parameters of a given type of distribution.

"[The parametric approach] is very easy to apply, but the shape of the distribution is highly constrained, and this wouldn't be appropriate for many types of data," says Samworth.

The bell shape of the normal distribution and a histogram representing data that is approximately described by the distribution. Figure Shishirdasika, CC BY-SA 3.0.

Another approach is to draw a histogram. You divide the possible values for annual income up into bands, such as £0-£1,000, then £1,000-£2,000, etc, and count the proportion of people whose income falls into each of these bins. This could capture skewness in the distribution, for instance, but you could end up with quite a different estimate if you changed the width of the bins.

"These non-parametric or smoothing methods are much more flexible in terms of their shape, but tend to leave you with an annoying choice of how much you want to smooth, for example by changing the width of the bins," says Samworth.

Together with colleagues Samworth has developed a method that takes the best of both worlds, called log-concave density estimation. "It's both a flexible method and one that is fully automatic for the practitioner to use, and it's got a beautiful theory associated with it as well."

Samworth's method was of course designed for data sets that are a lot more complex than our example. They can involve vast numbers of individual data points and involve many different variables.

### The Statistics Clinic

Mathematical theory underlies much of Samworth's work, but there is also a practical aspect to what he does at the Statistical Laboratory: for twelve years he has been running a regular Statistics Clinic where anyone from the University of Cambridge can show up with their statistical problem and get advice from a team of experts.

"One of the things I have found interesting since I have set this up is to see just how wide a range of subjects people come from to receive their statistical help," Samworth says. "We get a lot from the life sciences, and I would have expected that. But we also get people from education, or history, or music — pretty much every subject that is being studied at the University. That's really fascinating!"

That musicologists need statistical help may seem baffling at first, but it makes sense when you realise that, like a DNA sequence or the annual incomes of people, a piece of music can be regarded as a sequence of data points. The problem that musicologists form the University of Cambridge brought to the statistics clinic involved finding patterns within a piece of music that would reveal the composer who had written it.

The statistics clinic is great for anyone who is not an expert, but there's benefit for statisticians too. "It's also a great way for me to train my PhD students and postdocs," Samworth told us in a previous interview. "It takes real skill to distil the essence of someone's problem in a way that you can understand, and to then propose a solution. It's also really important to have the ability to communicate the advice in a way that's understandable to the practitioner you are talking to. That might mean recommending a rather different method [than you first had in mind], depending on the ability level of the person you are speaking to."

### Big Data for big challenges

Samworth was drawn to his field of work because it addresses challenges he thinks are important, and also suits his personal skills. "I like the fact that the methods I develop can be used in many different application domains, but the theory is grounded in the rigour of mathematics," he says.

"In many ways I have been very lucky. The subject is now much hotter than when I started my career because people have recognised the way in which Big Data can change their lives — from being able to diagnose and treat cancer more effectively to designing algorithms that underpin driverless car technology."

There's probably never been a better time to do statistics than now, so it seems right that an outstanding representative of this field is among this year's new Fellows of the Royal Society. Congratulations!