
Summer Research Programmes

 

This is a list of Research in CMS projects from summer 2018.

Structured high-dimensional changepoint estimation

Contact Name Richard Samworth and Tengyao Wang
Contact Email r.samworth@statslab.cam.ac.uk, T.Wang@statslab.cam.ac.uk
Company Name Statistical Laboratory
Address Wilberforce Road, Cambridge, CB3 0WB
Period of the Project 8 weeks, from second week of July onwards
Project Open to Undergraduates
Deadline to Register Interest 3 March
Brief Description of the Project

These days, it is very common to monitor many attributes simultaneously over time. Times at which the underlying data-generating mechanism changes, known as changepoints, are of particular interest in many applications. For instance, in internet security monitoring, changepoints may indicate a distributed denial of service attack, while in functional Magnetic Resonance Imaging (fMRI) studies, a changepoint may indicate significant neurological activity. Frequently, changes occur only in a relatively sparse subset of the attributes: in the case of stock price data, for example, market shocks may well affect only a particular industry sector.

A recent method, called 'inspect', has been proposed to estimate sparse changes in the mean vector of a high-dimensional data stream (Wang and Samworth, 2018). The aim of this project is to generalise the 'inspect' methodology to take advantage of known or hypothesised structures in the mean changes. Examples include spatial clustering of changes in fMRI or other images, and sector grouping in financial time series.

This project would be particularly suitable for a statistician who has just finished Part II (though it may be accessible to a very strong Part IB student too). It has the potential to involve a mixture of methodological development, generalisation of existing theoretical results, implementation of algorithms in R and applications to real-world datasets, though the weights of this mixture can be tailored to the skills and interests of the student involved.

Wang, T., Samworth, R. J. (2018) High-dimensional change point estimation via sparse projection. J. Roy. Statist. Soc., Ser. B, 80, 57–83.
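The 'inspect' method itself extracts a sparse leading left singular vector of the CUSUM matrix; as a rough illustration of the underlying idea only, the sketch below (Python/NumPy, with an entirely illustrative threshold and simulated data) soft-thresholds the CUSUM transformation coordinate-wise and estimates the changepoint from the column of largest norm.

```python
import numpy as np

def cusum_transform(X):
    # CUSUM transformation of a p x n data matrix X: column t compares the
    # mean of the first t observations with the mean of the remaining n - t.
    p, n = X.shape
    S = np.cumsum(X, axis=1)                   # running row sums
    t = np.arange(1, n)                        # candidate changepoints
    left = S[:, :-1] / t                       # means of X[:, :t]
    right = (S[:, -1:] - S[:, :-1]) / (n - t)  # means of X[:, t:]
    return np.sqrt(t * (n - t) / n) * (left - right)

def estimate_changepoint(X, thresh=1.0):
    # Soft-threshold the CUSUM matrix to suppress coordinates with no
    # change, then locate the column of largest Euclidean norm.
    T = cusum_transform(X)
    T_soft = np.sign(T) * np.maximum(np.abs(T) - thresh, 0.0)
    return int(np.argmax(np.linalg.norm(T_soft, axis=0))) + 1

# Simulated example: p = 50 series, n = 200, change at time 100 in 3 coordinates.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
X[:3, 100:] += 2.0
print(estimate_changepoint(X))  # close to 100
```

Here only 3 of the 50 coordinates change, so the soft-thresholding step damps the noise coordinates before the columns are aggregated; this is the sparsity idea that 'inspect' exploits properly, via a sparse projection rather than this cruder aggregation.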

Skills Required Statistics; some linear algebra, analysis and probability. Enthusiasm for a genuine research project.
Skills Desired Previous experience of R

 

How Long Before We Are There?

Contact Name Herbert E. Huppert
Contact Email heh1@cam.ac.uk
Company Name DAMTP
Address Wilberforce Road, Cambridge, CB3 0WB
Period of the Project 8 weeks
Project Open to Undergraduates
Deadline to Register Interest 3 March
Brief Description of the Project This project would be stimulating for anyone interested in solving nonlinear partial differential equations numerically, as they arise in Biology, Earth Sciences, Elasticity, Fluid Mechanics and beyond. The point is that there are a huge number of nonlinear PDEs that do not have analytic solutions. Often one can find similarity solutions, which reduce the number of independent variables, but this still leads, in general, to a nonlinear equation whose solutions are independent of the initial conditions. But what role do they play? It is generally stated that the similarity solution agrees with the (not determined) exact solution when some variable, t say, obeys t >> t_1. But what is t_1? How does it depend on the initial conditions? How large must t be for the similarity solution to be within 15, 10, 5, 1, 0.1, ... per cent of the real solution? And how does this depend on the parameters of the problem? The student will consider a series of typical, fundamental problems using a new numerical program and approach I have developed. A conscientious effort over the summer should lead to two first-class scientific papers to be published in top-quality journals.
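The flavour of the question can be seen even in a linear toy problem. The sketch below (Python/NumPy; the project itself concerns nonlinear PDEs and the supervisor's own program, so this is purely illustrative) evolves the heat equation u_t = u_xx from a box-shaped initial condition and measures how quickly the numerical solution approaches the point-source similarity solution M exp(-x^2/4t)/sqrt(4 pi t).

```python
import numpy as np

def heat_fd(u0, dx, dt, steps):
    # Explicit (FTCS) finite differences for u_t = u_xx with zero
    # boundary values; stable when dt/dx**2 <= 1/2.
    u = u0.copy()
    r = dt / dx**2
    for _ in range(steps):
        u[1:-1] += r * (u[2:] - 2.0*u[1:-1] + u[:-2])
    return u

# Grid and a box-shaped initial condition of mass M.
x = np.linspace(-30.0, 30.0, 1201)
dx = x[1] - x[0]
dt = 0.4 * dx**2
u = np.where(np.abs(x) <= 0.5, 1.0, 0.0)
M = u.sum() * dx

def similarity(xx, t):
    # Point-source (self-similar) solution of the heat equation.
    return M / np.sqrt(4.0*np.pi*t) * np.exp(-xx**2 / (4.0*t))

# March to successive times and record the sup-norm error relative to
# the similarity solution's peak: a crude empirical estimate of t_1.
errors = {}
t = 0.0
for t_target in (1.0, 4.0, 16.0):
    steps = int(round((t_target - t) / dt))
    u = heat_fd(u, dx, dt, steps)
    t += steps * dt
    err = np.max(np.abs(u - similarity(x, t))) / similarity(0.0, t)
    errors[round(t, 2)] = err
print(errors)  # the error shrinks as t grows: the similarity solution takes over
```

For the box initial condition the discrepancy decays roughly like the ratio of the initial width squared to t, so tabulating the error against t, and against the parameters of the initial condition, gives exactly the kind of t_1 information the project asks about, here for a problem where the answer can be checked analytically.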
Skills Required Undergraduate understanding of nonlinear partial differential equations; ability to augment and write code to numerically solve numerous differential equations.
Skills Desired Ambition and pleasure in undertaking an exciting research project; compatibility with (many) other members of the group; desire to write influential papers which will have wide uptake.

 

Practical aspects of $k$ nearest neighbour methods

Contact Name Richard Samworth (project to be jointly supervised with Tom Berrett)
Contact Email r.samworth@statslab.cam.ac.uk
Company Name Statistical Laboratory
Address Wilberforce Road, Cambridge, CB3 0WB
Period of the Project 8 weeks, from 7 July
Project Open to Undergraduates
Deadline to Register Interest 3 March
Brief Description of the Project

Nearest neighbour methods are a family of techniques whose wide-ranging influence can be felt in many areas of data science. Their use in statistics dates back at least as far as Fix and Hodges (1951), in which the search for a fully nonparametric classification rule led the authors to propose a $k$-nearest neighbour classifier and density estimator. A large part of the popularity of nearest neighbour methods is undoubtedly due to their simplicity and analytic tractability, which make efficient practical implementation possible and open up the challenge of providing a thorough theoretical understanding. For their use in recent work on entropy estimation, classification and independence testing see Berrett, Samworth and Yuan (2018), Cannings, Berrett and Samworth (2017) and Berrett and Samworth (2017).

As with many nonparametric techniques there is a tuning parameter, in this case $k$, whose value may affect the performance of the procedures significantly. Heuristically speaking, in many applications $k$ can be seen as controlling a bias-variance trade-off, where larger values of $k$ result in larger bias and smaller values of $k$ result in larger variance. Theoretical results provide some insight into the optimal choice of $k$ (e.g. Cannings, Berrett and Samworth, 2017), but in practice $k$ is usually chosen heuristically or empirically, often by cross-validation (where possible).

For this project the student would investigate the practical considerations of nearest neighbour methods in one or more of entropy estimation, classification and independence testing. This will principally involve the choice of $k$, but could also include choice of weights in weighted nearest neighbour methods or a study of whether standardising the data (prewhitening) reduces error. One further possibility is to explore the use of approximate nearest neighbours and the inherent trade-off between computational cost and statistical accuracy (e.g. Muja and Lowe, 2014).

References: Berrett, T. B. and Samworth, R. J. (2017) Nonparametric independence testing via mutual information. Available at arXiv:1711.06642.

Berrett, T. B., Samworth, R. J. and Yuan, M. (2018) Efficient multivariate entropy estimation via $k$-nearest neighbour distances. Ann. Statist., to appear.

Cannings, T. I., Berrett, T. B. and Samworth, R. J. (2017) Local nearest neighbour classification with applications to semi-supervised learning. Available at arXiv:1704.00642.

Fix, E. and Hodges, J. L. (1951) Discriminatory analysis – nonparametric discrimination: Consistency properties. Technical Report number 4, USAF School of Aviation Medicine, Randolph Field, Texas.

Muja, M. and Lowe, D. G. (2014) Scalable nearest neighbour algorithms for high dimensional data. IEEE Trans. Pattern Anal. Mach. Intell., 36, 2227–2240.
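To make the tuning question concrete, here is a minimal sketch (Python/NumPy; the project would more naturally use R, and the toy data and fold count are illustrative only) of a $k$-nearest neighbour classifier with $k$ chosen by cross-validation, the empirical approach mentioned above.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    # Majority vote among the k nearest training points (Euclidean
    # distance); labels are assumed to be 0/1.
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]
    return (y_train[nn].mean(axis=1) > 0.5).astype(int)

def cv_choose_k(X, y, ks, n_folds=5, seed=0):
    # Pick k minimising the cross-validated misclassification error.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errs = []
    for k in ks:
        err = 0.0
        for f in folds:
            mask = np.ones(len(y), dtype=bool)
            mask[f] = False
            pred = knn_predict(X[mask], y[mask], X[f], k)
            err += np.mean(pred != y[f])
        errs.append(err / n_folds)
    return ks[int(np.argmin(errs))], errs

# Toy two-class Gaussian example with 200 points per class.
rng = np.random.default_rng(1)
n = 200
X = np.vstack([rng.normal(0.0, 1.0, (n, 2)), rng.normal(1.5, 1.0, (n, 2))])
y = np.repeat([0, 1], n)
k_hat, errs = cv_choose_k(X, y, ks=[1, 3, 5, 9, 15, 25])
print(k_hat, errs)
```

Plotting the cross-validated error against $k$ for data of this kind displays the bias-variance trade-off directly: very small $k$ overfits the noise, while very large $k$ oversmooths the class boundary.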

Skills Required Statistics and real analysis, together with enthusiasm to learn about nonparametric statistics
Skills Desired Some previous experience of R would be desirable, but not essential

 

Oracle complexity for convex optimization on the real line

Contact Name Hamza Fawzi
Contact Email hf323@cam.ac.uk
Company Name DAMTP
Address DAMTP F0.16, Centre for Mathematical Sciences, Wilberforce Road, Cambridge CB3 0WA
Period of the Project 8 weeks between late June and 30 September
Project Open to Undergraduates, Part III (master's) students, PhD Students
Deadline to Register Interest 3 March
Brief Description of the Project The goal of the project is to obtain a tight complexity bound for the problem of minimising a convex function on the real line, where complexity is measured by the number of queries to the derivative of the function. It is known that the number of queries needed is a constant multiple of log(1/epsilon), where epsilon is the desired accuracy; however, the exact constant is unknown. The goal of the project is to find the best possible constant.
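The standard upper-bound construction is bisection on the sign of the derivative: each query halves the interval known to contain a minimiser, so roughly log_2(1/epsilon) queries suffice. The sketch below (Python; illustrative only, since this naive scheme need not achieve the optimal constant the project seeks) counts the queries explicitly.

```python
def minimise_convex(fprime, lo=0.0, hi=1.0, eps=1e-6):
    # Bisection on the sign of the derivative of a convex function on
    # [lo, hi]: if f'(mid) > 0 the minimiser lies to the left, if
    # f'(mid) < 0 to the right.  Terminates once the interval has
    # length at most eps, after about log2((hi - lo)/eps) queries.
    queries = 0
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        g = fprime(mid)
        queries += 1
        if g > 0:
            hi = mid
        elif g < 0:
            lo = mid
        else:
            return mid, queries   # f'(mid) = 0: mid is a minimiser
    return 0.5 * (lo + hi), queries

# Example: f(x) = (x - 0.3)**2 on [0, 1], so f'(x) = 2*(x - 0.3).
x_star, q = minimise_convex(lambda x: 2.0*(x - 0.3), eps=1e-6)
print(x_star, q)  # x_star within 1e-6 of 0.3, after 20 queries
```

With eps = 1e-6 this uses exactly 20 queries, matching ceil(log_2(10^6)); the project's question is whether, and by how much, the constant in front of log(1/epsilon) can be improved upon, and what lower bound any derivative-query algorithm must obey.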
Skills Required  
Skills Desired