2017 CMP Industrial Project Proposals

This is a list of CMP industrial project proposals from summer 2017.

Starting price analysis in sports betting
Application of Multi-Stage Stochastic Optimization in Energy and Finance
21st Century Solutions for Urban Environment
Autoregressive Models to Understand Longitudinal Drug Effect
Modern Dimensionality Reduction in Financial Dataseries
Graph Understanding using Graphlets
A New Economic Model for Patents
Real World Magic Squares
Developing natural language processing capacity
Target Discrimination using Short Range Radar
Adaptive Active Learning for Lead Optimisation
Investigate state of the art of computational approaches for alignment and co-registration of multiple imaging modalities
Development of an unsupervised machine learning framework in Python language for multi-parametric data analysis and visualization.
Trend interpretation and prognostics
Quality assessment of image alignment methods for Single-particle electron cryo-microscopy and electron tomography
Improved target discrimination for the UK weather radar network
Real-time Mapping of Turbulence
Application of topological data analysis to drug discovery
Low-frequency mutation detection using circulating tumour DNA
Simulations of Localization
Synthetic Data with the Simulacrum Project
Detection of cancer recurrence using treatment data
Can the volume and type of outpatient appointments attended predict when or whether a patient will be diagnosed with cancer?
The Data Management and Data Linkage Project Proposal
Data Science Business Internship - Product Analytics

Starting price analysis in sports betting

Contact Name:	Paul Doust
Contact email:	paul@pauldoust.com
Company:	A private gambling syndicate
Contact Address:	52 Duncan Terrace Islington London N1 8AG
Period of the Project:	Summer 2017
Brief Description of Project:	In sports betting, starting prices are the odds for the different possible outcomes of an event that are determined just before the event starts. For example, the betting exchange betfair.com has its own method for determining starting prices; see https://promo.betfair.com/betfairsp/FAQs_theBasics.html. The project will involve analysing data relating to the betfair.com starting price. The host for this project completed his PhD in theoretical high energy physics in DAMTP in 1987, had a career as a trader and quantitative analyst in the financial markets, and now runs a successful sports gambling syndicate.
Skills Required:	The project will require the student to use computers to analyse data, so computer skills are essential. Regarding mathematics, probability and statistics are the key disciplines. A keen desire to apply computer and maths skills to real life situations is also fundamental.
Skills Desired:	The gambling syndicate uses the C# programming language coupled with Excel. A familiarity with those environments would be very useful, or at least an ability to learn them quickly.
Project Open to:	Part III (master's) students, PhD Students
Deadline to register interest:	Monday 20th February 2017


Application of Multi-Stage Stochastic Optimization in Energy and Finance

Contact Name:	Robert Doubble
Contact email:	robert.doubble@bp.com
Company:	BP Oil International Ltd
Contact Address:	BP Oil International 20 Canada Square London E14 5NJ
Period of the Project:	8-16 weeks
Brief Description of Project:	Multi-stage stochastic optimization plays a central role in the valuation of energy assets or contracts under uncertainty. Examples of applications of strong commercial interest to BP Supply & Trading include the valuation of natural gas swing options and physical storage, power plant dispatch and portfolio optimization. These problems are challenging to solve as their complexity grows exponentially with the number of nodes used to represent the stochastic price process on a discrete lattice, and with the dimensionality or granularity of the stochastic control. Stochastic Dual Dynamic Programming (SDDP), first introduced by Pereira and Pinto (1991), addresses the curse of dimensionality associated with the control variable. The Quantitative Analytics Team within BP Supply & Trading is interested in investigating methods to solve multi-stage stochastic optimization problems of this type. The initial focus would be on gaining an understanding of the SDDP method applied to continuous problems (such as swing options and natural gas storage), and then extending the study to consider integer problems (for example portfolio optimization and power plant dispatch).
[1] J.R. Birge and F. Louveaux (2000), Introduction to Stochastic Programming, Springer, New York
[2] A. Shapiro, D. Dentcheva and A. Ruszczynski (2009), Lectures on Stochastic Programming: Modelling and Theory
[3] J. Bonnans, Z. Cen and T. Christel (2011), Energy Contracts Management by Stochastic Programming Techniques
[4] J. Zou, S. Ahmed and X. Sun (2016), Nested Decomposition of Multistage Stochastic Integer Programs with Binary State Variables
Skills Required:	Knowledge of Stochastic Control and Operations Research; experience of coding numerical algorithms
Skills Desired:	Stochastic calculus and probability; experience of using Python or a similar language
Project Open to:	Undergraduates, Part III (master's) students, PhD Students
Deadline to register interest:	end-Feb
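
To make the lattice formulation in the description concrete, here is a minimal sketch of plain backward induction for a swing option on a recombining binomial price lattice. It is not SDDP; it is the kind of baseline whose cost grows with the number of lattice nodes and control states. The contract form (one right exercisable per date, payoff max(S - K, 0), no discounting) and all parameter names are assumptions made for illustration only.

```python
def swing_option_value(s0, strike, up, down, p_up, n_periods, n_rights):
    """Value a simple swing option by backward induction (illustrative only).

    Exercise dates are t = 0 .. n_periods - 1. At each date the holder may use
    at most one of the remaining rights to receive max(price - strike, 0).
    Discounting is ignored to keep the sketch short.
    """
    # next_values[j][k]: value at the following date, at the node reached by
    # j up-moves, with k rights left. After the last date nothing remains.
    next_values = [[0.0] * (n_rights + 1) for _ in range(n_periods + 1)]
    for t in range(n_periods - 1, -1, -1):
        values = []
        for j in range(t + 1):
            price = s0 * up ** j * down ** (t - j)
            payoff = max(price - strike, 0.0)
            node = [0.0] * (n_rights + 1)
            for k in range(n_rights + 1):
                # Expected value if no right is used at this node.
                best = p_up * next_values[j + 1][k] + (1 - p_up) * next_values[j][k]
                if k > 0:
                    # Expected value if one right is used now.
                    cont_use = (p_up * next_values[j + 1][k - 1]
                                + (1 - p_up) * next_values[j][k - 1])
                    best = max(best, payoff + cont_use)
                node[k] = best
            values.append(node)
        next_values = values
    return next_values[0][n_rights]


if __name__ == "__main__":
    print(swing_option_value(100.0, 100.0, 1.1, 0.9, 0.5, n_periods=2, n_rights=1))
```

The state here is (date, lattice node, rights remaining), so the work is proportional to the number of lattice nodes times the number of control states. That scaling is what makes richer controls, such as storage levels, expensive to handle this way.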


21st Century Solutions for Urban Environment

Project Title:	21st Century Solutions for Urban Environment
Contact Name:	Stephane Chretien
Contact email:	Stephane.Chretien@npl.co.uk
Company:	NPL
Contact Address:	Hampton Road Teddington TW11 0LW
Period of the Project:	8 weeks
Brief Description of Project:	We are seeking students to explore solutions to increase the resilience, quality of life or economic performance of urban areas by integrating environmental, infrastructure monitoring and satellite data with data from other sources. The focus is on better defining and solving problems through new ways of combining data.
Skills Required:	Time series of different natures (discrete and continuous) often appear in the monitoring of various ground characteristics using satellite observation. Modelling these time series and their dependencies is a big challenge which can be addressed using graphical models, a topic of renewed interest due to the recent advances in sparsity-based statistical estimation. Sparsity has been a topic of extensive research in recent years following the breakthrough results of Wilkinson prize winner Candes and Fields medal winner Tao in 2004, which had a tremendous impact in machine learning (Compressed Sensing, Netflix problem, Phase reconstruction, super resolution, etc) and applied mathematics (fast solvers for PDEs, approximation of functions of many variables, etc). The studentship will be a unique opportunity to apply sparse estimation theory to concrete graphical models and provide a proof of concept for the use of satellite data in the study of ground movements and subsidence.
Skills Desired:
Project Open to:	Part III (master's) students, PhD Students
Deadline to register interest:


Autoregressive Models to Understand Longitudinal Drug Effect

Contact Name:	Dr James Yates
Contact email:	james.yates@astrazeneca.com
Company:	AstraZeneca
Contact Address:	Hodgkin Building, Chesterford Research Park, Cambridge, CB10 1XL. United Kingdom.
Period of the Project:	8 weeks
Brief Description of Project:	Understanding the dose and time dependencies of drug effect is crucial to the discovery and development of new medicines. The discipline of pharmacokinetic-pharmacodynamic (PKPD) modelling has grown out of this need. PKPD modelling aims to develop models reflective of pharmacology, and as such there are “standard” models that are used in day-to-day research. These tend to be ODE based. One such family of models, “turnover models”, has been shown to have a link with autoregressive (AR) statistical models (Xu, Wang, & Vermeulen, 2011). The advantage here is that the AR model is significantly cheaper computationally than the original ODE model, and the link provides extra insight into the model behaviour. In this project it is proposed to extend this observation to models that contain mechanisms of drug tolerance (Gabrielsson & Hjorth, 2015) and model longitudinal dose-response data (Gabrielsson & Peletier, 2014), in particular the effects of anti-cancer drugs on tumour growth (Marusic, Bajzer, Vuk-Pavlovic, & Freyer, 1994). AR model equivalents or approximations will be explored and applied to data generated in AstraZeneca’s labs.
Gabrielsson, J., & Hjorth, S. (2015). Pattern Recognition in Pharmacodynamic Data Analysis. The AAPS Journal. http://doi.org/10.1208/s12248-015-9842-5
Gabrielsson, J., & Peletier, L. A. (2014). Dose-response-time data analysis involving nonlinear dynamics, feedback and delay. European Journal of Pharmaceutical Sciences, 59(1), 36–48. http://doi.org/10.1016/j.ejps.2014.04.007
Marusic, M., Bajzer, Z., Vuk-Pavlovic, S., & Freyer, J. P. (1994). Tumor growth in vivo and as multicellular spheroids compared by mathematical models. Bull Math Biol, 56(4), 617–631. http://doi.org/10.1007/BF02460714
Xu, X. S., Wang, H., & Vermeulen, A. (2011). Modeling delayed drug effect using discrete-time nonlinear autoregressive models: a connection with indirect response models. Journal of Pharmacokinetics and Pharmacodynamics, 38(3), 353–367. http://doi.org/10.1007/s10928-011-9197-1
Skills Required:	Ordinary differential equations; some knowledge of statistics, including autoregressive models
Skills Desired:	Experience programming with MATLAB or R
Project Open to:	Part III (master's) students, PhD Students
Deadline to register interest:	3 March
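
As an illustration of the link between turnover models and AR models mentioned in the description, the sketch below takes the simplest linear turnover model, dR/dt = k_in - k_out R with constant rates, and shows that sampling its exact solution at a fixed interval gives an AR(1) recursion. This is a textbook special case, not necessarily the construction used by Xu et al.; the function names and parameter values are hypothetical.

```python
import math


def ar1_coefficients(k_in, k_out, dt):
    """AR(1) form R[n+1] = phi * R[n] + c of dR/dt = k_in - k_out * R."""
    phi = math.exp(-k_out * dt)
    c = (k_in / k_out) * (1.0 - phi)
    return phi, c


def simulate_ar1(r0, phi, c, n_steps):
    """Iterate the AR(1) recursion: one multiplication and addition per step."""
    values = [r0]
    for _ in range(n_steps):
        values.append(phi * values[-1] + c)
    return values


def simulate_ode_euler(r0, k_in, k_out, dt, n_steps, substeps=10000):
    """Reference solution: fine-grained explicit Euler integration of the ODE."""
    h = dt / substeps
    r = r0
    values = [r0]
    for _ in range(n_steps):
        for _ in range(substeps):
            r += h * (k_in - k_out * r)
        values.append(r)
    return values


if __name__ == "__main__":
    # Hypothetical rates: steady state is k_in / k_out = 20.
    phi, c = ar1_coefficients(k_in=10.0, k_out=0.5, dt=1.0)
    ar = simulate_ar1(5.0, phi, c, n_steps=10)
    ode = simulate_ode_euler(5.0, 10.0, 0.5, dt=1.0, n_steps=10)
    for a, b in zip(ar, ode):
        print(f"{a:.4f}  {b:.4f}")
```

The AR recursion needs one update per sampling interval, whereas the Euler reference needs thousands of sub-steps, which is the computational saving in miniature. Drug-dependent rates or tolerance mechanisms would make the recursion nonlinear and are not covered by this sketch.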


Modern Dimensionality Reduction in Financial Dataseries

Contact Name:	James Bridgwater
Contact email:	Cambridge.recruitment@symmetryinvestments.com
Company:	Symmetry Investments
Contact Address:	Symmetry Investments UK LLP 2nd Floor, 15 Sackville Street London, W1S 3DJ
Period of the Project:	8-10 weeks over summer 2017
Brief Description of Project:	We are looking for an intern to work in the Analytics group at Symmetry (currently comprising three mathematicians, two based in London, one in Hong Kong) on a project of direct business relevance. We expect the project to take place over at least 8-10 weeks in the summer, based predominantly in the Symmetry Investments London office. The purpose of the group is to improve the analytical toolkit available to traders for the study of markets and data, which involves research and coding in areas such as pricing models, statistical analysis and optimisation. The project is a first step towards a more sophisticated use of principal components analysis (PCA) by the fund. In this context PCA is a valuable statistical tool for handling the datasets we encounter every day. Essentially, the technique is to project a high dimensional dataset onto a suitable low dimensional subspace, chosen so as to maximise the information content preserved by the projection (in some sense which can be made more precise). A standard and typical example would be to use an eigen-decomposition of the covariance matrix of the data, choosing the most significant eigenvectors based on the [...]. One obvious issue with eigen-techniques for PCA is that the results are extremely sensitive to outliers in the initial dataset, and in particular to noise. Another issue is that eigen-techniques do not scale well to large dimensions, but work most efficiently on small to medium scale problems. Recent developments in the field have led to techniques that are more scalable and more robust to noisy data. The aim of this project is to review some of the recent literature in this area, to identify and implement a few of the more interesting techniques, and to apply these algorithms to financial datasets of importance to the business.
Skills Required:	No specific programming languages are required; the successful candidate will be able to pick up enough during the course of the project.
Skills Desired:	The successful candidate will have an interest in the application of mathematical techniques to the financial world and will be personable, willing and able to engage with the broader team; there will be opportunity to interact with traders and others directly involved in the business of the fund. We will be looking for a presentation of results and conclusions towards the end of the project.
Project Open to:	Part III (master's) students, PhD Students
Deadline to register interest:
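
The eigen-decomposition approach described above can be sketched in a few lines. To stay dependency-free, this example finds only the leading principal component, using power iteration on the sample covariance matrix rather than a full eigen-solver; that substitution, the function names and the toy data are all assumptions for illustration.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix (n - 1 denominator) of a list of equal-length rows."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue and unit eigenvector of a PSD matrix.

    The start vector is uniform; this fails in the contrived case where it is
    orthogonal to the leading eigenvector.
    """
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
        eigenvalue = norm
    return eigenvalue, v


if __name__ == "__main__":
    # Toy data lying close to the line y = 2x (hypothetical values).
    data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
    cov = covariance_matrix(data)
    lam, vec = leading_eigenpair(cov)
    explained = lam / sum(cov[i][i] for i in range(len(cov)))
    print(vec, explained)
```

Replacing one of the toy points with a large outlier and re-running is a quick way to see the sensitivity to outliers that the description mentions.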


Graph Understanding using Graphlets

Contact Name:	K Hermiston
Contact email:	kjhermiston@dstl.gov.uk
Company:	Defence Science and Technology Laboratory
Contact Address:	Room 246, Building 5, Defence Science and Technology Laboratory, Porton Down, Salisbury, SP4 0JQ
Period of the Project:	12 weeks
Brief Description of Project:	A key component of future analysis capability will be the ability to understand and analyse vast datasets automatically, both at rest and streaming, using Big Data technology. A stressing mode of Big Data processing is that of graph analytics in view of the large dataset size and non-local connectivity. We seek graph analytic methods that are scalable to massive graphs. Graph analytics is of most relevance where data sets comprise both entities (graph vertices) and relationships (graph edges). They typically occur in network datasets such as social media, computer network traffic, spatial flow systems (e.g. transport networks), bioinformatics, semantic models and natural language inference. Graph analytics has application in a number of Defence areas including cyber defence and humanitarian relief. The project will focus on a theme of future urban analysis, where large datasets capture the ebb and flow of normal population activity, often periodic. Within the datasets reside changing structures, symmetries and dynamics that provide a narrative of urban life at different levels of granularity. The main challenge of this student project will be to learn, apply and advance the understanding and theory of the global and local structure within large, undirected graphs using graphlet approaches [1, 2]. Additionally, the utility of graphlets to analyse the ‘role’ of nodes on graphs will be examined. Graphlets may be informally defined as small components of graphs that provide a structural-basis onto which large graphs may be decomposed. Formally, they are connected, induced, non-isomorphic subgraphs, with the edges between graphlet vertices only present if they are present in the large graph. Traditionally, the counting of graphlets within graphs has required the explicit counting of their embeddings on that graph. 
However, it is of interest to understand and exploit the group-theoretic symmetries (automorphism orbits) of graphlets to reduce the computational complexity of their counting. The project will seek to implement and advance graphlet analysis methods over a selected dataset, drawn from open sources [3] and using open source code [4, 5].
References
1. Hocevar T., Demsar J., Combinatorial algorithm for counting small induced graphs and orbits, 2016 pre-print. Available from https://arxiv.org/abs/1601.06834?context=cs [accessed 10th January 2017].
2. Hocevar T., Demsar J., A combinatorial approach to graphlet counting, Bioinformatics, Vol. 30 no. 4, 2014, pages 559–565. Available from http://bioinformatics.oxfordjournals.org/content/30/4/559.full.pdf+html [accessed 10th January 2017].
3. Stanford Network Analysis Project. Available from https://snap.stanford.edu/data/index.html [accessed 10th January 2017].
4. Graphlet C++ source code. Available from http://www.biolab.si/supp/orca/orca.html [accessed 10th January 2017].
5. SageMath, software package. Available from http://www.sagemath.org/ [accessed 11th January 2017].
Skills Required:	The task will require some prior knowledge of algebraic graph theory and require coding skills in C++ and/or Python. Due to the requirements of the host site, the opportunity is only open to UK nationals.
Skills Desired:	The successful applicant will be working in a collaborative environment with opportunities to share and acquire technical knowledge with others. We would also like the student to provide a short presentation of their work at the end of their placement. Team-working and communication skills are desired.
Project Open to:	Part III (master's) students, PhD Students
Deadline to register interest:	24th February 2017
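
A minimal sketch of graphlet counting for the two connected 3-node graphlets (the open path and the triangle) in an undirected graph is given below. It first counts them by explicitly enumerating vertex triples, then recovers the path count from vertex degrees and the triangle count alone, a small instance of the combinatorial shortcuts the description refers to. It does not reproduce the orbit-counting method of the cited papers; the function names and the example graph are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Adjacency sets for an undirected simple graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def count_three_node_graphlets(adj):
    """Brute force: inspect the induced subgraph on every vertex triple."""
    triangles = 0
    paths = 0
    for a, b, c in combinations(sorted(adj), 3):
        n_edges = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        if n_edges == 3:
            triangles += 1
        elif n_edges == 2:
            paths += 1  # exactly two edges on three vertices is an induced path
    return {"path": paths, "triangle": triangles}


def count_paths_from_degrees(adj, triangles):
    """Induced 3-node paths = sum over v of C(deg(v), 2), minus 3 per triangle."""
    wedges = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    return wedges - 3 * triangles


if __name__ == "__main__":
    # A triangle 0-1-2 with a tail 2-3-4.
    graph = build_adjacency([(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)])
    counts = count_three_node_graphlets(graph)
    print(counts, count_paths_from_degrees(graph, counts["triangle"]))
```

The brute-force loop is cubic in the number of vertices, while the degree-based identity needs only one pass over the adjacency sets once the triangles are known.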


A New Economic Model for Patents

Contact Name:	Terence Kealey
Contact email:	terence.kealey@buckingham.ac.uk
Company:	Department of Economics, University of Buckingham
Contact Address:	Department of Economics University of Buckingham Buckingham MK18 1EG
Period of the Project:	8 weeks
Brief Description of Project:	Note: This is an open-ended project. Last year's project was well-defined but this is more open-ended. Introduction: Patents have traditionally been modelled as important for economic growth. This is because economic growth has been modelled as being based on developments in science, which in turn have been modelled as being funded sub-optimally by the market. But whereas the empirical evidence shows that developments in science do indeed underpin economic growth, the existence or absence of patent laws seems not to affect rates of either scientific or economic growth (M Boldrin, D Levine, 2010, Against Intellectual Monopoly, CUP.) A new model for patents is therefore required. Background: The development post-war of game theoretic models of science as a public good (see the references in T Kealey, M Ricketts, 2014, Modelling science as a contribution good, Research Policy, 43: 1014-1024, open access) provided a public-interest justification for patents, namely that under the market inventors cannot capture the full benefits of their inventions, so they will underinvest in R&D unless they are provided with the temporary monopoly of patents. But we have now modelled science as a contribution good (because the full benefits of science cannot be captured by free-riders, only by active scientists) under which investors may not need the incentive of patents before investing. The project: Under the market, advances in science are translated into economic growth by first-mover advantage, which provides a temporary (if depreciating) monopoly of asymmetry of information and tacit knowledge. We would like to model that against a model of patents, to determine which - under the model of science as a public good - yields the socially-optimal outcome. The project will therefore involve modelling the incentive to do research under first mover advantage against modelling the incentive under patents. 
The student: We envisage that the student would work with - or even at - the Department of Economics in Buckingham (which would supply accommodation for the length of the project). Last year we hosted a student under this scheme, Mr Sebastian Damrich, who has kindly agreed to discuss how he felt about working with us (sebastian.damrich@gmx.de). We are now writing up his work (on crowding out) as a paper, to be submitted to Research Policy. Though we would hope to be as successful again, the project this year is more open-ended, so we cannot be sure it will translate into a stand-alone paper; we trust it will lay the basis for one.
Skills Required:	Mathematical modelling, computer simulation and differential calculus as illustrated by Kealey & Ricketts (see above) and Sebastian Damrich's project (which will be made available to any interested candidates).
Skills Desired:	A willingness to extend mathematical analysis into a social science project.
Project Open to:	Part III (master's) students
Deadline to register interest:


Real World Magic Squares

Contact Name:	Michael Wilson
Contact email:	mawilson1@dstl.gov.uk
Company:	Defence Science and Technology Lab (DSTL)
Contact Address:	Building X76 Fort Halstead Sevenoaks Kent TN14 7BP
Period of the Project:	6-8 weeks
Brief Description of Project:	WITHOUT COMMITMENT This is an open-ended project which might have an approximate solution; the added value derives from how close the student gets to a useful answer (or to demonstrating that there are no realistic solutions). The conceptual model for this project is the Magic Square (MS). Many variants exist for this mathematical toy. We wish to devise a means of constructing an MS based around an initial set of “non-magical” values. For example, given an initial n x n grid of unknown values, is it possible to construct an (n+b) x (n+b) grid around the initial data where any linear path traced through the grid has the same product as any other? To develop this problem further: a) the ultimate model is a volume of 500 – 2000 cubed; b) the linear paths through the 3D model can be from nearly any direction; c) what are the limitations on the value of b? X-ray absorption along a path of length t through a material with a linear absorption coefficient u, where u is dependent on the energy of the x-ray photon, is given by I = Io exp(-ut). Where a sample is illuminated by a polychromatic x-ray beam, the intensity at a point on the detector can be determined by integrating the absorption along the path from source to detector for each material and each x-ray energy. This project will investigate means of determining the material properties or material that must be added to the extremities of a path such that any path generates the same or similar value for total x-ray absorption.
Skills Required:
Skills Desired:	The student might need MATLAB or similar skills. An awareness of 3D modelling software such as Volume Graphics and the tools they use might also be useful.
Project Open to:	Undergraduates, Part III (master's) students, PhD Students
Deadline to register interest:	3 March
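
The absorption law quoted in the description, I = Io exp(-ut), can be sketched as follows. It is summed over a discrete set of photon energies and over the materials along one path. A second helper handles the single-energy case of the padding question: how much of a padding material to add to each path so that all paths have the same total attenuation. The discretised spectrum, the function names and every numerical value are assumptions for illustration; the coefficients are not real material data.

```python
import math


def transmitted_intensity(spectrum, path):
    """Detector intensity for one path under a discretised polychromatic beam.

    spectrum: {energy: incident intensity Io at that energy}
    path: list of (mu_by_energy, thickness) segments, where mu_by_energy maps
          each energy to the linear absorption coefficient of that material.
    """
    total = 0.0
    for energy, i0 in spectrum.items():
        # Exponents add along the path: sum of u * t over the segments.
        optical_depth = sum(mu[energy] * t for mu, t in path)
        total += i0 * math.exp(-optical_depth)
    return total


def padding_thickness(path_depths, mu_pad):
    """Single-energy case: padding that raises every path to the largest depth.

    path_depths: total u * t already accumulated on each path.
    mu_pad: absorption coefficient of the padding material at that energy.
    """
    target = max(path_depths)
    return [(target - depth) / mu_pad for depth in path_depths]


if __name__ == "__main__":
    spectrum = {50: 0.6, 80: 0.4}            # hypothetical two-line spectrum
    material_a = {50: 0.50, 80: 0.30}        # hypothetical coefficients
    material_b = {50: 0.20, 80: 0.15}
    print(transmitted_intensity(spectrum, [(material_a, 2.0), (material_b, 1.0)]))
    print(padding_thickness([1.0, 0.4, 0.7], mu_pad=0.2))
```

With a polychromatic beam the padding that equalises one energy will generally not equalise the others, which is one reason the description allows for the "same or similar" total absorption.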


Developing natural language processing capacity

Contact Name:	Maciej Hermanowicz
Contact email:	maciej.x.hermanowicz@gsk.com
Company:	GSK
Contact Address:	GlaxoSmithKline Gunnels Wood Road Stevenage SG1 2NY
Period of the Project:	8-12 weeks
Brief Description of Project:	Pharmaceutical companies store large quantities of experimental data in the form of electronic Lab NoteBooks (eLNBs), which serve as a central repository of experiments carried out within the company. These eLNBs offer a vast unstructured dataset with considerable potential for data science. The successful candidate will develop proof-of-concept natural language processing (NLP) techniques aimed at extracting experimental parameters and biochemical contexts from PDF documents. Proof-of-concept work will be conducted on published research papers and will involve recovery of information stored in both raw text and tables. This information will be verified against curated public databases such as the Protein Data Bank or ChEMBL.
References:
Mallory, Zhang, Ré, and Altman, 2016, Large-scale extraction of gene interactions from full text literature using DeepDive, Bioinformatics, 32(1), 106-113, DOI: 10.1093/bioinformatics/btv476
McEntire et al., 2016, Application of an automated natural language processing (NLP) workflow to enable federated search of external biomedical content in drug discovery and development, Drug Discovery Today, 21(5), 826-835, DOI: 10.1016/j.drudis.2016.03.006
Skills Required:	Familiarity with a high-level programming language; excellent communication skills
Skills Desired:	Familiarity with Python; familiarity with natural language processing techniques; interest in drug discovery research
Project Open to:	Part III (master's) students, PhD Students
Deadline to register interest:


Target Discrimination using Short Range Radar

Contact Name:	Dr Nigel Davidson
Contact email:	ndavidson@dstl.gov.uk
Company:	Dstl
Contact Address:	Dstl Fort Halstead Sevenoaks Kent TN14 7BP
Period of the Project:	8 weeks within June to September
Brief Description of Project:	The task is to design a strategy to differentiate between targets from data collected by a short range radar array. Multiple datasets of the received signals will be available from tests against different targets and using the same target in different orientations. The data will be divided between a group of data where the targets are identified and a group of data where the targets are blind. The datasets of identified targets are to be used to generate algorithms for target identification. The algorithms should then be used on the blind data group with the aim of outputting a prediction of the target type. The scope of the project, size of datasets and required facilities will be finalised in discussion with the student. If the work is undertaken at Dstl Fort Halstead the student will need to be a UK national and undergo security clearance. However, these restrictions will not be necessary if the student has access to suitable university computer facilities. Completion of the project at Cambridge is preferable due to the time required to process clearance.
Skills Required:	Knowledge of signal processing methods.
Skills Desired:	Machine learning, pattern recognition.
Project Open to:	Undergraduates, Part III (master's) students, PhD Students
Deadline to register interest:	3 March


Adaptive Active Learning for Lead Optimisation

Contact Name:	Darren Green
Contact email:	darren.vs.green@gsk.com
Company:	GlaxoSmithKline
Contact Address:	GSK Medicines Research Centre Gunnels Wood Road Stevenage Herts SG1 2NY
Period of the Project:	June-September (flexible)
Brief Description of Project:	Lead optimisation in the pharmaceutical industry is a classic "design-make-test" cycle. Starting with a small amount of available data, new chemicals must be designed to optimise towards a desired multi-parameter profile. Some parameters have high-throughput tests, allowing the early generation of predictive models through machine learning. Other parameters have lower-throughput or more expensive tests, leading to smaller or sparser data. Active Learning methods are emerging as a potential approach to improve the efficiency of multi-parameter optimisation [1]. This project will investigate the development of adaptive approaches which will guide the balance of Explore:Exploit strategies in the Active Learning cycle as more data and improved predictive models are produced. [1] Reker D, Schneider G. (2015) Active-learning strategies in computer-assisted drug discovery. Drug Discovery Today 20:458–465
Skills Required:	R, Python
Skills Desired:
Project Open to:	Part III (master's) students, PhD Students
Deadline to register interest:
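
One simple way to picture an adaptive Explore:Exploit balance is to rank candidate compounds by predicted value plus a weighted uncertainty bonus, and to shrink that weight as labelled data accumulates. The sketch below does this with an upper-confidence-bound style score. The scoring rule, the decay schedule, the function names and the candidate values are all hypothetical and are not taken from the cited review or from any GSK method.

```python
import math


def exploration_weight(n_labelled, beta0=2.0):
    """Hypothetical schedule: explore less as labelled data accumulates."""
    return beta0 / math.sqrt(1.0 + n_labelled)


def select_next(candidates, n_labelled, batch_size=1):
    """Pick the candidates with the highest mean + weight * uncertainty.

    candidates: list of (name, predicted_mean, predicted_std) tuples, as might
    come from any predictive model that reports an uncertainty estimate.
    """
    beta = exploration_weight(n_labelled)
    ranked = sorted(candidates, key=lambda c: c[1] + beta * c[2], reverse=True)
    return [name for name, _, _ in ranked[:batch_size]]


if __name__ == "__main__":
    pool = [("A", 0.80, 0.05), ("B", 0.60, 0.40), ("C", 0.70, 0.10)]
    print(select_next(pool, n_labelled=0))    # early: uncertainty dominates
    print(select_next(pool, n_labelled=99))   # later: predicted value dominates
```

With no labelled data the uncertain compound B is chosen; after 99 measurements the same pool yields the best-predicted compound A.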


Investigate state of the art of computational approaches for alignment and co-registration of multiple imaging modalities

Contact Name:	Jonathan McKinnell
Contact email:	jonathan.x.mckinnell@gsk.com
Company:	GSK
Contact Address:	GlaxoSmithKline Gunnels Wood Road Stevenage SG1 2NY
Period of the Project:	8-12 weeks
Brief Description of Project:	Assessment of disease progression and any potential toxic effects of a compound can be carried out using a variety of different imaging techniques. These datasets are typically analysed in isolation. Investigating state of the art methods for combining different imaging modalities would therefore be useful in drug discovery. The project will utilise publicly available datasets (such as MTBLS176: Benchmark for 3D MALDI- and DESI-Imaging Mass Spectrometry). This project has the following main aims:
- Investigation and literature review of the current state of the art computer vision approaches to automated co-registration of MALDI (Matrix-Assisted Laser Desorption/Ionization) imaging mass spectrometry data with e.g. histopathology image data.
- Investigation of possible scoring or disease-relevant parameter extraction techniques from co-registered images.
- Investigation of automated segmentation of different regions of the images, e.g. automated detection and labelling of different tissue structures.
- Investigation of the state of the art in the literature on automated labelling of abnormal (or diseased) regions in tissue and MALDI images.
Skills Required:	Coding skills such as Python, MATLAB or R; previous experience with computer vision and machine learning
Skills Desired:	Experience with imaging techniques used in the life sciences; experience analysing images in the life sciences using computer vision; experience in segmentation and co-registration of different imaging modalities
Project Open to:	Part III (master's) students, PhD Students
Deadline to register interest:


Development of an unsupervised machine learning framework in Python language for multi-parametric data analysis and visualization.

Contact Name:	Laura Acqualagna
Contact email:	laura.x.acqualagna@gsk.com
Company:	GSK
Contact Address:	GSK Medicines Research Centre Gunnels Wood Road, Stevenage, Hertfordshire, SG1 2NY, United Kingdom
Period of the Project:	8-12 weeks
Brief Description of Project:	Data related to chemical compounds and screening assays are intrinsically described by many parameters, e.g. physicochemical properties, transcript or protein levels, etc. Unsupervised machine learning provides a plethora of methods for exploratory analysis, such as dimensionality reduction methods for 2D and 3D visualization and for speeding up subsequent prediction algorithms, and clustering methods useful for identifying similarities within the data and helping with the integration of data of different natures. In such exploratory analysis the data are often not labelled and no ground truth is available. The patterns that we identify in the data can vary with the mathematical modelling (e.g. linear vs non-linear projections), making the choice of methods and the subsequent interpretation of results critical. Moreover, experimental data often have a complex nature; for example, the dataset can be sparse and characterized by a mixture of quantitative and categorical variables. Therefore, the purpose of the project is to develop an unsupervised machine learning framework based on Python open-source libraries, which will guide the scientist step by step in the exploration of the dataset, helping in the choice of sophisticated algorithms according to the nature of the dataset. Among the challenges of the project, there are several interesting scientific questions that the student can pick to pursue over a period of about 8-12 weeks:
1) Dataset exploration: what does my dataset look like?
a. Does the data matrix have full rank? Are there missing values, and, if so, are they missing at random or missing not at random? How would this information change the mathematical modelling?
b. Are data quantitative, categorical, binary, or a mixture of those? How would you treat them to finally achieve a unique representation of the dataset?
c. How are data distributed? How does this influence the choice of the ML method to use?
d. How would you implement in Python code some functions that would check the above questions?
2) Dataset manipulation: which algorithm to use?
a. Dimensionality reduction: linear vs non-linear methods.
b. Major variants of methods that deal with sparse data, e.g. probabilistic PCA, Sparse PCA etc.
c. Exploration of different missing data imputation methods.
d. Exploration of major variants of the non-linear projection algorithm Stochastic Neighbor Embedding (SNE).
e. Exploration of the suitability of a sparse representation of data.
f. Checking data overfitting: which kind of regularization would you use?
g. How would you implement in Python code some functions that would check the above questions?
3) Data interpretation and visualization:
a. Which methods allow the results to be easily interpretable by a biologist or a chemist naive with respect to ML?
b. Which methods allow a reliable reconstruction of the manipulated data into the original space (in which the variables have biological meaning)?
For any of the chosen scientific questions, the project foresees both the theoretical exploration and understanding of the methods in the form of a literature review, which would justify the choice of specific approaches to the data, and the implementation of Python scripts to evaluate the methods on real data provided internally.
Skills Required:	• Python coding skills. • Basic knowledge of unsupervised machine learning algorithms.
Skills Desired:	• Knowledge of biology or chemistry. • R and MATLAB coding skills.
Project Open to:	Part III (master's) students, PhD Students
Deadline to register interest:
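As an illustration of the dataset exploration questions in part 1 of the description, the checks could start from something like the sketch below. It is not part of the proposal: the function names (`is_missing`, `missing_fraction`, `column_kind`, `matrix_rank`) are hypothetical, it uses only the Python standard library, and it assumes the dataset is a small list of rows. A real framework would more likely build on the open-source libraries the proposal mentions.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_fraction(rows):
    """Fraction of missing entries in each column of a list of rows."""
    n_rows, n_cols = len(rows), len(rows[0])
    counts = [0] * n_cols
    for row in rows:
        for j, value in enumerate(row):
            if is_missing(value):
                counts[j] += 1
    return [count / n_rows for count in counts]


def column_kind(values):
    """Classify a column as 'binary', 'categorical' or 'quantitative'."""
    present = [v for v in values if not is_missing(v)]
    if present and all(v in (0, 1) for v in present):
        return "binary"
    if any(isinstance(v, str) for v in present):
        return "categorical"
    return "quantitative"


def matrix_rank(matrix, tol=1e-9):
    """Rank of a fully observed numeric matrix via Gaussian elimination."""
    m = [[float(x) for x in row] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Partial pivoting: use the largest remaining entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot: this column adds no new direction
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


# Hypothetical toy dataset: one quantitative, one categorical, one binary column.
rows = [
    [1.0, "a", 0],
    [None, "b", 1],
    [3.5, "c", 1],
    [float("nan"), "a", 0],
]
columns = list(zip(*rows))
kinds = [column_kind(col) for col in columns]
```

Whether values are missing at random cannot be decided by a check like this; that question needs a model of the missingness mechanism, which is one of the things the project asks the student to investigate.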

[Return to List]

Trend interpretation and prognostics

Contact Name:	Geoff Walker
Contact email:	geoff.walker@artesis.co.uk
Company:	Artesis LLP
Contact Address:	St John's Innovation Centre Cowley Road Cambridge CB4 0WS
Period of the Project:	8 weeks
Brief Description of Project:	Artesis LLP monitors the health of industrial equipment and advises its clients on what treatment is needed to avoid unexpected failures. This has some close analogies with the world of medicine: monitoring bodily health, and providing advice on treatment to cure or avoid diseases. We monitor equipment behaviour, identify changes, diagnose the cause, identify the severity of developing problems, and prescribe the recommended solution, all remotely and, as far as possible, automatically, using a wide range of mathematical techniques and algorithms. One of the hardest areas to get right is providing guidance on how soon any remedial work must be done. Simple trending can be difficult with noisy data, and we rarely have large volumes of historic failure data on which to base statistical norms. So the PMP project will focus on identifying the best mathematical techniques for prognosis given the characteristics of the available data. The project will involve researching, assessing and deploying a range of prognostic techniques, and then evaluating which technique is most appropriate for the type of data available to Artesis. This project follows on from work done in last year's PMP project by Will Boulton, and is in an adjacent area to a current PhD project being worked on by Ferdia Sherry looking at changepoint techniques, the results of which will be available by the start of this PMP project. The project is reasonably well-defined, with the key steps seen to be: 1. Review existing methods 2. Review existing Artesis data and characterise data sets 3. Select a small set of methods that appear to be appropriate to the nature of the data 4. Categorise and collate data into appropriate sub-groups for treatment methods 5. Try out the methods on the relevant data sets, probably involving some programming to apply the methods 6. Evaluate the outcomes, and identify and recommend the preferred method 7. If time allows, encode this preferred method, and demonstrate how it produces prognoses of failure dates, with confidence ratings, and how these confidence figures change as time progresses
Skills Required:	• Mathematician! • Someone willing to try new things. • Happy to read up on and research existing methods, and then try to apply them. • Some statistical skills and experience would be helpful. • Preparedness to accept the imperfect nature of the real world.
Skills Desired:	Software coding skills - or if not, a willingness and enthusiasm to learn them.
Project Open to:	Part III (master's) students, PhD Students
Deadline to register interest:	3 March
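To make the "simple trending" baseline in the description concrete, the sketch below fits a straight line to a monitored indicator by ordinary least squares and extrapolates to the time at which it would cross an alarm level. This is only an assumed starting point, not the Artesis method: the function names, the `threshold` parameter and the sample readings are hypothetical, and no confidence rating is produced.

```python
def fit_linear_trend(times, values):
    """Ordinary least-squares fit of values = slope * time + intercept."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    return slope, intercept


def time_to_threshold(times, values, threshold):
    """Time at which the fitted line reaches `threshold`, or None if it never does."""
    slope, intercept = fit_linear_trend(times, values)
    if slope <= 0:
        return None  # flat or falling trend: the line never rises to the threshold
    return (threshold - intercept) / slope


# Hypothetical readings of a degradation indicator, one per day.
days = [0, 1, 2, 3]
readings = [1.0, 3.0, 5.0, 7.0]
crossing_day = time_to_threshold(days, readings, threshold=21.0)
```

With noisy data the estimated crossing time can move substantially from one reading to the next, which is the difficulty the project sets out to address.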

[Return to List]

Quality assessment of image alignment methods for Single-particle electron cryo-microscopy and electron tomography

Contact Name:	Ala' Al-Afeef
Contact email:	ala.al-afeef@diamond.ac.uk
Company:	Diamond Light Source Ltd
Contact Address:	Diamond Light Source Ltd Diamond House Harwell Science & Innovation Campus Didcot Oxfordshire OX11 0DE
Period of the Project:	12 weeks
Brief Description of Project:	See http://www.diamond.ac.uk/Careers/Work-Placement/Summer-Placement.html?mg... for details. Direct information about the 3D structure of sub-cellular macromolecular complexes is crucial for understanding their functional organization. Electron microscopes allow imaging of biological matter down to sub-nanometer resolutions. Single-particle electron cryo-microscopy (SPEM) and electron tomography (ET) are capable of providing 3D analysis of nano-structures. In SPEM, many images of presumably identical copies of macromolecular complexes are combined to obtain 2D or 3D structural information. SPEM can provide an isotropic 3D reconstruction if a large number of particle images are used. However, SPEM is susceptible to pitfalls and can reconstruct erroneous features. This is mainly because the electron exposure needs to be limited to avoid radiation damage, which is why cryo-ET images typically present a very low signal-to-noise ratio and require robust image processing approaches. In Diamond, SPEM and EM work-flows enable the automatic on-line reconstruction of cryo-electron microscopy (cryo-ET) datasets. This aims to process the terabytes of images that are usually acquired in a typical cryo-ET session and to enable the e-Bic users to speed up the processing required by the computationally intensive SPEM routines. Although the aim is ambitious, there are many difficulties that limit the quality of the final reconstruction. Examples of such difficulties are the noisy 2D cryo-ET images and the accuracy of classification and image alignment. Among these, the accuracy of alignment of 2D images is a very important factor that significantly limits the fidelity and the resolution of the 3D reconstruction. The current workflows are based on a classical approach to image alignment. However, it is clearly recognised by the SPEM community that there are reasons to doubt the accuracy of the results achieved by current alignment procedures. Therefore, there are compelling reasons to look for more advanced methods and to conduct a quality assessment study of the state-of-the-art alignment methods that could be used to improve the accuracy of our SPEM and ET pipelines at Diamond.
Skills Required:	· Knowledge of programming in a modern programming language (Python is preferred) · Highly motivated and able to work effectively in a multidisciplinary team · Good communication and reporting skills
Skills Desired:	· Good mathematical background in Linear algebra · Experience in solving image processing problems · Experience in solving inverse problems · Experience in single particle analysis
Project Open to:	Undergraduates
Deadline to register interest:	5pm on Thursday February 9th, 2017

[Return to List]

Improved target discrimination for the UK weather radar network

Contact Name:	Tim Darlington
Contact email:	timothy.darlington@metoffice.gov.uk
Company:	Met Office
Contact Address:	Radar System R&D D-1 Fitzroy Rd Exeter EX1 2TE
Period of the Project:	8 weeks between late June and 30 September
Brief Description of Project:	The Met Office has been upgrading its weather radar network with dual polarisation technology. This gives us information on the type of target the radar is seeing, enabling us to do a much better job of removing non-meteorological echoes. Despite this improvement, there are still some targets that can break through into the processed data, with ships being a particular issue. The placement will investigate the use of the signature of ship signals in the time domain, for example using the temporal curvature of radar fields, to remove ship signals from the radar data. Other approaches to the problem would be considered if thought to be more appropriate.
Skills Required:	• Experience of Linux or Unix and computer programming. • Programming experience in Python or C++.
Skills Desired:	Experience of working with large data sets.
Project Open to:	Undergraduates, Part III (master's) students, PhD Students
Deadline to register interest:	31 March

[Return to List]

Real-time Mapping of Turbulence

Contact Name:	Malcolm Kitchen
Contact email:	malcolm.kitchen@metoffice.gov.uk
Company:	Met Office
Contact Address:	Met Office Fitzroy Rd Exeter EX1 3PB
Period of the Project:	8 weeks
Brief Description of Project:	Turbulence experienced by civil aircraft affects passenger comfort and, in extreme cases, passenger safety. Accordingly, considerable efforts are made to forecast the occurrence of turbulence. Unfortunately, the only observations of turbulence are currently reports by pilots. Real-time observations are badly needed to verify forecasts and initialise forecast models. The Met Office has a network of radio receivers around the UK intercepting navigational data broadcasts from civil aircraft in real time. These data include flight parameters such as aircraft roll angle, heading etc. The proposition is that a turbulence measure can be developed using a combination of these parameters. The project work would include exploring the literature concerning the impact of turbulence on flight parameters, developing some candidate turbulence measures, and testing and evaluating them using a historical dataset. Much of the basic programming required for this project has already been completed.
Skills Required:	Some knowledge of programming in Python (could be self-learning ahead of placement)
Skills Desired:	• Experience of programming in Python. • Interest in aviation.
Project Open to:	Part III (master's) students, PhD Students
Deadline to register interest:	3rd March
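One possible shape for a candidate turbulence measure is sketched below: the standard deviation of the broadcast roll angle over a sliding window. This is purely an assumption made for illustration, not a measure proposed by the Met Office. The function name, window length and sample values are hypothetical.

```python
from statistics import pstdev


def rolling_roll_variability(roll_angles, window):
    """Population standard deviation of roll angle over each sliding window."""
    if window < 2 or window > len(roll_angles):
        raise ValueError("window must be between 2 and the series length")
    return [
        pstdev(roll_angles[i:i + window])
        for i in range(len(roll_angles) - window + 1)
    ]


# Hypothetical roll-angle series in degrees: a smooth flight and a bumpy one.
smooth = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
bumpy = [0.0, 4.0, -4.0, 4.0, -4.0, 0.0]

smooth_measure = rolling_roll_variability(smooth, window=3)
bumpy_measure = rolling_roll_variability(bumpy, window=3)
```

A deliberate banking turn would also raise this value, so a usable measure would probably need to remove the intended manoeuvre first, for example by combining roll angle with heading.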

[Return to List]

Application of topological data analysis to drug discovery

Contact Name:	Nicola J Richmond
Contact email:	nicola.j.richmond@gsk.com
Company:	GlaxoSmithKline
Contact Address:	Medicines Research Centre GlaxoSmithKline Gunnels Wood Road Stevenage Herts SG1 2NY
Period of the Project:	8-10 weeks
Brief Description of Project:	Topological data analysis (TDA) is a fairly novel approach to deriving insights from high-dimensional, noisy and incomplete data, with roots in algebraic topology. The main tool of TDA is persistent homology, an adaptation of homology to point clouds which can be used to understand the shape of the data. A point cloud in n-dimensional Euclidean space is represented by a simplicial complex (Čech or Vietoris-Rips), obtained by centring n-balls of fixed radius r around each point and connecting those points whose n-balls have a non-empty intersection. Topological features of the simplicial complex that persist as r increases are more likely to represent true features of the data rather than artefacts associated with noise. TDA has been applied in a number of ways and to data sets across many fields. In the healthcare sector, TDA has proven successful at identifying associations between disease and genetic variation. At GSK, we are interested in exploring the potential of TDA as a data-mining approach, understanding when to apply TDA over other methods and knowing which open source software packages are most appropriate. This is a well-defined project evaluating TDA in the context of the drug discovery process.
Skills Required:	An understanding of mathematical concepts underpinning TDA and an interest in data science.
Skills Desired:	The above plus some experience at scripting in languages such as R or Python.
Project Open to:	Part III (master's) students, PhD Students
Deadline to register interest:	31st March 2017
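The idea of features that persist as r increases can be illustrated with the simplest topological feature, the number of connected components. The sketch below joins two points whenever their balls of radius r intersect (centres at most 2r apart, as in the description) and counts the components for several radii. It is a toy stand-in for persistent homology, not a replacement for the open source packages the project would evaluate. The function name and the point cloud are hypothetical.

```python
import math
from itertools import combinations


def count_components(points, r):
    """Connected components after joining points whose r-balls intersect."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        # Two balls of radius r intersect when the centres are at most 2r apart.
        if math.dist(points[i], points[j]) <= 2 * r:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})


# Hypothetical point cloud: two well-separated clusters in the plane.
cloud = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
components_by_radius = {r: count_components(cloud, r) for r in (0.1, 0.6, 8.0)}
```

Here the two-cluster structure survives over a wide range of radii, whereas the five isolated points exist only for very small r. That difference in persistence is what separates signal from noise in the description above.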

[Return to List]

Low-frequency mutation detection using circulating tumour DNA

Contact Name:	Dr James Brenton
Contact email:	james.brenton@cruk.cam.ac.uk
Company:	Cancer Research UK Cambridge Institute
Contact Address:	Robinson Way, Cambridge CB2 0RE
Period of the Project:	8 weeks between July 1 and September 30 (negotiable)
Brief Description of Project:	High-grade serous ovarian carcinoma is the most lethal female gynaecological cancer. It is usually diagnosed at late stage and survival after five years is less than 25%. Overall survival has not improved over the past 20 years owing to both the significant challenges for early diagnosis, and the extreme complexity of genomic change that has hindered novel therapy development. We have previously shown that mutations in the TP53 gene are ubiquitous in women with high-grade serous ovarian carcinoma and have used this finding to develop non-invasive "liquid biopsies" that can monitor disease burden and response. More recently, we have used estimates of TP53 allele fraction in cancer biopsies to rapidly derive copy number signatures from shallow whole genome sequencing that may provide a new molecular classification for personalized therapeutic strategies. Existing methods for calling mutations in TP53 do not perform well at low levels of detection owing to noise from sequencing enzymes, sample library preparation, as well as the overwhelming abundance of normal DNA which dilutes the signal from the tumour DNA. The objective of this project is to develop new and more accurate approaches for detecting TP53 mutations in circulating tumour DNA and low cellularity tumour biopsies using Bayesian and related statistical modelling approaches. In preliminary work we have established error models that are nucleotide specific within TP53 and have sequencing data from over 2000 cases from which further error models can be constructed. This project will provide strong training in the application of computational methods to clinically relevant problems in cancer genomics and how they can be implemented in clinical practice.
Skills Required:	R programming experience; Probabilistic modelling; Bayesian inference
Skills Desired:	Bioinformatics; Cancer biology
Project Open to:	Part III (master's) students
Deadline to register interest:	3 March 2017
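To show why low allele fractions are hard to call, the sketch below asks how likely it is that sequencing error alone would produce at least the observed number of variant reads at one position, assuming a known position-specific error rate and independent reads. This is a simple binomial tail calculation, not the Bayesian modelling the project proposes, and the function names, read depths and error rate are hypothetical.

```python
from math import exp, lgamma, log


def log_binomial_pmf(k, n, p):
    """Log of P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log(1 - p)


def error_only_tail(variant_reads, depth, error_rate):
    """P(at least `variant_reads` erroneous reads out of `depth`) under the error model."""
    return sum(
        exp(log_binomial_pmf(k, depth, error_rate))
        for k in range(variant_reads, depth + 1)
    )


# Hypothetical position: 1000 reads, assumed error rate of 1 in 1000.
one_read = error_only_tail(1, 1000, 0.001)   # a single variant read is unremarkable
ten_reads = error_only_tail(10, 1000, 0.001)  # ten variant reads are hard to explain by error
```

Working in log space keeps the calculation stable at the read depths typical of targeted sequencing, where the binomial coefficients are too large for floating-point arithmetic.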

[Return to List]

Simulations of Localization

Contact Name:	Edmund Fordham
Contact email:	fordham1@slb.com
Company:	Schlumberger Gould Research
Contact Address:	High Cross Madingley Road Cambridge CB3 0EL
Period of the Project:	8 weeks
Brief Description of Project:	Nuclear Magnetic Resonance is established technology in petroleum well-logging and many other applications to sedimentary rocks, and to other porous media of economic importance. A ubiquitous phenomenon is the presence of induced internal fields caused by contrasts in magnetic susceptibility. Diffusion within such internal fields (or within externally applied gradients in MRI and similar applications) results in additional attenuation of detected magnetization which can be difficult to quantify. Three asymptotic regimes are recognised, of which the so-called "Localization" regime is the most destructive, and least understood. Typically arising in equipment with strong static magnetic fields, detected signal is "localized" to small parts of the fluid-filled pore space. In exploring this phenomenon, we have had some success with numerical simulations in the FE package COMSOL, and are able to simulate the time development of the volume magnetization fields in the archetype process of NMR spin echo formation, in irregular pore spaces. This open-ended project is to benchmark the COMSOL simulations against the (rather few) known asymptotic results for simple geometries, and to establish their regions of validity. Many extensions are possible and limited only by available time. Prior experience with the COMSOL package would be desirable but is not essential.
Skills Required:
Skills Desired:	Experience with the Finite-Element modelling package COMSOL
Project Open to:	Part III (master's) students, PhD Students
Deadline to register interest:

[Return to List]

Synthetic Data with the Simulacrum Project

Contact Name:	Kathryn Lawrence
Contact email:	kathryn.lawrence@healthdatainsight.co.uk
Company:	Health Data Insight
Contact Address:	CPC4 Capital Park Cambridge Road Fulbourn Cambridgeshire CB21 5XE
Period of the Project:	2-3 months between June and September.
Brief Description of Project:	We are seeking a new perspective for the testing phase of the Simulacrum project which, as a synthetic data project built on confidential patient data, will require acceptance from the patient, academic and researcher communities. Some combination of the following, depending on intern interests, will be involved, or an alternative project if proposed by the intern and mutually agreed. • Directly supporting inquiries into cancer data, real and simulated, by project partners. • Evaluating and applying measures of simulation fidelity and disclosure risk. • Visualisation and data journalism from simulated and real data. • Management of publicly available knowledge of Cancer Registry datasets via Simulacrum. • Improving Simulacrum user experience.
Skills Required:	The project involves analysing datasets containing anonymised personal information, so information governance training will be provided. Creativity and an interest in cancer research are expected. Other skills, such as statistics and probability or SQL and working with databases, depending on the project desired.
Skills Desired:	An interest in statistical theory, probability and machine learning. Experience using Matlab, SQL or statistical software.
Project Open to:	Undergraduates, Part III (master's) students, PhD Students
Deadline to register interest:	Midnight on Friday 31st March 2017

[Return to List]

Detection of cancer recurrence using treatment data

Contact Name:	Kathryn Lawrence
Contact email:	kathryn.lawrence@healthdatainsight.co.uk
Company:	Health Data Insight
Contact Address:	CPC4 Capital Park Cambridge Road Fulbourn Cambridgeshire CB21 5XE
Period of the Project:	2-3 months between June and September.
Brief Description of Project:	Recurrence is one of the most difficult problems in cancer. As cancer treatment and survival improve, there has been a growing interest in understanding subsequent relapse and recurrence of the disease. Recommendation 90 of the cancer taskforce stated the need for accurate data to be collected on recurrence for all patients. We are looking for a talented mathematician, physicist or quant to use learnings from previous and ongoing recurrence work to develop a mathematical cluster-based approach that uses treatment data in the patient pathway to identify recurrences in the cancer population. Current work looks at a small cohort of patients and focuses on the presence of one of the following events any time after an initial period of 180 days from diagnosis: • Curative Chemotherapy • Curative Radiotherapy • Major Surgery • Multi-Disciplinary team. An algorithm is being produced using this information to try to identify potential recurrences in this population. As an extension, we would like to investigate the potential of a mathematical clustering approach to enhance this work. The project will take the findings of the current work and develop a more robust set of rules based on the entire patient pathway, incorporating all available treatment/event data on a much wider patient cohort (multiple tumour sites). This should create a systematic approach to the identification of recurrences using treatment data.
Skills Required:	This project involves analysing datasets containing anonymised personal information, so information governance training will be provided. Creativity and an interest in cancer research are expected. Other skills such as a good grip of mathematical clustering and statistics will be essential.
Skills Desired:	Experience using Matlab, SQL or statistical software will be advantageous.
Project Open to:	Undergraduates, Part III (master's) students, PhD Students
Deadline to register interest:	Midnight on Friday 31st March 2017
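The rule described for the current work (one of four event types occurring after an initial 180 days from diagnosis) might be sketched as follows. The record layout, the function name and the choice to treat "after 180 days" as strictly later than day 180 are assumptions made for illustration. The actual algorithm is not given in the proposal.

```python
from datetime import date, timedelta

# Event types listed in the description, lower-cased for matching.
RECURRENCE_EVENTS = {
    "curative chemotherapy",
    "curative radiotherapy",
    "major surgery",
    "multi-disciplinary team",
}


def flag_potential_recurrence(diagnosis_date, events, window_days=180):
    """Return the listed events dated more than `window_days` after diagnosis.

    `events` is a list of (event_date, event_type) pairs (hypothetical layout).
    """
    cutoff = diagnosis_date + timedelta(days=window_days)
    return [
        (event_date, event_type)
        for event_date, event_type in events
        if event_type.lower() in RECURRENCE_EVENTS and event_date > cutoff
    ]


# Hypothetical patient pathway.
diagnosis = date(2015, 1, 1)
pathway = [
    (date(2015, 3, 1), "Major Surgery"),          # initial treatment, inside the window
    (date(2016, 2, 1), "Curative Chemotherapy"),  # after the window: flagged
    (date(2016, 3, 1), "Follow-up"),              # not one of the listed event types
]
flagged = flag_potential_recurrence(diagnosis, pathway)
```

A fixed rule like this treats every patient and tumour site identically. The clustering approach proposed in the project is intended to go beyond that limitation.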

[Return to List]

Can the volume and type of outpatient appointments attended predict when or whether a patient will be diagnosed with cancer?

Contact Name:	Kathryn Lawrence
Contact email:	kathryn.lawrence@healthdatainsight.co.uk
Lab/Department:
Company:	Health Data Insight
Contact Address:	CPC4 Capital Park Cambridge Road Fulbourn Cambridgeshire CB21 5XE
Period of the Project:	2-3 months between June and September
Brief Description of Project:	• To use the Hospital Episode Statistics (HES) outpatients dataset linked to cancer registrations to check the volume of attendances and patterns of main specialties before a patient is diagnosed with cancer: o Which outpatient specialty areas have high interaction with cancer patients in the two years before their diagnosis? Do we see anything unusual? o Can a machine learning approach help us to identify patterns of attendance in a large cancer cohort? • Could it help us predict when a patient is about to be diagnosed with cancer? Compared with HES Admitted Patient Care, the HES outpatients dataset remains relatively unexplored in the cancer field. The intern project will focus on identifying patterns of outpatient clinic attendance for breast, colorectal and lung cancer patients, three of the most commonly diagnosed cancers.
Skills Required:	This project involves analysing datasets containing anonymised personal information, so information governance training will be provided. Creativity and an interest in cancer research are expected. Other skills such as a good grip of mathematical clustering and statistics will be essential.
Skills Desired:	For further details, please go to: http://healthdatainsight.co.uk/cancer-data-internships-2017/
Project Open to:	Undergraduates, Part III (master's) students, PhD Students
Deadline to register interest:	Midnight on Friday 31st March 2017

[Return to List]

The Data Management and Data Linkage Project Proposal

Contact Name:	Kathryn Lawrence
Contact email:	kathryn.lawrence@healthdatainsight.co.uk
Company:	Health Data Insight
Contact Address:	CPC4 Capital Park Cambridge Road Fulbourn Cambridgeshire CB21 5XE
Period of the Project:	2-3 months between June and September
Brief Description of Project:	Primary aim: To interrogate innovative processes to adopt an automated data linkage process that will work across the various types of datasets held in the Cancer Analysis System (CAS) otherwise known as the Cancer Datasets Database. Secondary aim: To identify and automate a data management cataloguing process across the multiple datasets within the Cancer Datasets Database. We would like an intern to explore different data linkage concepts which could be automated and used across the wide-ranging datasets housed in the Cancer Datasets Database.
Skills Required:	Some experience of working with a range of free software analytical tools to manipulate and visualise data effectively. An ability to think creatively, interrogate tools and data to provide innovative and practical data management solutions.
Skills Desired:	An interest in managing vast amounts of cancer data effectively, usability experience to obtain information and translate into practical solutions. For further details, please visit: http://healthdatainsight.co.uk/cancer-data-internships-2017/
Project Open to:	Undergraduates, Part III (master's) students, PhD Students
Deadline to register interest:	Midnight on Friday 31st March 2017

[Return to List]

Data Science Business Internship - Product Analytics

Contact Name:	Jason Yung
Contact email:	jyung@elevatecredit.co.uk
Company:	Elevate Credit International
Contact Address:	27b Floral Street, Amadeus House Covent Garden London WC2E 9DP United Kingdom
Period of the Project:	3 Months (mid June - mid September)
Brief Description of Project:	This is an exciting opportunity for a paid summer internship role. Working alongside other interns and with the support and guidance of Data Science professionals within Elevate, the intern role will be assigned responsibility for project work, which will enable the suitable candidate to apply knowledge gained from their course, to a customer-oriented problem, in a practical work-based setting. The successful candidate will also gain invaluable exposure to and experience across other disciplines and functions within our successful FinTech organisation. A key objective of the internship role is to link the power of data back to relatable solutions, actionable outcomes and business decisions. Core responsibilities of the role are to bring insights and analytics to support and/or predict patterns of consumer behaviour to service our customers better. This includes the scoping, design, development, monitoring and assessment of complex data models and tools. It can also involve identifying opportunities to streamline processes, gathering requirements from stakeholders and reporting back the gap analysis found after the process is complete.
Skills Required:	Experience with data, visualisation, software tools and modelling techniques, including knowledge of Data Mining techniques. SAS/SQL programming. Hands-on experience of manipulating complex datasets and a work
Skills Desired:	Linear Regression, Random Forest and SVM. Experience/knowledge of Data Science tools such as R, Python, SQL, Java, Hive and SPARK
Project Open to:	Undergraduates, Part III (master's) students, PhD Students
Deadline to register interest:	May 10th 2017

[Return to List]
