
Summer Research Programmes

 

This is a list of CMP industrial project proposals from summer 2020.


Quantum Computing Internship

Project Title Quantum Computing Internship
Contact Name Ophelia Crawford
Contact Email ophelia.crawford@riverlane.io
Company/Lab/Department Riverlane
Address 1st Floor St Andrew's House, St Andrew's Street, Cambridge, CB2 3BZ
Period of the Project 10-12 weeks, summer 2020
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest February 21
Background Information Riverlane is a research-led start-up that builds ground-breaking algorithms and software to unleash the predictive power of quantum computing. Backed by leading venture capital funds and the University of Cambridge, we work together with visionary chemical, pharmaceutical and materials companies to develop billion-dollar applications using quantum advantage.
Brief Description of the Project

You will:

- Develop an understanding of quantum algorithms and industrial applications of quantum computers

- Research, devise and develop algorithms and software to enhance Riverlane’s capabilities

- Contribute to one or more projects that are core to Riverlane’s scientific goals

- Discuss ideas with colleagues and communicate research in the form of presentations and reports

Requirements:

- A current undergraduate (at least third year) or master’s student

- Experience with at least one general purpose programming language

- Excellent critical thinking and problem-solving ability

- Strong communication skills, both written and verbal

- Ability to take initiative and to work well as part of a team

All candidates must be eligible to live and work in the UK. Please visit our website (https://riverlane.io/careers/) for further details and to apply.

References  
Prerequisite Skills  
Other Skills Used in the Project  
Programming Languages  
Work Environment You will join us at our office in Cambridge where you will have the opportunity to work alongside our team of software developers, mathematicians, quantum information theorists, computational chemists and computational physicists – all experts in their fields. Every intern will have a dedicated supervisor and will work on a project designed to make the best use of their background and skills whilst developing their knowledge of quantum computing.

 

Modelling inflation expectations in financial markets

Project Title Modelling inflation expectations in financial markets
Contact Name Belinda Lao
Contact Email Cambridge.recruitment@symmetryinvestments.com
Company/Lab/Department Symmetry Investments, Quantitative Macro Research
Address 86 Jermyn Street, London, SW1Y 6JD
Period of the Project 8-10 weeks, summer 2020
Project Open to Master's (Part III) students
Initial Deadline to register interest  
Background Information We are looking for an intern to work in the Quantitative Macro Research group at Symmetry Investments, an investment management company with approximately US$5 billion under management. The project focuses on the analysis and modelling of inflation expectations in the global economy in order to identify investment opportunities in financial markets. Inflation expectations are a key variable driving many financial asset prices.
Brief Description of the Project The project would be of interest to a student considering a career in investment management. It consists of several steps that give an intern exposure to all aspects of developing an investment strategy. First, we would review recent literature on inflation and inflation-expectations modelling by both academia and market practitioners. Second, building on this review, we would construct some simple toy models, starting, perhaps, with linear modelling frameworks (regression-based models) and then proceeding to more sophisticated approaches, including machine learning algorithms. The last step of the project is the application of the algorithms to actual financial datasets. The student will also have an opportunity to build a simple application to visualise and present the results. We will be looking for a presentation of results and conclusions towards the end of the project. The project will be pursued in close cooperation with a portfolio management team. During the internship, the student will have an opportunity to learn about practical aspects of investment and risk taking from portfolio managers.
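As a rough sketch of the regression-based first step, the toy model below fits inflation expectations to two explanatory variables with ordinary least squares. All data and coefficients are invented purely for illustration; a real project would use market and survey data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented data: realised inflation and an oil-price proxy drive next-period
# inflation expectations (illustrative only; a real project would use market data).
n = 200
infl = rng.normal(2.0, 0.5, n)      # realised inflation, %
oil = rng.normal(0.0, 1.0, n)       # demeaned oil-price change
expectations = 0.5 + 0.6 * infl + 0.2 * oil + rng.normal(0.0, 0.1, n)

# Ordinary least squares via numpy's least-squares solver.
design = np.column_stack([np.ones(n), infl, oil])
beta, *_ = np.linalg.lstsq(design, expectations, rcond=None)
print(beta)  # roughly [0.5, 0.6, 0.2]
```

With real data the same fit would be the baseline against which more sophisticated (e.g. machine learning) models are compared.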
References  
Prerequisite Skills Statistics, Simulation, Predictive Modelling
Other Skills Used in the Project Data Visualization, App Building
Programming Languages Python
Work Environment The student will work in a team. There will be opportunities to talk about the project across several other teams.

 

Health Data Insight (HDI) Team Internships 2020

Project Title Health Data Insight (HDI) Team Internships 2020
Contact Name Francesco Santaniello
Contact Email francesco.santaniello@phe.gov.uk
Company/Lab/Department Health Data Insight (HDI)
Address 5XE, Capital Park, Fulbourn, Cambridge CB21 5BQ
Period of the Project 2-3 months starting from Monday 29th June 2020
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest Closing date for applications is midnight on Friday 21st February 2020
Background Information Health researchers need access to medical data to answer important questions about diseases like cancer. However, if real data is used, patient confidentiality must be protected. Health Data Insight created the ‘Simulacrum’, a synthetic data set, to allow researchers to work on data that looks and feels a lot like the real thing, but without any potential compromise in patient confidentiality. The project we are planning for interns in 2020 involves creating a synthetic data ‘service’ that allows researchers to request data and be provided with a customised synthetic output using the most appropriate synthetic algorithms for their particular requirements.
Brief Description of the Project We are offering up to six intern places on this group project in 2020. In previous years, our intern programme assigned a single project to each intern. This year, interns will work together on the ‘super-project’ bringing their particular skills and experience to help the team develop this project from specification to final output. We are looking for enthusiastic students who want to be part of an innovative multi-disciplinary team. The team will use and refine their skills in coding, statistical modelling, machine learning, visualisation, and science communication, developing resources and tools using big data to solve real-world problems. This internship is not just about further developing specialist skills. It is also a chance to join a thriving and enthusiastic community of bright individuals and to enhance your communication, collaboration, organisational and team-working skills.
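As a loose illustration of the synthetic-data idea (this is not the Simulacrum's actual algorithm), the sketch below fits simple marginal and conditional distributions to toy records and samples new, artificial records from them:

```python
from collections import Counter, defaultdict

import numpy as np

rng = np.random.default_rng(0)

# Toy "real" records (tumour site, stage) -- invented, not Simulacrum data.
real = [("lung", "III"), ("lung", "IV"), ("breast", "II"),
        ("breast", "I"), ("lung", "IV"), ("breast", "II")] * 50

# Fit P(site) and P(stage | site) from the real records.
site_counts = Counter(s for s, _ in real)
stage_given_site = defaultdict(Counter)
for s, st in real:
    stage_given_site[s][st] += 1

def sample_synthetic(n):
    """Draw artificial records from the fitted P(site) * P(stage | site)."""
    sites, sw = zip(*site_counts.items())
    p_site = np.array(sw) / sum(sw)
    out = []
    for _ in range(n):
        s = str(rng.choice(sites, p=p_site))
        stages, cw = zip(*stage_given_site[s].items())
        out.append((s, str(rng.choice(stages, p=np.array(cw) / sum(cw)))))
    return out

# Marginals of the synthetic sample track the real data without copying any row.
synth = sample_synthetic(1000)
```

A synthetic data "service" as described would choose among far richer generators, but the principle is the same: sample from fitted distributions rather than release the underlying records.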
References

Simulacrum: https://healthdatainsight.org.uk/project/the-simulacrum/

Interns from 2019 Internship Program: https://healthdatainsight.org.uk/2019-interns-visualise-build-automate-s...

Prerequisite Skills Statistics, Probability/Markov Chains, Numerical Analysis, Simulation, Predictive Modelling
Other Skills Used in the Project Database Queries, Data Visualization, App Building, Science Communication
Programming Languages Python, MATLAB, R, C++
Work Environment Interns will work together as a team, supported by an Intern Team Lead with input from developers, project managers, analysts, science communicators and many other professionals. The main place of work will be the HDI Offices in Capital Park, Cambridge. The normal working week is 37.5 hours; we offer flexible working and 2.5 days leave per month. Interns will meet regularly to discuss their progress on the project and the Intern Team Lead will always be available either in person or online for queries and support.

 

Numerical and Analytical Modeling of Stress Dependent Oxidation of Silicon

Project Title Numerical and Analytical Modeling of Stress Dependent Oxidation of Silicon
Contact Name Vasily Suvorov
Contact Email vasily.suvorov@silvaco.com
Company/Lab/Department Silvaco Europe, UK
Address Compass Point, St Ives, Cambridgeshire, PE27 5JL Phone: +44 (0) 1480 484400
Period of the Project 1 July - 1 September, flexible
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest  
Background Information  
Brief Description of the Project Thermal oxidation of silicon is a way to produce a thin layer of oxide on the surface of a wafer in the fabrication of microelectronic structures and devices. The technique forces oxygen to diffuse into the silicon wafer at high temperature and react with it to form a layer of silicon dioxide: Si + O2 -> SiO2. The oxide layers are used for the formation of gate dielectrics and device isolation regions. With decreasing device dimensions, precise control of oxide thickness becomes increasingly important. In 1965 Bruce Deal and Andrew Grove proposed an analytical model that satisfactorily describes the growth of an oxide layer on the plane surface of a silicon wafer[1]. Despite the successes of the model, it does not explain the retarded oxidation rate of non-planar, curved silicon surfaces. The real cause for the observed retardation behaviour is believed to be the effect of viscous stress on the oxidation rate [2-3]. In this project, we aim to obtain the asymptotic behaviour of the oxidation rate of the curved silicon surfaces at short and long oxidation times. The approach that we will use is a combination of asymptotic and numerical analyses of a system of non-linear ordinary and partial differential equations used for this problem in the work of J.D.Evans and J.R.King [4]. Silvaco's own software product may also be used as a tool in this project if required.
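For orientation, the Deal-Grove relation x² + Ax = B(t + τ) can be solved in closed form, and its short-time (linear, x ≈ (B/A)t) and long-time (parabolic, x ≈ √(Bt)) asymptotics checked numerically. The coefficients below are illustrative only, not calibrated to any real process:

```python
import numpy as np

def deal_grove_thickness(t, A, B, tau=0.0):
    """Oxide thickness x(t) solving x**2 + A*x = B*(t + tau) (Deal-Grove)."""
    return 0.5 * A * (np.sqrt(1.0 + (t + tau) / (A ** 2 / (4.0 * B))) - 1.0)

# Illustrative coefficients only (um and um^2/hr; not calibrated to a real process).
A, B = 0.165, 0.0117
t = np.array([0.01, 0.1, 1.0, 10.0, 100.0])
x = deal_grove_thickness(t, A, B)

# Short times: linear regime x ~ (B/A) * t; long times: parabolic x ~ sqrt(B * t).
print(x[0] / ((B / A) * t[0]))     # linear-regime ratio, near 1 at short times
print(x[-1] / np.sqrt(B * t[-1]))  # parabolic-regime ratio, tending to 1 at long times
```

The project's harder question is how these asymptotics change on curved surfaces, where stress modifies the coefficients and the planar Deal-Grove picture breaks down.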
References

[1] B. E. Deal, A. S. Grove (1965), General relationship for the thermal oxidation of silicon, Journal of Applied Physics, Vol. 36, N 12, 3770-3778.

[2] D. B. Kao, J. P. McVittie, W. D. Nix, K. C. Saraswat (1988), Two-dimensional thermal oxidation of silicon - I. Experiment, IEEE Transactions on Electron Devices, Vol. ED-34, N 5, 1008-1017.

[3] D. B. Kao, J. P. McVittie, W. D. Nix, K. C. Saraswat (1988), Two-dimensional thermal oxidation of silicon - II. Modeling stress effects in wet oxides, IEEE Transactions on Electron Devices, Vol. ED-35, N 1, 1008-1017.

[4] J. D. Evans and J. R. King (2017), Stress-dependent local oxidation of silicon, SIAM J. Appl. Math., 77(6), 2012-2039.

Prerequisite Skills Numerical Analysis, PDEs, Mathematical Analysis
Other Skills Used in the Project Mathematical physics, Fluids, Simulation, Predictive Modelling
Programming Languages C++, No Preference
Work Environment The student will be a part of a small team

 

Reinforcement Learning and Option Pricing

Project Title Reinforcement Learning and Option Pricing
Contact Name Nicolas Leprovost
Contact Email nicolas.leprovost@bp.com
Company/Lab/Department BP
Address 20 Canada Square. E145NJ London
Period of the Project 8+ weeks
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest February 21
Background Information When operating a facility such as a natural gas storage or a thermal plant, a trader needs to come up with a strategy on how to maximise her profit. One difficulty lies in the fact that some of the variables (such as future prices) are uncertain. The traditional way to address this problem is to use stochastic control theory to compute the optimal policy in order to maximise profit. This problem can also be addressed in the context of reinforcement learning where the optimal policy is learnt by a machine. The training can be done on synthetically generated data (simulations) or real data.
Brief Description of the Project The aim of the project is to research how to use reinforcement learning to learn optimal policy for options pricing. As a first step, we will confine ourselves to the study of American options where existing literature can be extended (see reference). As a second step, we want to extend this procedure to multi-exercise options such as natural gas storage and power plant valuation.
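For context, a standard non-RL baseline for American option pricing is least-squares Monte Carlo (Longstaff-Schwartz), which the fitted Q-learning approach in the reference generalises. A minimal sketch, with illustrative Black-Scholes parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative Black-Scholes parameters for an American put.
S0, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0
steps, paths = 50, 20000
dt = T / steps
disc = np.exp(-r * dt)

# Simulate geometric Brownian motion price paths under the risk-neutral measure.
z = rng.standard_normal((paths, steps))
S = S0 * np.exp(np.cumsum((r - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z, axis=1))
S = np.hstack([np.full((paths, 1), S0), S])

# Backward induction: regress the continuation value on a polynomial in S and
# exercise where the immediate payoff beats it (least-squares Monte Carlo).
cash = np.maximum(K - S[:, -1], 0.0)
for t in range(steps - 1, 0, -1):
    cash *= disc
    itm = K - S[:, t] > 0                       # regress on in-the-money paths only
    coef = np.polyfit(S[itm, t], cash[itm], 2)
    continuation = np.polyval(coef, S[itm, t])
    exercise = K - S[itm, t]
    ex_now = exercise > continuation
    idx = np.where(itm)[0][ex_now]
    cash[idx] = exercise[ex_now]
price = disc * cash.mean()
print(round(price, 2))  # around 6.0 for these parameters
```

The regression step is exactly where an RL formulation substitutes a learned Q-function, and multi-exercise options (storage, power plants) extend the same backward induction with an inventory state.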
References “Q-Learner in the Black-Scholes(-Merton) Worlds” by Igor Halperin, https://arxiv.org/abs/1712.04609
Prerequisite Skills Statistics, Probability/Markov Chains
Other Skills Used in the Project Simulation
Programming Languages Python
Work Environment Part of a team, supervised by one senior team member. Office based with regular working hours. Participation to weekly team meetings.

 

Algorithm development and modelling for security applications.

Project Title Algorithm development and modelling for security applications.
Contact Name Sam Pollock
Contact Email

careers@iconal.com

Please include CMP in the subject line

Company/Lab/Department Iconal Technology
Address St Johns Innovation Centre Cowley Road, CB4 0WS
Period of the Project At least 8 weeks, June or earlier start
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest We will start interviewing as soon as we receive applications, but applications up to the end of March will be considered if we haven't already filled the position.
Background Information We are a Cambridge based consultancy carrying out research and development in new and emerging technologies for security, offering independent, impartial, science-based advice. This will be our third year offering CMP placements, and we are looking for keen, innovative, self-motivated individuals who are interested in the practical application of maths to solve real-world problems. You will be working in a small friendly (we like to think) team of scientists and engineers, and contributing directly to the output of current projects.
Brief Description of the Project Right now we do not know exactly what the student project will entail, as we work in a very rapidly evolving field. This year's projects are likely to be focused on one or more of: developing algorithms and machine learning solutions to analyse complex sensor data; building event-based simulations of security processes (including data collection and analysis from field observations); or helping with tests and trials of technology. Our work is highly varied and interesting and you will likely get stuck in with all aspects of the job!
References http://www.iconal.com
Prerequisite Skills Statistics, Probability/Markov Chains, Simulation, Data Visualization, Knowledge of queuing theory advantageous.
Other Skills Used in the Project Statistics, Probability/Markov Chains, Mathematical physics, Numerical Analysis, Image processing, Simulation, Predictive Modelling, Database Queries, Data Visualization, App Building
Programming Languages Python, MATLAB, R, C++. Python preferred (as it's our main one), but we can consider other languages if relevant.
Work Environment We are a small friendly team of 6 people, all working on a range of interesting diverse projects. The student will be based in our main office (or lab for data gathering) working on one or more projects with us, with a mentor on each project to help with queries, reviewing work and assigning tasks.

Mathematical Finance in the Energy Sector

Project Title Mathematical Finance in the Energy Sector
Contact Name Lee Momtahan
Contact Email lee.momtahan@centrica.com
Company/Lab/Department Centrica Energy Marketing and Trading
Address Park House, 116 Park Street, London W1K 6AF
Period of the Project 8 weeks
Project Open to Master's (Part III) students
Initial Deadline to register interest February 21
Background Information This internship would be working as part of the Quantitative Analytics Team within Centrica Energy Marketing and Trading. We use financial mathematics to help traders value, risk manage and optimise complex energy assets.
Brief Description of the Project

There are a number of project opportunities, but we won't know until nearer the time exactly what you will be working on. It could be one of the following; if not, this should give you some flavour of the kind of work on offer.

• LNG Price Process calibration: Recently we developed a model of our LNG (Liquefied Natural Gas) portfolio, consisting of source contracts, sink contracts and LNG ships. Our model optimises the dispatch of ships to maximise the value of the portfolio. We also simulate the stochastic evolution of LNG and other commodities relevant to the portfolio valuation, and our model considers additional value that may be produced by reoptimising the portfolio (changing dispatch decisions) as prices evolve. This project would look at ways to improve the calibration of the price process model and/or changes to the model itself to make it more accurate.

• LNG Optimisation Performance: In relation to the LNG model described above, this project would look at the numerical techniques we are using to find the optimal dispatch solution, and consider ways to improve performance. We are using a combination of Mixed Integer Linear Programming and heuristics.

• Back-testing: Back-testing consists of simulating how a trader would have performed in the past if they had followed the hedging advice given by our model to manage the risk of unfavourable price moves which could undermine the value of their complex assets. This gives a way to test the accuracy of our models and how well they have been calibrated, as well as testing different trading strategies. This project would consist of making improvements to our back-testing framework, using the framework to run numerical experiments, and using the output of the back-testing to feed into improvements to our models and the associated model calibration approaches.
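As a toy version of the back-testing idea, the sketch below delta-hedges a short European call along one simulated price path using Black-Scholes deltas, and records the hedging error; all parameters are invented for illustration and the real framework is far richer:

```python
import numpy as np
from math import erf, exp, log, sqrt

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call_price(S, K, r, sigma, tau):
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    return S * norm_cdf(d1) - K * exp(-r * tau) * norm_cdf(d2)

def bs_call_delta(S, K, r, sigma, tau):
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * sqrt(tau))
    return norm_cdf(d1)

# Back-test in miniature: sell one call, collect the model premium, then
# delta-hedge along a simulated path; a good model leaves a small hedging error.
rng = np.random.default_rng(7)
S0, K, r, sigma, T, steps = 100.0, 100.0, 0.01, 0.2, 0.5, 126
dt = T / steps
S, shares = S0, 0.0
cash = bs_call_price(S0, K, r, sigma, T)    # premium received for the short call
for i in range(steps):
    tau = T - i * dt
    target = bs_call_delta(S, K, r, sigma, tau)
    cash -= (target - shares) * S           # rebalance the stock holding
    shares = target
    cash *= exp(r * dt)                     # interest on the cash account
    S *= exp((r - 0.5 * sigma ** 2) * dt + sigma * sqrt(dt) * rng.standard_normal())
payoff = max(S - K, 0.0)
hedging_error = shares * S + cash - payoff  # near zero if the deltas were right
```

Running the same loop over many historical or simulated paths, and summarising the distribution of `hedging_error`, is the essence of a back-testing framework.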

References  
Prerequisite Skills Statistics, Probability/Markov Chains, Numerical Analysis
Other Skills Used in the Project  
Programming Languages Python
Work Environment The student will spend most of their time working with the Quant team based in Mayfair; they may be able to spend some time working remotely.

 

Imaging data design and analysis for the pharmaceutical industry

Project Title Imaging data design and analysis for the pharmaceutical industry
Contact Name Piotr Orlowski
Contact Email piotr.x.orlowski@gsk.com
Company/Lab/Department GSK
Address Gunnels Wood Road, Stevenage SG1 2NY
Period of the Project 12 weeks
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest February 21
Background Information Imaging is at the heart of the digital transformation at GSK. Adapting recent advances and research in artificial intelligence and machine learning (AI & ML) applied to computer vision is expected to be a cornerstone of increasing our drug discovery rate (see reference material). To this end, the aim of our group is to provide access to narrow, deep and accurate imaging data at scale to R&D stakeholders.
Brief Description of the Project

Our work is divided into 4 programmes:

1. Streamlining data and tools provisioning for AI/ML or AutoML analysis

2. Data and tools quality control

3. Imaging pipeline optimization

4. Multi-data type and multi-resolution data integration

We plan to gather a diverse group of up to 6 students to support this portfolio and will adapt individual projects to your talents and ambitions for tangible impact (specific examples will be discussed during the presentation).

References Panel discussion on AI in pharma R&D at BIO 2019: https://www.youtube.com/watch?v=B5s5zJPlYiY&feature=emb_title
Prerequisite Skills Image processing, Database Queries
Other Skills Used in the Project  
Programming Languages Python
Work Environment Students will be part of a multidisciplinary team involving pathologists, data design experts and AI-ML engineers. We intend to recruit up to 6 summer placement students through CMP and other GSK programmes to create a stimulating and supportive group.

 

Shape of word embedding

Project Title Shape of word embedding
Contact Name Danijela Horak, Reza Khorshidi
Contact Email danijela.horak@aig.com
Company/Lab/Department AIG, Investments AI
Address danijela.horak@aig.com, reza.khorshidi@aig.com
Period of the Project beginning of July to end of September
Project Open to Master's (Part III) students
Initial Deadline to register interest February 21st
Background Information An important trend in the field of Natural Language Processing is the quest for universal embeddings: embeddings that are pre-trained on a large corpus and can be plugged into a variety of downstream task models (sentiment analysis, classification, translation…) to automatically improve their performance. However, it is not always clear how to evaluate and compare these embeddings. Topological data analysis (TDA) is a recent field of machine learning that uses topological invariants to characterise the shape of point cloud data sets. A number of stable distances between these topological invariants, reflective of the shape of the data, have recently been developed.
Brief Description of the Project Use TDA to determine the ideal dimension of word embeddings. The idea is to measure the topological signature of the point cloud data, where points represent word vectors, and to assess whether there is any significant difference in the topological signatures of these shapes when we vary the dimension of the word embedding. Can this result help us determine the ideal dimension of a word embedding? Can this be shown experimentally in some downstream task?
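One self-contained way to compute a (0-dimensional) topological signature without external TDA libraries is via the minimum spanning tree of the point cloud, whose edge lengths give the H0 persistence values; real experiments would use higher-dimensional homology via a package such as ripser. Random Gaussian clouds stand in for word embeddings here:

```python
import numpy as np

def h0_persistence(points):
    """0-dimensional persistence of a point cloud: the sorted edge lengths of a
    minimum spanning tree of the pairwise-distance graph (Prim's algorithm)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    n = len(points)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = d[0].copy()          # cheapest connection of each point to the tree
    deaths = []
    for _ in range(n - 1):
        best[in_tree] = np.inf
        j = int(np.argmin(best))
        deaths.append(best[j])  # scale at which point j merges into the tree
        in_tree[j] = True
        best = np.minimum(best, d[j])
    return np.sort(np.array(deaths))

rng = np.random.default_rng(0)
# Gaussian point clouds stand in for word-vector sets at two embedding dimensions.
sig50 = h0_persistence(rng.standard_normal((100, 50)))
sig300 = h0_persistence(rng.standard_normal((100, 300)))
# Compare the signatures, e.g. by total persistence (grows with dimension here).
print(sig50.sum() < sig300.sum())  # True
```

Comparing such signatures across embedding dimensions, via a stable distance on persistence diagrams, is the kind of experiment the project envisages.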
References

Mikolov, T. et al. Distributed Representations of Words and Phrases and their Compositionality, NIPS 2013

Khrulkov, V. and Oseledets, I. V. Geometry score: A method for comparing generative adversarial networks. In ICML, 2018

Prerequisite Skills Geometry/Topology, Predictive Modelling
Other Skills Used in the Project Statistics
Programming Languages Python
Work Environment The student will work under the supervision of the contact persons. The Investments AI department consists of ca. 25 Machine Learning Scientists and Engineers, each of whom is a potential discussion partner. The usual working hours are 9-17. Remote work is possible.

 

Zero-/One-/Few-shot Learning for Financial Knowledge Graphs

Project Title Zero-/One-/Few-shot Learning for Financial Knowledge Graphs
Contact Name Jinhua Du, Karo Moilanen
Contact Email Danijela.Horak@aig.com
Company/Lab/Department Investments AI, AIG
Address jinhua.du@aig.com, Karo.Moilanen@aig.com
Period of the Project 8-10 weeks
Project Open to Master's (Part III) students
Initial Deadline to register interest February 21
Background Information Artificial intelligence (AI) has been widely applied in many industry domains such as finance and investments. A new trend in AI is the shift from data-driven to knowledge-driven systems, such as Google's knowledge-based search. A knowledge graph is a graph-structured knowledge base that stores and provides knowledge to AI systems, and it plays an increasingly important role in AI applications such as anomaly detection, risk control, and smart customer service in the finance and investments domains. This project will investigate state-of-the-art (SOTA) AI models that extract information from text for knowledge graphs under low-resource conditions, i.e. with no or few training examples. The project fits real-world problems very well and has strong prospects for industrial application.
Brief Description of the Project

Knowledge graphs are playing an increasingly important role in fundamental natural language processing (NLP) tasks and applications such as question answering, dialogue, and reasoning. The majority of state-of-the-art (SOTA) methods for expanding knowledge graphs via new, richer relations rely on large-scale training data sets. In contrast with the vanilla domains and knowledge bases that have been popular in academic research to date, SOTA methods require a substantial development effort in a less explored domain such as asset management, which is replete with highly fine-grained extralinguistic relations.

A number of zero-, one-, and few-shot learning methods have been proposed to tackle challenging learning scenarios which involve only few or no labelled training examples, reflecting either resource scarcity or infrequent real-world events. This internship project investigates the limits of state-of-the-art zero-/one-/few-shot learning methods for new relation prediction in the context of knowledge graphs in asset management. Through this project, the intern will learn about SOTA deep learning models and how to build and apply them to NLP tasks such as entity relation extraction. The intern will be supervised by a senior NLP scientist; we also hope the project will lead to research papers, and that its outcomes can be applied to AIG's products.
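As a minimal illustration of relation prediction in a knowledge graph (not the project's actual models), the sketch below scores candidate triples with a TransE-style translation distance; the embeddings and names are random, hypothetical stand-ins for trained ones:

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 16

# Random stand-ins for trained entity/relation embeddings (hypothetical names).
entities = {name: rng.standard_normal(dim) for name in ["AIG", "New_York", "London"]}
relation = rng.standard_normal(dim)  # e.g. a "headquartered_in" relation vector

def transe_score(h, r, t):
    """TransE plausibility of a triple (h, r, t): higher (less negative) is better,
    since TransE models a true triple as h + r being close to t."""
    return -float(np.linalg.norm(h + r - t))

# Rank candidate tail entities for the query ("AIG", "headquartered_in", ?).
scores = {t: transe_score(entities["AIG"], relation, entities[t])
          for t in ["New_York", "London"]}
best = max(scores, key=scores.get)
```

Zero-/few-shot methods replace the per-relation trained vector with one predicted from a handful of support triples or from textual descriptions of the relation.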

References

1. One-Shot Relational Learning for Knowledge Graphs: https://www.aclweb.org/anthology/D18-1223/

2. Few-Shot Knowledge Graph Completion: https://arxiv.org/pdf/1911.11298.pdf

3. Rethinking Knowledge Graph Propagation for Zero-Shot Learning: https://arxiv.org/pdf/1805.11724.pdf

Prerequisite Skills Statistics, Probability/Markov Chains, Numerical Analysis, Mathematical Analysis, Algebra/Number Theory, Predictive Modelling
Other Skills Used in the Project  
Programming Languages Python
Work Environment The student will work in a team. We have an NLP scientist team, a machine learning scientist team and an engineering team, so the student can discuss the project with any member of these teams. Working time is flexible; normally the student will work in the office with us, but remote working (e.g. from home) is possible.

 

Visualising and analysing 3D image stacks

Project Title Visualising and analysing 3D image stacks
Contact Name Sara Schmidt
Contact Email sara.x.schmidt@gsk.com
Company/Lab/Department GSK
Address Gunnels Wood Road, Stevenage, SG1 2NY, United Kingdom
Period of the Project 10 weeks between late June and 30 September
Project Open to Master's (Part III) students
Initial Deadline to register interest February 21
Background Information The lengthy and costly drug development process and associated high failure rates could be improved with the use of more relevant cellular models in the early stages of development. Complex 3D or 4D models such as organoids or spheroids can provide better insight into drug effect and mechanisms, and although we have the technology to acquire these images in a high throughput environment, data analysis is a bottleneck. The analysis of 3D or 4D image stacks is very challenging, and currently available software is unable to visualise and analyse the data at the required speed and quality necessary for large-scale experiments.
Brief Description of the Project We propose a summer internship project for a student who is interested in image analysis and software development. The project will initially be to research and try out existing methods of 3D image stack analysis using open-source packages (most likely Python and ImageJ). Upon choice of a suitable algorithm (and any edits implemented), the student should design (and, if successful, begin to build) an advanced image-analytics platform that enables both the visualisation and analysis of these image sets at scale.
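A minimal starting point for the "try existing methods" phase might look like the following: smooth, threshold and label a synthetic 3D stack with scipy.ndimage. The data and thresholds are invented for illustration; real spheroid images would need far more careful segmentation:

```python
import numpy as np
from scipy import ndimage

# Synthetic 3D stack: two bright "spheroids" on a noisy background (invented data).
rng = np.random.default_rng(0)
stack = rng.normal(0.0, 0.1, (40, 64, 64))
zz, yy, xx = np.ogrid[:40, :64, :64]
stack[((zz - 12) ** 2 + (yy - 20) ** 2 + (xx - 20) ** 2) < 36] += 1.0
stack[((zz - 28) ** 2 + (yy - 44) ** 2 + (xx - 44) ** 2) < 64] += 1.0

# Minimal 3D analysis: smooth, threshold, then label connected components in 3D.
smoothed = ndimage.gaussian_filter(stack, sigma=1.5)
mask = smoothed > 0.3
labels, n_objects = ndimage.label(mask)
volumes = ndimage.sum(mask, labels, index=range(1, n_objects + 1))  # voxels per object
print(n_objects)  # two labelled spheroids
```

The project's challenge is doing this, plus visualisation, at the speed and scale required for high-throughput screens, which is where a dedicated platform comes in.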
References  
Prerequisite Skills Image processing, Data Visualization
Other Skills Used in the Project Image processing, Data Visualization, App Building
Programming Languages Python, ImageJ/JavaScript
Work Environment Part of a team interacting with data scientists and imaging experts. Wider community of other CMP students

 

Verification of stress model vs known analytical solution

Project Title Verification of stress model vs known analytical solution
Contact Name Artem Babayan
Contact Email artem.babayan@silvaco.com
Company/Lab/Department Silvaco Europe
Address Artem Babayan, Silvaco Europe, PE27 5JL St Ives, Cambridgeshire
Period of the Project 8 weeks, summer/early autumn
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest February 21
Background Information  
Brief Description of the Project

Silvaco develops Electronic Design Automation (EDA) and Technology CAD (TCAD) software. One of the modules is the tool for simulating stresses inside the electronic devices (caused by bending, heating, internal stresses etc. during device manufacture). Your task would be:

- research standard stress problems for which an analytical solution is available

- set up the problem within the Silvaco Stress Simulator

- compare the known analytical solution with the results obtained from the simulator

- optionally, set up a simple problem which can be solved numerically using alternative tools (e.g. MATLAB) and compare these results with Silvaco's.

Depending on the results, the project may lead to an academic paper or conference publication.
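The verification workflow in miniature: solve a 1D elasticity problem with a simple finite-difference scheme and compare against its known analytical solution. This is only a sketch of the methodology, not of Silvaco's simulator:

```python
import numpy as np

# Verification in miniature: a fixed-free bar under uniform axial load satisfies
# u''(x) = -1 on (0, 1) with u(0) = 0 and u'(1) = 0 (taking E*A = q = 1), whose
# analytical solution is u(x) = x - x**2 / 2. Solve it by finite differences
# and compare -- the same workflow as checking a simulator against theory.
n = 101
x = np.linspace(0.0, 1.0, n)
h = x[1] - x[0]

A = np.zeros((n, n))
b = np.full(n, -h ** 2)                 # interior rows: u_{i-1} - 2 u_i + u_{i+1} = -h^2
for i in range(1, n - 1):
    A[i, i - 1], A[i, i], A[i, i + 1] = 1.0, -2.0, 1.0
A[0, 0], b[0] = 1.0, 0.0                # u(0) = 0
A[-1, -1], A[-1, -2], b[-1] = 1.0, -1.0, 0.0   # one-sided u'(1) = 0 (first order)
u = np.linalg.solve(A, b)

exact = x - x ** 2 / 2.0
print(np.max(np.abs(u - exact)))  # small discretisation error (O(h) from the BC)
```

Real stress benchmarks (bent beams, plates with holes, thermal mismatch) follow the same pattern: a textbook closed-form solution on one side, the simulator's field on the other, and the discrepancy quantified as the mesh is refined.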

References  
Prerequisite Skills Mathematical physics, Numerical Analysis, PDEs
Other Skills Used in the Project Simulation
Programming Languages Python, MATLAB, C++
Work Environment The student will be based in the Silvaco building in St Ives, an office of ~15 people. The student is expected to work independently, with advice available from the team; communication with the office in the US may also be required. Normal office hours (9 to 5:30) are expected, with reasonable flexibility.

 

Installation, Development and Long-Term Stability in Dialectical Behaviour Therapy (DBT) : The Productive Team

Project Title Installation, Development and Long-Term Stability in Dialectical Behaviour Therapy (DBT) : The Productive Team
Contact Name Richard Hibbs
Contact Email richard.hibbs@extra-ibs.com
Company/Lab/Department Integral Business Support Ltd trading as British Isles DBT Training
Address Croesnewydd Hall, Wrexham Technology Park, WREXHAM LL13 7YP
Period of the Project 8 weeks
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest 21 February
Background Information Service users with a diagnosis of borderline personality disorder (BPD) are high utilizers of services and frequently have poor outcomes in standard psychological and psychiatric treatment. Dialectical Behaviour Therapy (DBT) is still a relatively new treatment (first tested in 1991) that has demonstrated significant clinical benefits for highly suicidal service users with a BPD diagnosis in research studies. Despite a growing awareness of the potential benefits of DBT, large parts of the NHS do not provide this treatment and where services do provide the treatment, often the capacity of programmes does not meet local demand. DBT requires teams of therapists to work together in new ways with service users and requires extensive training to implement. NHS commissioners who have already invested in DBT therefore need to know if their programme is maximally productive and how to intervene to improve productivity if it is not.
Brief Description of the Project

This is an observational study on gains in health-related quality of life (QALYs) using Data Envelopment Analysis (DEA). A pseudonymised data repository at www.dbt.uk.net holds EuroQol EQ-5D scores for service users and team information on staff numbers and clinical time devoted to each of 5 treatment modalities. In a pilot study (n=48) using the EQ-5D scale to measure health status on admission and discharge, Swales et al. (2016) found that admissions to 7 DBT programmes in a mix of settings for which both scores were available showed an average gain of 0.28 on the EQ-5D, equivalent to an effect size of 0.68 (p<0.01), which exactly agrees with formal meta-analyses (Kliem et al., Ost et al.).

Define the 'health acceleration' of a client as the average rate at which (s)he is gaining QALYs at time t. Then team productivity can be defined as the combined health acceleration of all clients in treatment at time t. This measure varies with a) the number of clients in treatment, b) the treatment effect size, and c) length of stay (LOS). Productivity is an industrial concept and our current success criterion is therefore to validate 'health acceleration' as a generalisable clinical productivity measure in Personality Disorder settings. There are no sampling thresholds for DEA other than that the number of teams under scrutiny ('decision-making units' in DEA terms) should be no less than 5 times the number of resource input and production output variables combined.

The definition of what constitutes an input and an output is controversial in many public sector applications of DEA, and fundamental disagreements over whether a variable is an input or output are not uncommon. We propose to identify inputs and outputs according to the following formula:

8 Inputs: number of clients treated, WTE staffing in 5 DBT modalities (DBT consult, skills group, individual, telephone consultation, structuring environment), cumulative hours of trademark 10-day intensive DBT training undertaken by team since inception, cumulative hours external DBT supervision received by team

2 Outputs: QALYs gained by clients not previously admitted, QALYs gained by readmitted clients

Amongst these inputs/outputs, perhaps the most controversial is selecting number of clients treated as an input rather than an output, and introducing a distinction between admissions and readmissions when calculating QALY outcomes. The student may be required to simulate/bootstrap data to fully explore the DEA model.
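To make the linear-programming side concrete, here is a minimal sketch of the input-oriented CCR model (one common DEA formulation) using `scipy.optimize.linprog`. The six teams, two inputs and single output below are invented for brevity; the study itself proposes 8 inputs and 2 outputs.

```python
import numpy as np
from scipy.optimize import linprog

# Toy data: 6 teams, 2 inputs (WTE staffing, clients treated), 1 output
# (QALYs gained). Figures are illustrative only.
X = np.array([[3.0, 40], [2.5, 35], [4.0, 60], [5.0, 45], [2.0, 20], [3.5, 50]])
Y = np.array([[11.0], [9.0], [17.0], [10.0], [6.0], [13.0]])
n = len(X)

def ccr_efficiency(o):
    """Input-oriented CCR efficiency of team o: minimise theta such that a
    weighted mix of teams uses <= theta * inputs of o while producing
    >= outputs of o."""
    c = np.r_[1.0, np.zeros(n)]                      # minimise theta
    A_in = np.c_[-X[o][:, None], X.T]                # sum_j l_j x_ij <= theta x_io
    A_out = np.c_[np.zeros((Y.shape[1], 1)), -Y.T]   # sum_j l_j y_rj >= y_ro
    A_ub = np.vstack([A_in, A_out])
    b_ub = np.r_[np.zeros(X.shape[1]), -Y[o]]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] + [(0, None)] * n)
    return res.x[0]

scores = [ccr_efficiency(o) for o in range(n)]
```

Teams with a score of 1 lie on the efficient frontier; scores below 1 indicate how far inputs could shrink while preserving output.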

References https://core.ac.uk/reader/81860218
Prerequisite Skills Statistics, Simulation, Database Queries, Linear programs (maximise Y subject to constraints C)
Other Skills Used in the Project Predictive Modelling, Database Queries, Data Visualization, Data Envelopment Analysis (DEA)
Programming Languages No Preference
Work Environment The student will work normal office hours at Croesnewydd Hall alongside non-mathematicians, and remotely with Bangor University as required, where additional consultation with health economists and/or statisticians is available.

 

Towards an operational quality metric for volcanic ash detection

Project Title Towards an operational quality metric for volcanic ash detection
Contact Name Dr. Joelle Buxmann, Dr. Debbie O'Sullivan and Martin Osborne
Contact Email joelle.c.buxmann@metoffice.gov.uk
Company/Lab/Department Observations Research & Development, the Met Office
Address Met Office, Fitzroy Rd, Exeter EX1 3PB, Exeter
Period of the Project 8 weeks between late June and 30 September suggested; could consider an alternative period
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest 21 February
Background Information The Met Office operates a unique ground-based operational network of ten depolarisation Raman lidars (LIght Detection And Ranging), as well as over 40 LCBRs (Laser Cloud Base Recorders, or ceilometers), in order to obtain atmospheric profiles of aerosols (wet and dry particles) and clouds. All Raman lidars are accompanied by a sun photometer to provide additional information such as column-integrated aerosol size distributions. The lidars are research-grade profiling instruments set in an operational network and work similarly to a RADAR (RAdio Detection And Ranging) but with light instead of radio waves. They provide the unique capability of both qualitative and quantitative monitoring of VA (volcanic ash) over the UK for the London VAAC (Volcanic Ash Advisory Centre) within the Met Office. It is the only fully operational network run by a Met service able to identify aerosol types, including VA, dust, biomass burning and pollution, and furthermore to provide aerosol mass concentrations. The LCBRs run continuously and mainly provide cloud base height and cloud coverage. Raman lidars as well as LCBRs benefit from an internal visualisation tool that provides colour time-height profiles, mainly to support Met Office forecasters. LCBR and sun photometer data are also visualised through external network websites (links given below).
Brief Description of the Project

The data of the network are used on a daily basis by our forecasters, as well as during special events such as dust storms or volcanic eruptions. The network therefore needs to be kept in excellent operational status, which is ensured by a network management team as well as scientists working alongside engineers. However, there is currently no quality metric for these instruments.

The project is to develop an algorithm providing a quality metric that works in real time on time series of instrument output to isolate and track instrumental bias errors. The time series data are subject to several sources of uncertainty, which need to be characterised, and filters applied to optimise the algorithm's performance. The lidar constant L_c could be used as the metric parameter. L_c depends only on the state of the machine – specifically the laser power, its polarisation state and several factors relating to the efficiency of the optics and detectors. While it is called a "constant", it changes slightly over time and varies between instruments and between the three channels of the Raman lidar.

Another necessary aspect is the development of a cloud identification and filtering method, since L_c can only be calculated in a cloud-free atmosphere. The student can use image processing to identify clouds, as well as possibly comparing to data from the LCBRs and sun photometers (which already have a cloud filter). There is also an innovative aspect to this type of work: the student can bring in their own ideas and concepts and, for example, look at special events.

A useful outcome would be an online monitoring tool for cloud-free data and/or the lidar constant that flags outliers and bad data; it could be used directly by the operational team to flag any problems with the instruments. Statistical analysis of these metrics using data recorded by the lidars since 2016, together with a report, would be appreciated.
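As an illustration of one possible starting point (not the Met Office's method), a rolling robust z-score can flag sudden departures of the lidar constant from its recent history. The synthetic series below, with its slow drift, noise level and two injected faults, is entirely an assumption for demonstration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic daily lidar-constant series: slow drift plus noise, with two
# injected faults standing in for e.g. laser power drops (illustrative only).
days = np.arange(365)
l_c = 1.0 + 0.0005 * days + rng.normal(0, 0.01, days.size)
l_c[[120, 300]] -= 0.2

def flag_outliers(series, window=30, threshold=5.0):
    """Flag points deviating from a rolling median by more than `threshold`
    robust standard deviations (1.4826 * median absolute deviation)."""
    flags = np.zeros(series.size, dtype=bool)
    for t in range(series.size):
        ref = series[max(0, t - window):t + 1]
        med = np.median(ref)
        mad = np.median(np.abs(ref - med)) or 1e-9
        flags[t] = abs(series[t] - med) > threshold * 1.4826 * mad
    return flags

flags = flag_outliers(l_c)
```

The median/MAD pair tolerates the slow drift and the occasional bad point in the reference window, which a plain mean/standard deviation would not.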

References

Adam, M., Buxmann, J., Freeman, N., Horseman, A., Slamon, C., Sugier, J., and Bennett, R.: The UK Lidar-Sunphotometer Operational Volcanic Ash Monitoring Network, in: Proceedings of the 28th International Laser Radar Conference, 2017.

Osborne, M., Malavelle, F. F., Adam, M., Buxmann, J., Sugier, J., Marenco, F., and Haywood, J.: Saharan dust and biomass burning aerosols during ex-hurricane Ophelia: observations from the new UK lidar and sun-photometer network, Atmos. Chem. Phys., 19, 3557–3578, https://doi.org/10.5194/acp-19-3557-2019, 2019.

Andreas Behrendt and Takuji Nakamura, "Calculation of the calibration constant of polarization lidar and its dependency on atmospheric temperature," Opt. Express 10, 805-817 (2002)

Global sun photometer network (including Met Office instruments): https://aeronet.gsfc.nasa.gov/

European profiling network (including Met Office LCBRs): https://e-profile.eu/#/cm_profile

European lidar network: https://www.earlinet.org/index.php?id=earlinet_homepage

Prerequisite Skills Mathematical physics, Programming experience with Python
Other Skills Used in the Project Statistics, Image processing, Data Visualization, Lidar theory and data processing
Programming Languages Python
Work Environment The student would be part of the Observations R&D team based in Exeter, supervised by Joelle (Senior Scientist), Debbie (team leader) and Martin (Scientist and PhD student). This project can be done remotely from the student's home/university with support via tele/web conferencing and email. There is also the option for some or all of the project to be carried out at the Met Office. Relevant connections to academic researchers could be made, as well as to different areas within the Met Office.

 

Automating particulate image analysis

Project Title Automating particulate image analysis
Contact Name Ricky Casey
Contact Email ricky.x.casey@gsk.com
Company/Lab/Department GSK
Address Gunnels Wood Road, Stevenage, SG1 2NY
Period of the Project 8 weeks
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest February 21
Background Information Sub-visible particulate analysis is an important technique within biopharmaceutical manufacturing as it helps determine compatibility of drug molecules with the solutions they will be exposed to during their lifetime in both the lab and the body, and is a requirement from regulatory authorities. This is done using Micro Flow Imaging (MFI), which uses bright-field technology to capture images of the various types of particulates present in samples (proteinaceous, oil, air, glass etc.), as well as their size. What this doesn't do is differentiate the types of particles in an automated manner, meaning the analysis of >1000s of particles is time-consuming and subjective between scientists.
Brief Description of the Project The primary aim of this project will be to develop an algorithm, applied to all of the images captured, that automatically categorises the particles of interest by type and size, increasing analysis efficiency and increasing the quality of the data by ensuring only relevant proteinaceous particle data forms part of the sample 'result'. A stretch goal would be to combine this method with other molecule characterisation data sets to assess correlations between the different processes employed within biopharma manufacture and particulate formation.
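As a toy sketch of the kind of rule-based categorisation a student might start from, shape descriptors of labelled regions can separate round particles (air bubbles and oil droplets are typically circular) from elongated, fibre-like ones. The binary frame and the particle shapes below are simulated; real MFI images and classes would be far richer.

```python
import numpy as np
from scipy import ndimage

# Synthetic binary frame standing in for one MFI image: one round particle
# and one elongated particle. Purely illustrative.
img = np.zeros((64, 64), dtype=bool)
yy, xx = np.ogrid[:64, :64]
img[(yy - 20) ** 2 + (xx - 20) ** 2 <= 100] = True   # circle, radius 10
img[40:44, 5:55] = True                               # elongated bar

labels, n = ndimage.label(img)

def classify(region_slice):
    """Crude shape rule: the aspect ratio of the bounding box separates
    near-circular particles from elongated ones."""
    h = region_slice[0].stop - region_slice[0].start
    w = region_slice[1].stop - region_slice[1].start
    aspect = max(h, w) / min(h, w)
    return "elongated" if aspect > 3 else "round"

results = [classify(s) for s in ndimage.find_objects(labels)]
```

In practice one would extract many such descriptors (area, circularity, intensity statistics) per particle and train a classifier, but the label-then-measure loop is the same.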
References https://link.springer.com/article/10.1007/s11095-011-0590-7
Prerequisite Skills Statistics, Numerical Analysis, Image processing, Predictive Modelling, Database Queries, Data Visualization, App Building
Other Skills Used in the Project  
Programming Languages No Preference
Work Environment The student will be working independently on this project, but will sit within a team of 11 other multi-disciplinary analytical research scientists, as well as having access to GSK data scientists external to the immediate team for guidance. They will primarily be based in Stevenage, working 37.5 hr/week, with the possibility of working remotely if appropriate.

 

Energy market interpretation

Project Title Energy market interpretation
Contact Name Lois Sims
Contact Email lois.sims@centrica.com
Company/Lab/Department Centrica Energy Marketing & Trading
Address 116 Park Street, London, W1K6AF
Period of the Project 8 weeks
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest February 21
Background Information Over the past 3 decades, European power and gas markets have evolved to the point where they are today some of the most liberalised and transparent globally. In common with other industries, this process has led to an ever-increasing volume of publicly available data on which to base hedging & trading decisions. The project will focus on bringing cutting-edge statistical/mathematical techniques to bear on these data sets to derive new insight into the potential future outturn of energy prices.
Brief Description of the Project

There are several areas of interest, and the exact project can be narrowed down in discussion with the eventual student, subject to desk need and their interests. For example, as physical energy markets need to balance supply/demand in real time, the prediction of potential flows between market areas is key to understanding price formation.

The project will involve working with large datasets held by Centrica Energy Marketing and Trading for the purpose of deriving insight into future energy prices. The expectation is that the student will use these data sets with time series analysis to produce probabilistic forecasts of gas/power flows. The eventual model should be semi-autonomous, with programmatic back-testing to ensure performance - this will be the main basis of evaluation. There is also a requirement for model output to be interpreted by non-technical users, so visualisation techniques may also be needed.
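A minimal sketch of probabilistic one-step-ahead forecasting with a back-test, on synthetic AR(1) "flow" data. The model, the data-generating process and the 80% interval are all assumptions for illustration; a real project would use richer models and genuine market data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily "flow" series with AR(1) dynamics (an illustrative
# stand-in for an interconnector flow).
n = 500
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + rng.normal(0, 1.0)

# Fit AR(1) by least squares on a training window.
train = 400
phi = np.dot(x[:train - 1], x[1:train]) / np.dot(x[:train - 1], x[:train - 1])
resid = x[1:train] - phi * x[:train - 1]

# Probabilistic one-step-ahead forecast: point forecast plus empirical
# residual quantiles; back-test coverage of the 80% interval out of sample.
lo_q, hi_q = np.quantile(resid, [0.1, 0.9])
hits = [(phi * x[t - 1] + lo_q <= x[t] <= phi * x[t - 1] + hi_q)
        for t in range(train, n)]
coverage = np.mean(hits)
```

The back-test checks whether the nominal 80% interval actually covers about 80% of held-out observations, which is the kind of programmatic evaluation the project calls for.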

References  
Prerequisite Skills Statistics, Probability/Markov Chains, Simulation, Predictive Modelling, Database Queries, Data Visualization
Other Skills Used in the Project Data Visualization, App Building
Programming Languages Python
Work Environment Normal office hours. Project will be able to interact with the trading analysis teams and internal developers but will be expected to work autonomously in the main.

 

Low-rank matrix approximations within Kernel Methods

Project Title Low-rank matrix approximations within Kernel Methods
Contact Name Sammie Lai
Contact Email info@dreams-ai.com
Company/Lab/Department Dreams-ai
Address 30 Meade House, 2 Mill Park Rd, Cambridge, CB1 2FG, United Kingdom
Period of the Project 8 weeks flexible
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest 10th Feb
Background Information In machine learning, we often employ kernel methods to learn more general relations in datasets while avoiding the high computational cost of explicit projection into a feature space.
Brief Description of the Project The kernel trick often involves matrix inversion or eigenvalue decomposition, whose cost grows cubically in the number of training points. Due to large storage and computational costs, this is impractical in large-scale learning problems. One approach to this problem is low-rank matrix approximation; the most popular examples are the Nyström method and random features. We would like the student to test the feasibility of these approximations on real data.
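A minimal sketch of the Nyström approximation on toy data. The kernel, bandwidth, data and landmark count below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_kernel(A, B, gamma=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# n x n kernel matrix on toy 2-D data.
X = rng.normal(size=(300, 2))
K = rbf_kernel(X, X)

# Nystrom: approximate K from m randomly chosen landmark columns,
# K ~= C W^+ C^T, costing O(n m^2) instead of O(n^3).
m = 60
idx = rng.choice(300, size=m, replace=False)
C = K[:, idx]
W = K[np.ix_(idx, idx)]
K_approx = C @ np.linalg.pinv(W) @ C.T

rel_err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
```

Because the RBF kernel matrix has rapidly decaying eigenvalues, a modest number of landmarks usually reproduces it to within a few percent, which is exactly the feasibility question the project asks about on real data.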
References  
Prerequisite Skills Statistics, Numerical Analysis, Mathematical Analysis, Geometry/Topology, Predictive Modelling
Other Skills Used in the Project Data Visualization
Programming Languages Python, C++
Work Environment The project supervisor will provide 5 hours of supervision out of the 30 hours of working time at the office in Cambridge. A strong student will also be offered a free trip to Hong Kong to take on more maths projects.

 

Perturbation methods for assessing the robustness of machine learning algorithms

Project Title Perturbation methods for assessing the robustness of machine learning algorithms
Contact Name Andrew Thompson
Contact Email andrew.thompson@npl.co.uk
Company/Lab/Department National Physical Laboratory
Address Maxwell Centre, Cavendish Laboratory, J J Thomson Avenue, Cambridge, CB3 0HE
Period of the Project 8-10 weeks (flexible)
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest February 21
Background Information Machine learning is an area of extensive interest at the intersection of mathematics, computer science, statistics and engineering. The advent of ML has resulted in many impressive achievements in predictive modelling, e.g. artificial intelligence, medical diagnosis and financial modelling. One of the main concerns in this area is robustness assessment, i.e. what is the probability that the machine will make the wrong decision?
Brief Description of the Project The goal of the proposed project is to study an efficient approach to the robustness assessment problem via perturbation. Perturbation of statistical estimators is a very efficient approach to building confidence intervals in standard under-parametrised statistical estimation, but it has not yet been tested in the machine learning context, where over-parametrisation is the rule. Perturbation can help circumvent the curse of dimensionality inherent in Bayesian sampling, whilst providing efficient analysis of the limitations of ML for specific industrial applications. The project will involve programming, either in Matlab or Python, and possibly some theory.
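A sketch of the perturbation idea in the reference below, applied to a toy ridge regression: refitting under random observation weights gives an estimate of the estimator's sampling distribution without analytic formulas. The data, penalty and number of refits are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy regularised (ridge) regression: y = 2x + noise; the true slope is 2.
n, lam, true_beta = 200, 1.0, 2.0
x = rng.normal(size=n)
y = true_beta * x + rng.normal(0, 0.5, n)

def ridge_slope(w):
    """Ridge estimate with per-observation perturbation weights w."""
    return np.sum(w * x * y) / (np.sum(w * x * x) + lam)

# Perturbation in the spirit of Minnier et al.: refit under i.i.d.
# exponential(1) weights; the spread of the refits estimates the
# sampling distribution of the estimator.
betas = [ridge_slope(rng.exponential(1.0, n)) for _ in range(500)]
lo, hi = np.quantile(betas, [0.025, 0.975])
```

The percentile interval (lo, hi) plays the role of a 95% confidence interval for the slope; the open question the project targets is how such intervals behave for over-parametrised ML models.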
References Jessica Minnier, Lu Tian & Tianxi Cai. A Perturbation Method for Inference on Regularized Regression Estimates, Journal of the American Statistical Association 106:496, pp1371-1382 (2011).
Prerequisite Skills Programming
Other Skills Used in the Project Statistics, Numerical Analysis, Mathematical Analysis, Machine learning
Programming Languages Python, MATLAB
Work Environment The student will be based in the Maxwell Centre, Cambridge, and will have occasional meetings with Andrew Thompson based in Cambridge.

 

Data Re-Use in Clinical Trials: the development of a Shiny application

Project Title Data Re-Use in Clinical Trials: the development of a Shiny application
Contact Name Dr Doug Thompson
Contact Email douglas.x.thompson@gsk.com
Company/Lab/Department GSK, Biostatistics
Address GSK, Gunnels Wood Road, Stevenage SG1 2NY
Period of the Project 8-10 weeks (flexible)
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest Feb 21st
Background Information Efficient clinical trials are beneficial for patients. Shorter studies with fewer patients that carry the same decision-making risk as larger studies would open the door to making new treatments available to patients in need in a more efficient manner. Data re-use, e.g. from historical controls, is one way of reducing patient numbers in new studies, and we want to develop a Shiny App that helps Biostatisticians at GSK to implement this approach.
Brief Description of the Project

Quantitative Decision Making (QDM) is a fundamental concept at GSK that enables coherent decisions to be made throughout the lifecycle of a drug in clinical development: whether this is to give the green light for accelerated development, or to terminate an asset. For an appropriate QDM rule, Biostatisticians are challenged to characterise upfront the quantitative properties and risks of a given study design. The optimal study design would be the one that, for a specified rule, yields the most favourable characteristics such that a clear decision is made from the accrued patient data.

Study designs need not be restricted to using concurrently generated data only; in fact, for rare diseases it may be considered unethical to ignore external data sources in analysis and decision making. Recent interest in the topic of Dynamic Borrowing using Bayesian inference has made data re-use a key area of interest for the design of studies at GSK. Appropriate weighting of evidence (concurrent versus historical) can enable more efficient trial designs that are less onerous on patient numbers, which could promote new drugs being made available to patients sooner rather than later.
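One simple form of dynamic borrowing is a power prior, in which the historical likelihood enters raised to a weight a0 in [0, 1]. The sketch below, with invented summary statistics and a known variance, shows how a0 moves the posterior mean for a normal endpoint between the concurrent-only and fully pooled answers; it is a conceptual illustration, not GSK's methodology.

```python
import numpy as np

# Minimal normal-endpoint sketch of dynamic borrowing via a power prior:
# a0 = 0 ignores the historical arm, a0 = 1 pools it fully.
# All numbers are made up for illustration.
sigma2 = 4.0
n_hist, ybar_hist = 120, 0.9     # historical control arm summary
n_curr, ybar_curr = 40, 1.4      # concurrent control arm summary

def posterior_mean(a0):
    """Posterior mean of the control effect under a flat initial prior:
    a precision-weighted average, with the historical arm down-weighted
    by the borrowing parameter a0."""
    w_hist = a0 * n_hist / sigma2
    w_curr = n_curr / sigma2
    return (w_hist * ybar_hist + w_curr * ybar_curr) / (w_hist + w_curr)
```

A Shiny interface for such a rule would essentially expose a0 (or a data-driven version of it) and show how the design's operating characteristics change as borrowing increases.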

Through this project, we would like to develop Shiny applications that open up a user-friendly implementation of QDM and dynamic borrowing during the study design stage. This is a broad, open-ended project in which the student would be required to work closely with colleagues from two different groups: (i) the Statistical Data Sciences (SDS) team, whose remit is increasing the data science capability within Biostatistics at GSK; and (ii) the Advanced Biostatistics and Data Analytics (ABDA) team, which oversees innovative statistics used at GSK.

References  
Prerequisite Skills Statistics, Data Visualization, App Building
Other Skills Used in the Project R package development
Programming Languages R
Work Environment Expected to work onsite in the GSK Stevenage offices. There will be close collaboration between two Biostatistics groups with a mentor supporting the student from each group (SDS & ABDA).

 

Normalization of NGS data in a Data Lake

Project Title Normalization of NGS data in a Data Lake
Contact Name Giovanni Dall'Olio
Contact Email giovanni.m.dallolio@gsk.com
Company/Lab/Department Data Computational Sciences, GSK
Address Gunnels Wood Road, SG1 2NY Stevenage (UK)
Period of the Project 8 weeks, flexible
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest March 1st
Background Information In recent years GSK has invested heavily in processes to improve data quality and leveraged Big Data technology to make internal data more easily accessed and reused. One of the outcomes of this investment is a project called the NGS End-2-End pipeline. The objective of this project is to standardise the way NGS (Next Generation Sequencing) data is generated across the company: standardising the procedure for requesting the sequencing of a sample, capturing metadata along the whole process, and processing the data using the same pipeline. At the end of the process the data is stored in a Hadoop table (Big Data technology), where it is accessed by end users. This resource provides an opportunity for new exploratory work on how data generated by different groups in the company can be compared.
Brief Description of the Project The student will explore normalization methods to compare different Next Generation Sequencing RNASeq datasets generated by the NGS End-2-End pipeline, an initiative targeted at standardizing the NGS data generation pipeline at GSK. They will evaluate which factors need to be taken into account when comparing datasets produced by different machines and platforms, and across different projects and pipelines, and evaluate how to identify batch-effects. This project is relatively open-ended and the student will have space to explore different solutions, as well as working with a curated dataset. Knowledge of NGS is not required although some preliminary understanding may be useful. Preferred programming languages would be R and Python.
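Following the TPM explanation linked in the references, here is a minimal sketch of within-sample normalisation: correct for gene length first, then for sequencing depth, so every sample sums to one million and columns become comparable. The counts and gene lengths are invented.

```python
import numpy as np

# Toy RNA-Seq counts: genes x samples, with gene lengths in base pairs.
counts = np.array([[500, 1000],
                   [100,  400],
                   [300,  300]], dtype=float)
lengths_bp = np.array([2000, 1000, 1500], dtype=float)

def tpm(counts, lengths_bp):
    """Transcripts per million: reads per kilobase, rescaled per sample."""
    rpk = counts / (lengths_bp[:, None] / 1e3)   # reads per kilobase
    return rpk / rpk.sum(axis=0) * 1e6           # scale each sample to 1e6

t = tpm(counts, lengths_bp)
```

TPM alone does not remove batch effects across machines, platforms or projects, which is exactly where the between-sample methods and batch-effect diagnostics cited below come in.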
References

- Leek et al., Nat Rev Genet 2010. Tackling the Widespread and Critical Impact of Batch Effects in High-Throughput Data

- Evans et al., Brief Bioinform 2018. Selecting Between-Sample RNA-Seq Normalization Methods From the Perspective of Their Assumptions

- RPKM, FPKM and TPM clearly explained: https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/

Prerequisite Skills Statistics, Data Visualization, Next Generation Sequencing analysis, transcriptomics, R/Python
Other Skills Used in the Project Statistics
Programming Languages Python, R
Work Environment There will be one person of reference, plus interactions with other people via Skype (some collaborators are based in the US). Another GSK group has submitted a project proposal that is related in scope, so there is the possibility to work with another student from the CMP programme, as the two projects are complementary.

 

Financial Neural ODE

Project Title Financial Neural ODE
Contact Name Danijela Horak - Kurt Cutajar - Zahra Sabetsarvestani
Contact Email Danijela.Horak@aig.com
Company/Lab/Department AIG - Investment AI department
Address 58 Fenchurch Street, London EC3M 4AB
Period of the Project 8-10 weeks, between beginning of July and end of September
Project Open to Master's (Part III) students
Initial Deadline to register interest  
Background Information  
Brief Description of the Project The goal of this internship is to extend the effectiveness of state-of-the-art neural Ordinary Differential Equations (neural ODEs) with application to financial time series data. One of the key challenges encountered in finance is that the time intervals between observed events are often irregular, and may vary significantly even among time series which should otherwise be considered jointly. Illiquid assets are a prime example of where such an issue arises, whereby trading data can only be obtained at uneven intervals. Whereas RNNs rely on having a discretised sequence of observations, the continuous dynamics of the neural ODEs proposed in 2018 mitigate this issue and can reliably model data captured at non-uniform intervals. More recent extensions, such as augmenting the space in which neural ODEs are solved, have yielded further performance improvements over the original work, and we intend to leverage these developments in our modelling practices.
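A conceptual, non-trainable sketch of the key property: the hidden state is advanced by numerically integrating a vector field over whatever time gap separates consecutive observations, so uneven intervals pose no difficulty. The weights here are random stand-ins for a trained model, and a real project would use an ODE-solver library with learned dynamics.

```python
import numpy as np

rng = np.random.default_rng(3)

# A small randomly initialised "neural" vector field f(h) drives the
# hidden state between observation times (illustrative only).
dim = 4
W = rng.normal(0, 0.5, (dim, dim))
b = rng.normal(0, 0.1, dim)

def f(h):
    return np.tanh(W @ h + b)          # learned dynamics in a real model

def evolve(h, dt, steps=20):
    """Euler-integrate dh/dt = f(h) over an interval of length dt."""
    for _ in range(steps):
        h = h + (dt / steps) * f(h)
    return h

# Irregularly spaced observation times, as with illiquid-asset trades.
times = np.array([0.0, 0.3, 1.7, 1.9, 4.0])
h = np.zeros(dim)
for gap in np.diff(times):
    h = evolve(h, gap)                 # an observation update would go here
```

Where an RNN would take one fixed-size step per observation regardless of elapsed time, the ODE view integrates for longer over longer gaps, which is what makes it natural for irregular financial time series.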
References  
Prerequisite Skills Statistics, Simulation
Other Skills Used in the Project Mathematical physics, Predictive Modelling, Database Queries
Programming Languages Python
Work Environment The student will be based at AIG and will be working in Investment AI team. However, the student has the flexibility to work remotely if needed.

 

Filtering for Trend Following Strategies

Project Title Filtering for Trend Following Strategies
Contact Name Dr Silvia Stanescu
Contact Email silvia.stanescu@gam.com
Company/Lab/Department Cantab Capital Partners
Address City House, 126-130 Hills Road, CB2 1RE, Cambridge
Period of the Project 10 weeks
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest 21st February 2020
Background Information Price momentum is a well-studied market phenomenon that has been identified in a wide range of asset classes and forms the basis of one of the best-known systematic investment strategies. Since no market participant has access to perfect information, trends can form as new information is gradually incorporated into the market price. In addition, behavioural biases such as over-reaction to the actions of other market participants can strengthen and extend trends once they have formed. Common methods of extracting trends from financial markets include employing moving averages (MAs) or moving average convergence divergence (MACD) methods, i.e. differences of moving averages over different horizons. These, however, suffer from a number of drawbacks, including that the choice of windows is often ad hoc or based on heuristic measures. This project proposes defining trend more formally and employing filtering techniques in order to extract the underlying trend in a financial time series.
Brief Description of the Project

We can think of decomposing a financial time series y into a global (trend) component x and a local (noise/cycle) component e. We can further assume that y is observable, while x and e are unobservable. The project focuses on methods for extracting (filtering) the trend component x. Broadly, we can think of these methods as functions of two terms: 1) a measure of 'distance' between the observable series y and the trend component x and 2) a measure of smoothness, or rather a penalty term which accounts for the smoothness (or lack thereof) of the trend component. The relative importance of these two terms is generally dictated by a regularization parameter. Different filtering methods can be obtained by varying either the measure of 'distance' between the observable series and the trend component, or the penalty term, or both.

The aim of this project is the estimation of the trend component under a number of alternative methodologies in this class. Furthermore, part of this estimation is also the calibration of the regularization parameter which governs the trade-off between fit and smoothness. The speed of the trend we are identifying will significantly depend on the value of this regularization parameter: for certain values, we will identify long-term trends, while for others shorter-term trends. In practice however, the speed at which series are trending is likely to change over time. Also, one may wish to investigate whether to set the parameter value globally or make it asset class, or even asset, specific.

To sum up, the aim of the project is:

- to investigate different methods for calibrating the regularization parameter in the context of the two filtering methodologies described, for example, in [2] and [3];

- to further experiment with other methods of filtering (e.g. Kalman filtering, wavelet analysis) for building trend following strategies;

- for all the filtering approaches considered, to investigate the benefits from having a time-varying regularization parameter versus a constant one; or having a cross-sectionally varying parameter versus one that is set at the same value for all assets we trade; in this context, one may wish to consider how they would avoid over-fitting to the in-sample data.

References

[1] Bruder, B., Dao, T. L., Richard, J. C., & Roncalli, T. (2011). Trend filtering methods for momentum strategies. Available at SSRN 2289097.

[2] Harris, R. D., & Yilmaz, F. (2009). A momentum trading strategy based on the low frequency component of the exchange rate. Journal of Banking & Finance, 33(9), 1575-1585.

[3] Kim, S. J., Koh, K., Boyd, S., & Gorinevsky, D. (2009). ℓ1 trend filtering. SIAM Review, 51(2), 339-360.

Prerequisite Skills Statistics, Numerical Analysis, Optimisation, Linear Algebra
Other Skills Used in the Project Probability/Markov Chains, Predictive Modelling
Programming Languages Python
Work Environment The student is welcome to join us in the City House offices on Hills Road in Cambridge on a daily basis. The entire Tech team here (composed of quantitative analysts like myself and core programmers) is extremely collaborative and will be happy to answer any questions the student may have in my absence. Ideally, working from the City House office between 9am and 5-6pm would maximize the time and feedback the student can get from me. However, we can also arrange for them to work remotely at times should they prefer to do so.

 

Modelling optionality in inflation linked securities

Project Title Modelling optionality in inflation linked securities
Contact Name Richard Manthorpe
Contact Email cambridge.recruitment@symmetryinvestments.com
Company/Lab/Department Symmetry Investments, Quantitative Analytics
Address 86 Jermyn Street, London, SW1Y 6JD
Period of the Project 8-12 weeks, summer 2020
Project Open to Master's (Part III) students
Initial Deadline to register interest  
Background Information We are looking for an intern to work in the Quantitative Analytics group at Symmetry Investments, an investment management company with approximately US$5 billion under management. This project focuses on modelling the embedded optionality in certain inflation linked securities, such as BTP Italias, an Italian inflation linked security containing a high watermark indexation feature.
Brief Description of the Project The project would be of interest to a student considering a career in investment management. It consists of several steps, allowing the intern to gain exposure to all aspects of the development of an investment strategy. First, the candidate will be introduced to the mathematics governing bond and option pricing for both nominal and inflation linked securities, reviewing the relevant literature. Secondly, the candidate will work closely with both the trading and quant teams to develop a model and the necessary analytics to evaluate these securities, and gain an understanding of how the traders assess them. An optional third stage of the project is to extend the analytics to handle other securities, such as UK LPI derivatives. We will be looking for a presentation of results and conclusions towards the end of the project. The project will be pursued in close cooperation with a portfolio management team. During the internship, the student will have the opportunity to learn about practical aspects of investments and risk taking from portfolio managers.
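A rough Monte Carlo sketch of why high-watermark indexation carries optionality: a coupon indexed to the running maximum of the inflation index dominates one indexed to its current level, path by path. The index dynamics, parameters and coupon below are invented for illustration, not a calibrated inflation model or the BTP Italia's actual payoff schedule.

```python
import numpy as np

rng = np.random.default_rng(11)

# Simulate annual log changes of an inflation index (illustrative dynamics).
n_paths, n_years = 20000, 4
drift, vol = 0.02, 0.03
shocks = rng.normal(drift - 0.5 * vol**2, vol, (n_paths, n_years))
index = np.exp(np.cumsum(shocks, axis=1))           # index level, base 1
running_max = np.maximum.accumulate(index, axis=1)  # the high watermark

coupon = 0.01
plain_pv = coupon * index.mean()       # uplift on the current index level
hwm_pv = coupon * running_max.mean()   # uplift on the high watermark
```

The gap between the two values is the premium of the embedded option, and it grows with inflation volatility, which is the quantity a pricing model for these securities must capture.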
References  
Prerequisite Skills  
Other Skills Used in the Project Statistics, Probability/Markov Chains, PDE's, Data Visualization, App Building
Programming Languages No Preference
Work Environment The student will work in a team. There will be opportunities to talk about the project across several other teams.

Prize pool and odds forecast

Project Title Prize pool and odds forecast
Contact Name Sammi Lai
Contact Email info@dreams-ai.com
Company/Lab/Department Dreams-ai
Address info@dreams-ai.com
Period of the Project Late June til late August
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest February 21
Background Information  
Brief Description of the Project In a prize-pool based betting game, the final returned odds on a bet are simply a function of the total amount bet by everybody divided by the total amount bet on the correct outcome. Therefore, every time someone places a bet, the odds for every bet type change for everybody, and only after the deadline for placing bets can the odds be finalised. In theory, if you know the sizes of all the prize pools you can determine all the odds exactly, and vice versa. The challenge here is to consider cases where we only know a subset of the odds/pool sizes: how much uncertainty is introduced, and can we leverage the relationships between bet types to improve our predictions?
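The pool/odds identity can be sketched in a few lines. The pool sizes below are invented and the operator's takeout is ignored for simplicity; with a takeout, each odds value is scaled down by a known factor and the inversion works the same way.

```python
# With no takeout, the decimal odds on outcome i are total_pool / pool_i,
# so pools determine odds exactly, and odds determine pools up to the total.
pools = [100.0, 50.0, 250.0]
total = sum(pools)
odds = [total / p for p in pools]

# Inverting: implied pool shares from odds. The reciprocals sum to 1 with
# no takeout, so observing only a subset of the odds leaves the remaining
# shares underdetermined - the source of the uncertainty the project studies.
shares = [1.0 / o for o in odds]
recovered = [s * total for s in shares]
```

When only some odds are visible, the known reciprocals bound the combined share of the unseen outcomes, which is the structure a predictive model can exploit.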
References

https://papers.nips.cc/paper/1866-using-the-nystrom-method-to-speed-up-k...

https://people.eecs.berkeley.edu/~brecht/paper/07.rah.rec.nips.pdf

https://stanford.edu/~jduchi/projects/SinhaDu16.pdf

Prerequisite Skills Statistics, Probability/Markov Chains, Numerical Analysis, Simulation, Predictive Modelling
Other Skills Used in the Project Database Queries
Programming Languages Python, C++
Work Environment About 25 hours of work are expected at our Cambridge office, 5 of which will be supervised. Strong candidates will be offered free trips to Hong Kong to pick up potentially another project to do during an internship or part-time.

 

Interpreting multiparametric/multimodal and kinetic data

Project Title Interpreting multiparametric/multimodal and kinetic data
Contact Name William Stebbeds
Contact Email william.x.stebbeds@gsk.com
Company/Lab/Department GlaxoSmithKline (GSK)
Address Gunnels Wood Road, Stevenage
Period of the Project Flexible, preferred start early July for 8-10 weeks
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest February 21
Background Information  
Brief Description of the Project

Before going to market, each drug is tested in a variety of different experiments to confirm that it is both effective and safe. Due to improving lab technologies and automation, some of these experiments produce hundreds or thousands of output parameters.

However, most of the decisions made with these data are still based on only one of the parameters, and the rest are wasted. This is partly due to out-of-date data workflows and partly due to the difficulty of analysing data with too many features to take into account when making a clear decision.

Various dimensionality reduction, clustering, classification and regression algorithms are available and used by scientists in bespoke scripts in order to explore and understand data like this. However, in the realm of drug safety, every decision made with data has to be entirely explicable and reproducible, and the decision has to be made in the same way for every drug passed through the analysis. This can make bespoke code and changing training sets for supervised algorithms a data integrity liability.

The aim of this project is to design data analysis pipelines to predict the effect of potential drugs in humans based on the multiparametric, multimodal and kinetic data, using a set of compounds with known effects in humans and a large number of data points. The student will be expected both to design the analysis for the data using an appropriate method and to build a robust pipeline to be used by scientists to analyse their data, in a way that minimises data integrity risks.
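The "explicable and reproducible" requirement above can be illustrated with a deliberately simple sketch: a nearest-centroid classifier whose reference centroids are frozen, so every compound is scored by exactly the same steps. The feature values and class labels are invented placeholders, not GSK data or methods:

```python
# Illustrative fixed, auditable decision pipeline: a nearest-centroid
# classifier whose "training" output (the centroids) is computed once
# from the reference compound set and then version-controlled, so the
# same input always yields the same decision. All values are invented.
from math import dist

CENTROIDS = {
    "safe":  (0.2, 0.1, 0.3),
    "toxic": (0.8, 0.9, 0.7),
}

def classify(features):
    """Assign the class whose frozen centroid is nearest in feature space."""
    return min(CENTROIDS, key=lambda label: dist(features, CENTROIDS[label]))

print(classify((0.25, 0.2, 0.3)))  # safe
print(classify((0.9, 0.8, 0.6)))   # toxic
```

In a real pipeline the dimensionality reduction and classification steps would be far richer, but the same principle applies: fix the trained artefacts and the processing steps so the decision rule is identical for every drug passed through the analysis.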

A stretch goal will be to design an “app” that integrates the analysis with an interface using Tibco Spotfire.

References  
Prerequisite Skills Statistics, Database Queries, Data Visualization
Other Skills Used in the Project Image processing, Predictive Modelling, App Building
Programming Languages Python, R
Work Environment The student will be embedded in our SPMB department and will be connected to other CMP and industrial placement students. The student will also be co-supervised by a data scientist with a maths PhD

 

Card gaming AI

Project Title Card gaming AI
Contact Name Sammi Lai
Contact Email info@dreams-ai.com
Company/Lab/Department Dreams-AI
Address 30 Meade House, 2 Mill Park Rd, Cambridge, CB1 2FG, United Kingdom
Period of the Project 8 weeks but flexible starting and ending date
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest 1st March
Background Information Building an AI that can compete with humans at a popular Chinese card game.
Brief Description of the Project

A popular Chinese card game requires 4 players, each holding 13 of the 52 cards. The goal of the game is to arrange the 13 cards into three sets of 3, 5 and 5 cards. Each set is then compared with the corresponding sets of the other players, and the best set in each group wins.

In this project, we want the student to investigate one or more of the following questions:

1. Performance vs computational complexity of a hard decision logic based AI

2. Performance vs computational complexity of a deep reinforcement learning based AI

3. How accurately can we predict our chances of winning based on the information that is already revealed?
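To give a sense of the search space facing either the decision-logic or reinforcement-learning approach above, the raw number of ways to split a 13-card hand into sets of 3, 5 and 5 can be counted directly (ignoring any ordering or legality constraints the actual game may impose):

```python
# Back-of-envelope size of the arrangement space for one 13-card hand,
# using only the standard library.
from math import comb

# Choose 3 of 13 cards for the first set, then 5 of the remaining 10;
# the last 5 cards form the final set automatically.
arrangements = comb(13, 3) * comb(10, 5)
print(arrangements)  # 72072
```

A hard decision-logic AI must rank roughly this many candidate arrangements per hand, which is small enough for exhaustive evaluation but large enough that the scoring heuristic dominates the computational cost.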

References  
Prerequisite Skills Probability/Markov Chains, Simulation
Other Skills Used in the Project Statistics, Data Visualization
Programming Languages Python, C++, rust
Work Environment The project supervisor will provide 5 of the 30 hours of working time at the office in Cambridge. A good student will also be offered a free trip to Hong Kong to take on more maths projects.

 

Understanding and prediction of process failure using time-series data

Project Title Understanding and prediction of process failure using time-series data
Contact Name Georgina Armstrong
Contact Email georgina.x.armstrong@gsk.com
Company/Lab/Department GSK, Biopharm Process Research
Address GSK, Gunnels Wood Road, Stevenage SG1 2NY
Period of the Project 8-10 weeks
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest February 21
Background Information  
Brief Description of the Project

What is the problem?

During development of biomanufacturing processes, a large number of experiments is conducted in which equipment and processes are tested to failure. During these experiments, time-series data are generated from different sensors, such as pressure, UV absorbance and conductivity sensors. The aim of this project is to determine whether analysis of these time-series data can lead to a better understanding of the root causes of failure and, furthermore, whether such failures can be predicted and pre-emptively mitigated.

What mathematical techniques will you get to use?

This project will require the use of supervised and unsupervised machine learning/data analysis techniques. The choice of specific mathematical techniques will be evaluated during the placement.
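As a flavour of the unsupervised end of this spectrum, one of the simplest approaches is to flag sensor readings whose rolling z-score exceeds a threshold. The trace, window and threshold below are invented for illustration; real bioprocess data would need far more careful treatment:

```python
# Minimal illustrative anomaly detector: flag points that deviate from
# the mean of the preceding `window` readings by more than `threshold`
# standard deviations. All numbers are made up.
from statistics import mean, stdev

def rolling_zscore_flags(series, window=5, threshold=3.0):
    """Return indices of points that look anomalous relative to
    the trailing window of readings."""
    flags = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu, sigma = mean(hist), stdev(hist)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            flags.append(i)
    return flags

pressure = [1.0, 1.1, 0.9, 1.0, 1.05, 1.02, 5.0, 1.0]  # spike at index 6
print(rolling_zscore_flags(pressure))  # [6]
```

Detecting a spike after the fact is the easy half of the problem; the project's harder question is whether precursors of such failures can be learned and acted on before they occur.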

What else will you learn?

You will learn about the bioprocessing industry and development of new medicines. You will gain experience working with real world datasets, and how the outputs of your research will tangibly benefit the delivery of GSK’s medicines. The skills and techniques you learn here will be applicable to a wide array of other problems you may face in your future career.

References  
Prerequisite Skills Ability to work both independently and as part of a wider team.
Other Skills Used in the Project  
Programming Languages Python, MATLAB
Work Environment

Who will you work with?

You will be working closely with members of Biopharm Process Research (BPR) group within GSK.

What support will you have?

You will have direct support from your line manager plus the wider team. The department operates an open environment where students are encouraged to discuss and share their ideas, and best efforts are made to get you up to speed with the business-specific information. Furthermore, an open-door policy for discussions and troubleshooting with all members of the team is always in operation.

 

Automated Real-time Global Events Data Collator and Persister

Project Title Automated Real-time Global Events Data Collator and Persister
Contact Name Usman Khan
Contact Email usman@apexe3.com
Company/Lab/Department APEX:3E
Address  
Period of the Project  
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest February 21
Background Information  
Brief Description of the Project

What impact does a hurricane hitting the American Eastern coastline have on insurance stocks? Or how did the victory of Boris Johnson affect FX rates and banking stocks?

Such events are stored as timelines, and our users can backtest trading strategies around these timelines, e.g. buy GBP/EUR and GBP/USD one month before the election and sell two months after it.

The idea is to create an automated real-time global events data collator and persister, where the following types of events are consumed from publicly available sources on the internet, then classified, tagged by sentiment and persisted to a database or index for future querying:

1. Geo Political - e.g. US Trade war timeline, Brexit timeline, Oil tanker issues, Trump tweets

2. Financial - e.g. Companies earnings report timeline, key company announcements, IPOs, mergers & acquisitions

3. Sports - e.g. European / US / Asian soccer/baseball/cricket/Formula 1 teams that are listed on stock exchanges or have sponsors that are listed on stock exchanges. Example events include premier league match results, Formula 1 race wins/losses

4. Extreme Environmental Events - e.g. hurricanes, earthquakes, tsunamis, droughts, landslides

5. Entertainment - Film releases, Game launches, Stadium events like music concerts and boxing matches, Music releases

The desired outcome of this project is a functioning microservice that automatically collates and persists events as described above, with scope for further extension. Bonus points will be awarded if the microservice can also detect fake news/tweets/content.
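The persist-then-query core of such a service can be sketched with the standard library alone. The schema, field names and sample events below are assumptions for illustration, not a specification from the project:

```python
# Minimal sketch of the collate-and-persist idea: tagged events stored
# in SQLite for later timeline queries. Schema and events are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE events (
    category TEXT, happened_at TEXT, headline TEXT, sentiment TEXT)""")

def persist(category, happened_at, headline, sentiment):
    conn.execute("INSERT INTO events VALUES (?, ?, ?, ?)",
                 (category, happened_at, headline, sentiment))

persist("geo-political", "2019-12-13", "UK general election result", "neutral")
persist("environmental", "2019-09-01", "Hurricane Dorian landfall", "negative")

# Timeline query: all events in a window, oldest first.
rows = conn.execute("""SELECT happened_at, headline FROM events
    WHERE happened_at BETWEEN '2019-01-01' AND '2019-12-31'
    ORDER BY happened_at""").fetchall()
print(rows)
```

A production version would replace the in-memory database with a durable store or search index, and the hard parts of the project — ingesting heterogeneous sources, classification and sentiment tagging — sit upstream of this persistence layer.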

References  
Prerequisite Skills  
Other Skills Used in the Project  
Programming Languages  
Work Environment  

 

Deep Learning of Causality from Batch Effects in Genomic Data

Project Title Deep Learning of Causality from Batch Effects in Genomic Data
Contact Name Jeremy England / Kalin Vetsigian
Contact Email jeremy.l.england@gsk.com
Company/Lab/Department GSK
Address 200 Cambridge Park Drive, Cambridge, MA 02140, USA
Period of the Project 8-12 weeks
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest 31st March 2020
Background Information  
Brief Description of the Project Biological data vary in innumerable ways from one experiment to another, even in cases where they are collected under putatively identical conditions. Some of these “batch effects” reflect meaningful differences in cellular state, while others derive from noise or systematic biases in how different sets of experiments are performed. Typically, batch effects are considered a nuisance that needs to be factored out. But it is possible that collecting information about a system under slightly different circumstances will, in fact, improve our ability to tease out robust causal relationships among its parts. The student will work with omics-scale biological data and develop a deep neural net for causal inference (e.g. gene regulation) that is specifically designed to benefit from batch effects.
References  
Prerequisite Skills Statistics, Probability/Markov Chains, Numerical Analysis
Other Skills Used in the Project Pytorch, deep learning, statistical mechanics, systems and molecular biology
Programming Languages Python
Work Environment Part of local and remote teams, 8 hours per day flexi, on site and at home

Hedge Fund - Strategy & Risk Modelling

Project Title Hedge Fund - Strategy & Risk Modelling
Contact Name Nick Greenwood
Contact Email nick.greenwood@havencove.com
Company/Lab/Department Haven Cove SICAV plc
Address 29 Farm St, London W1J 5RL
Period of the Project 8 weeks
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest Feb 21st
Background Information It is a 2-pronged project: 1) Develop statistical arbitrage strategy for the fund, based on back-testing & optimisation; 2) Develop in-house risk modelling tools for credit derivatives portfolio.
Brief Description of the Project For 1), this would be the creative part of the project: coming up with ideas that we think have statistical significance and testing them. A successful outcome would be to find a successful strategy to implement in a live trading environment. For 2), this would be the analytical part of the project: we have some existing strategies and would like to enhance our risk management models to monitor them. A successful outcome would be the development of tools/models to monitor the risks and P&L and see how they change on a daily basis.
References For statistical arbitrage, Ernest Chan books are good (Algorithmic Trading: Winning Strategies and their Rationale). For Credit Derivatives modelling, Dominic O'Kane's book (Modelling Single Name & Multi-name Credit Derivatives) is an excellent background.
Prerequisite Skills Statistics, Probability/Markov Chains, Financial Models; Excel/VBA
Other Skills Used in the Project Predictive Modelling, Data Visualization
Programming Languages No Preference
Work Environment Fine to work remotely / come into the office, as you wish! The main contact point would be myself, but other staff from the Fund would look at the project too.

Optimal forecast combination for bond portfolios

Project Title Optimal forecast combination for bond portfolios
Contact Name Ralph Smith
Contact Email rsmith@bluecove.com
Company/Lab/Department BlueCove Ltd
Address 10 New Burlington Street, LONDON W1S 3BE
Period of the Project approx 8 weeks
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest  
Background Information BlueCove is a new, scientifically-driven asset management firm. One of our key areas of focus is constructing corporate bond portfolios to maximise alpha (return in excess of systematic factors) in order to add value for clients. This project would extend our understanding of the optimal solution to this problem via deriving mathematical solutions under specific assumptions.
Brief Description of the Project

To determine an optimal portfolio of holdings of stocks or bonds, we need to produce forecasts for each asset’s return. Typically this is done by constructing a number of signals, each of which is translated into an expected return forecast (or alpha) for the asset. Our topic of interest is how to optimally combine these multiple forecasts to guide our portfolio construction. More specifically, we are interested in linear forecast combination, i.e. a weighted average of the signals.

A useful initial model for this problem is to treat the signals as a random vector, where each element represents a return forecast for a bond. This random vector forecasts future expected returns, which form another random vector that follows a multivariate normal distribution. We would then seek to answer the following questions:

  • Which objective functions can be constructed to capture the optimal forecast combination problem in this setting and how is the weight vector determined?
  • How can we derive similar results when we know more about the dynamics of the random vectors e.g. that in addition, the signals follow a vector autoregressive process?
  • How do our optimal forecast combinations change when signal forecasts are predictive not just one time period ahead but further into the future?
  • How does our signal combination weight change when we care about predicting (a) a simple average of a set of future return vectors, or (b) a weighting, e.g. an exponential decay, of future return vectors?
  • What happens when we have constraints on the signal weights e.g. upper and lower bounds?

Finally, most of the above concerns statistical significance, but as investors we are interested in economic significance. Economic significance can be formalised by a utility function e.g. quadratic utility which is parametrized by the holdings vector, expected returns and covariance of returns.

  1.  We are interested in cases when statistical significance aligns with economic significance
  2.  How do constraints on holdings change optimal forecast combination?
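As a toy instance of the linear combination problem above (much simpler than the project's actual setup), the classic two-forecast case has a closed form: for two unbiased forecasts with error variances s1, s2 and error covariance c, the variance-minimising weights summing to one are given below. The numbers are invented for demonstration:

```python
# Two-signal forecast combination (the classic Bates-Granger result):
# weights that minimise the variance of the combined forecast error,
# subject to the weights summing to 1. Example variances are made up.

def two_signal_weights(s1, s2, c):
    """Minimum-error-variance combination weights for two unbiased forecasts
    with error variances s1, s2 and error covariance c."""
    w1 = (s2 - c) / (s1 + s2 - 2 * c)
    return w1, 1.0 - w1

w = two_signal_weights(s1=0.04, s2=0.09, c=0.01)
print(w)  # the less noisy signal receives the larger weight
```

The project's questions extend this in the directions listed above: richer objective functions, vector-autoregressive signal dynamics, multi-horizon forecasts and weight constraints, where closed forms like this one generally no longer apply.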
References

To give an idea of approaches to similar questions, please see the below links. (Note that our project specification will be significantly different to the below papers’ setup, but it should give a broad flavour of some of the ideas involved):

Gerard, X., Guido, R., & Wesselius, P. (2012). Integrated Alpha Modeling (SSRN Scholarly Paper No. ID 1978156). Retrieved from Social Science Research Network website: https://papers.ssrn.com/abstract=1978156

Qian, E., Sorensen, E.H., Hua, R., 2007. Information Horizon, Portfolio Turnover, and Optimal Alpha Models. The Journal of Portfolio Management 34, 27–40. https://doi.org/10.3905/jpm.2007.698030

Kakushadze, Zura, Combining Alphas via Bounded Regression (October 22, 2015). Risks 3(4) (2015) 474-490. Available at SSRN: https://ssrn.com/abstract=2550335 or http://dx.doi.org/10.2139/ssrn.2550335

Kakushadze, Zura and Yu, Willie, Decoding Stock Market with Quant Alphas (April 25, 2017). Journal of Asset Management 19(1) (2018) 38-48. Available at SSRN: https://ssrn.com/abstract=2965224 or http://dx.doi.org/10.2139/ssrn.2965224

Prerequisite Skills Statistics, Probability/Markov Chains
Other Skills Used in the Project Mathematical finance, Portfolio theory
Programming Languages Python, MATLAB, R
Work Environment The student will be working closely with our Research team. In the initial stages this would be in our office. Once the project is in progress, we envisage this shifting to 1-2 days a week in our office, with the remainder working remotely. This can be adapted as necessary.

Automated identification and analysis of world-changing early stage investment opportunities

Project Title Automated identification and analysis of world-changing early stage investment opportunities
Contact Name Peter Dolan, Professor Richard Samworth
Contact Email info@ahren.co.uk, rjs57@hermes.cam.ac.uk
Company/Lab/Department Ahren Innovation Capital
Address  
Period of the Project 8-10 weeks, as agreed
Project Open to Undergraduates, Master's (Part III) students
Initial Deadline to register interest  
Background Information Public stock markets are designed so that information is democratic and publicly listed companies (i.e. that are listed on stock markets such as FTSE, S&P 500) are legally required to make sure all information that could be relevant for an investor to make an investment decision is available in the public domain. However private markets do not have this requirement, so information is much harder to access, and the markets are much less efficient. In early stage companies (“start-ups”) especially, potential investments are spread around private networks, with certain companies purposely staying secret and closed to the public domain. As investors, we see advantage in designing novel solutions to arbitrage these inefficient markets by creating tools that allow us to systematically search for opportunities of interest.
Brief Description of the Project

The goal of the project is to create an application that will allow an early-stage investment fund to source exciting, ground-breaking “under the radar” science and/or technology start-ups that will change the world for the better. The project will make use of application programming interfaces and web-scraping techniques to build a database.

The project will focus on bringing cutting-edge statistical/mathematical techniques to bear on these data sets to derive insight into the best potential opportunities to invest in. The application should be able to identify great entrepreneurs who may have sold prior companies at high valuations to large corporations, and have functionality to sort by sector, stage, age of company, team and valuation if applicable.

Example inputs :

In order to search and derive insights on which companies would make good investments, a data set must be built. The database should be kept up to date and relevant using application programming interfaces and web-scraping techniques. There are many sources of data, and a good project will use a range of sources to generate a deep database from which smart insights can be derived. A non-exhaustive list of possible sources is below:

  •  Date of incorporation, company filings (Companies House for UK)
  • Relevant sector (CB Insights, Pitchbook, job adverts)
  • Employee type / growth (LinkedIn)
  • Patent applications (Espacenet) or licences
  • Github / (specialised blog) activity
  • Founder / individual blog posts on topic within sector
  • Company / Founder social media posts/following
  • Founder / individual publications/conference posters: quantity (PubMed or equivalent) and quality (journal impact rating, paper awards)
  • News articles (e.g. using news scraping tool such as Factiva – less relevant as if they are starting to gain traction in the news, we may be too late)

Output:

The output is to identify the best opportunities that others might not have heard of. In terms of functionality, the application will need to be able to sort backwards, with as much information/data as possible on each opportunity (including any prior rounds/valuations/financials if relevant). It should also be able to identify great entrepreneurs who might have sold prior companies to a large corporate or tech giant and be back “on the market” to start a new company. It should sort by sector, stage, age of company, team and valuation if applicable.
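Once the database exists, the sort-and-filter functionality described above is straightforward. The company records and field names below are entirely invented placeholders, used only to show the shape of the feature:

```python
# Hypothetical sketch of the sort/filter functionality: company records
# (all fields invented) ordered by sector, youngest company first within
# each sector, using only standard-library tools.
companies = [
    {"name": "AcmeBio",  "sector": "biotech",  "age_years": 2, "stage": "seed"},
    {"name": "DeepFab",  "sector": "hardware", "age_years": 5, "stage": "series A"},
    {"name": "GeneLeap", "sector": "biotech",  "age_years": 1, "stage": "seed"},
]

ranked = sorted(companies, key=lambda c: (c["sector"], c["age_years"]))
print([c["name"] for c in ranked])  # ['GeneLeap', 'AcmeBio', 'DeepFab']
```

The interesting work in the project lies not here but in populating and scoring these records from the data sources listed above.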

References  
Prerequisite Skills Statistics, Database Queries
Other Skills Used in the Project Data Visualization, Machine Learning
Programming Languages No Preference
Work Environment Exact details to be decided - Normal office hours. Project will be able to interact with investment team but will be expected to work autonomously in the main.