# Summer Research in Mathematics

This is a list of projects hosted by the Cambridge Mathematics Placements site. Projects should be aimed at students working in the Faculty of Mathematics. To remove a listing, please contact Arti Sheth Thorne.

Projects that have been filled or withdrawn are indicated with strikethrough text. Please do not contact the host in question if a project has been filled or withdrawn.

## Spectral estimation for irregularly-sampled complex-valued time series

| Project Title | Spectral estimation for irregularly-sampled complex-valued time series |
| --- | --- |
| Contact Name | Keith Briggs |
| Contact Email | keith.briggs@bt.com |
| Company/Lab/Department | BT Labs Wireless Research |
| Address | Adastral Park, Martlesham Heath, Ipswich IP5 3RE |
| Period of the Project | 8 weeks |
| Project Open to | Undergraduates, Master's (Part III) students |
| Initial Deadline to register interest | Friday 26th February 2021 |
| Background Information | In the modelling of 5G radio channels, we take measurements of complex time series (which are channel matrix elements), but the measurement process unavoidably takes these samples at irregular (but known) times. We wish to explore methods for estimating the power spectral density (PSD) of such data and to understand their sampling properties better. The overall application area is the better estimation of channel matrices, in order to improve the performance of 5G radio systems. The channels are described by matrices because the systems are MIMO (multi-in multi-out), effectively using a vector channel. Also of interest is estimation of the autocorrelation of these matrices, and PSD has been viewed as a step towards this. The whole topic fits well under the harmonic analysis heading in the CMI mission statement. |
| Brief Description of the Project | To estimate power spectral density from irregularly-sampled complex data, we are currently using a kind of generalized Lomb-Scargle periodogram (LSP). However, the theory and sampling properties of this estimator are not well understood. Appropriate theoretical background is available in Percival & Walden, Spectral analysis for univariate time series (CUP 2020), p.528ff. This project could tackle one or more of these items: (1) The LSP can be viewed as a generalized discrete Fourier transform (DFT), in other words a matrix-vector product in which each row-column dot product is the projection of the data onto a basis vector of the model. In the LSP the matrix elements do not have as many nice properties as those of the DFT matrix. We can speak of the Lomb-Scargle Transform (LST), of which the LSP is simply the modulus squared. (2) Check that the standard theory for DFT estimation properties (e.g. for AR(n) processes) still holds for the LST. This is almost certainly the case, and this step may be trivial. (3) Given that the input times are fixed, is it possible to make sense of the idea of an optimal set of output frequencies? (4) Examine tapering methods (as used for the DFT) for the LST case, to determine the possible improvements in estimation accuracy. (5) Can we get estimates of autocorrelation from the LST? This would be very useful in practice. (6) (Probably hard): there is no known concept of a Fast Fourier Transform (FFT) for the LST. Is anything possible in this direction? A special case would be of interest, and some sort of FFT may be possible: let us allow two (and only two) time intervals between measurements in the time series data. This would be approximately satisfied by our data. (The underlying mathematics for the usual FFT involves decompositions of finite groups, so help from an interested group theorist would be needed.) |
| Keywords | Spectral estimation, irregular sampling, complex-valued, time series |
| References | Percival & Walden, Spectral analysis for univariate time series (CUP 2020), p.528ff. |
| Prerequisite Skills | Statistics, Probability/Markov Chains, Simulation |
| Other Skills Used in the Project | Statistics, Probability/Markov Chains, Simulation |
| Programming Languages | Python, C++ |
| Work Environment | Mostly working with me, with a wider team available if needed. Flexible hours, on-site preferred but remote possible. |
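
The matrix-vector view in item 1 of the project description can be illustrated with a plain non-uniform DFT. This is only a sketch under stated assumptions: it is not the generalized Lomb-Scargle estimator used at BT, and the function names, the test signal and the frequency grid are all hypothetical.

```python
import cmath
import math
import random


def nonuniform_dft(times, samples, freqs):
    """Project complex samples taken at irregular times onto exp(2*pi*i*f*t).

    Each output is one row-column dot product of the matrix
    A[k][j] = exp(-2*pi*i*freqs[k]*times[j]) with the data vector.
    """
    return [
        sum(x * cmath.exp(-2j * math.pi * f * t) for t, x in zip(times, samples))
        for f in freqs
    ]


def periodogram(times, samples, freqs):
    """Modulus squared of the transform, normalised by the sample count."""
    n = len(times)
    return [abs(c) ** 2 / n for c in nonuniform_dft(times, samples, freqs)]


# Hypothetical test signal: one complex exponential at f0, irregular known times.
rng = random.Random(0)
n_samples, f0 = 200, 1.5
times = sorted(rng.uniform(0.0, 10.0) for _ in range(n_samples))
samples = [cmath.exp(2j * math.pi * f0 * t) for t in times]
freqs = [k * 0.05 for k in range(-100, 101)]  # assumed trial output frequencies

psd = periodogram(times, samples, freqs)
peak_freq = freqs[max(range(len(freqs)), key=psd.__getitem__)]
print(f"periodogram peaks at f = {peak_freq:.2f}")
```

Unlike the regular-grid DFT, the rows of this matrix are not orthogonal in general, which is one reason the choice of output frequencies (item 3) is not obvious.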

## Advanced image analytics for drug discovery

| Project Title | Advanced image analytics for drug discovery |
| --- | --- |
| Contact Name | Sara Schmidt |
| Contact Email | sara.x.schmidt@gsk.com |
| Company/Lab/Department | GSK |
| Address | Gunnels Wood Road, Stevenage SG1 2NY |
| Period of the Project | 8-10 weeks |
| Project Open to | Master's (Part III) students |
| Initial Deadline to register interest | Friday 26th February 2021 |
| Background Information | GSK is a FTSE 100, science-led, global healthcare business currently ranked as the leading pharmaceutical company in the UK. Our research is focused on immunology efforts including small molecules, biologics and vaccines. Never has it been more important for us to reduce the time it takes to get from identification of a potential therapeutic to a marketed medicine; that is the main focus of this project. At GSK we have created a world-leading data and computational environment to enable large-scale scientific experiments that exploit GSK's unique access to data. Our focus is on bringing data, analytics, and science together into solutions for our scientists to develop medicines for patients. A key enabler of this effort is the ability to extract knowledge from imaging data. A specific challenge for the early phase of drug discovery programmes is to sort potential drug molecules into those with a desired effect on the intended drug target and those with an undesirable effect, e.g. toxicity. One way to achieve this goal is by developing advanced image analytics algorithms, where image sets of cells in the presence of molecules with known undesirable mechanisms are used to define image signatures, the so-called "ground truth". Thereafter the algorithm is applied to unknown compounds to allow us to focus on compounds that are free from potential liabilities, thereby reducing drug failure rates and speeding up the often lengthy and costly drug discovery process. |
| Brief Description of the Project | We are looking for a student with a keen interest in image processing and computer vision who can use our in-house generated image stacks of cells from early drug discovery programmes, and associated training sets, to develop image analytics algorithms that enable compound mechanism classification. The project will involve both improving existing image analytics algorithms and developing new ones in open source packages (e.g. Python, CellProfiler and ilastik). Once a suitable algorithm has been chosen and validated, the student should develop a robust pipeline that scientists can use to analyse their own data at scale, in a way that minimises data integrity risks. |
| Keywords | Image processing, Computer vision, Machine Learning, Bioimaging, Pharmaceutical industry |
| References | Bray MA et al. Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes. Nat Protoc. 2016 Sep;11(9):1757-74. doi: 10.1038/nprot.2016.105. Epub 2016 Aug 25. PMID: 27560178; PMCID: PMC5223290. |
| Prerequisite Skills | Image processing |
| Other Skills Used in the Project | Statistics, Data Visualization |
| Programming Languages | Python, R |
| Work Environment | Fully embedded into a scientific department and part of a wider team interacting with data scientists and imaging experts. |

## Meta-Analysis of Transcriptomics data at GSK

| Project Title | Meta-Analysis of Transcriptomics data at GSK |
| --- | --- |
| Contact Name | Giovanni Dall'Olio |
| Contact Email | giovanni.m.dallolio@gsk.com |
| Company/Lab/Department | GSK |
| Address | Gunnels Wood Road, Stevenage SG1 2NY |
| Period of the Project | 8 weeks |
| Project Open to | Master's (Part III) students |
| Initial Deadline to register interest | Friday 26th February 2021 |
| Background Information | In recent years GSK has invested in the creation of a Data Lake, an infrastructure where all the data generated in the company is stored and made available. This has many advantages, as the data is not scattered across data silos, and it is generated using a standardized and controlled process. One component of this Data Lake infrastructure is the pipeline for the sequencing of genomics and transcriptomics data (RNA-Seq and other technologies). We have built a process to generate and curate this data using standard tools and parameters, generating a high-quality dataset from experiments executed by different departments in the company. The scope of the research project will be to develop methods for meta-analysis of the genomics and transcriptomics data in this dataset, comparing experiments generated by different units and collaborators. |
| Brief Description of the Project | The student will explore methods for meta-analysis of sequencing data from different experiments. The desired outcome of the project will be a computational notebook or a script documenting recommendations for meta-analysis methods, taking into consideration existing literature, and using example data from our dataset. This project is relatively open-ended and the student will have space to explore different solutions, as well as working with a curated dataset. Knowledge of NGS is not required, although some preliminary understanding may be useful. Preferred programming languages would be R and Python. |
| Keywords | transcriptomics, genomics, RNA-seq, meta-analysis, statistics |
| References | Leek et al, Nat Rev Gen 2010. Tackling the Widespread and Critical Impact of Batch Effects in High-Throughput Data; Evans et al, Brief Bioinf 2018. Selecting Between-Sample RNA-Seq Normalization Methods From the Perspective of Their Assumptions; RPKM, FPKM and TPM clearly explained https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/ |
| Prerequisite Skills | Statistics |
| Other Skills Used in the Project | Database Queries, Data Visualization |
| Programming Languages | Python, R |
| Work Environment | Work remotely, in a team. |
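
As one possible starting point for the meta-analysis methods mentioned above, the sketch below combines per-experiment effect estimates with inverse-variance weights under a fixed-effect model. The listing does not prescribe this method; it is a standard textbook technique chosen here for illustration, and the gene-level numbers are invented.

```python
import math


def fixed_effect_meta(effects, std_errors):
    """Combine per-experiment effect estimates with inverse-variance weights.

    Returns the pooled effect and its standard error under a fixed-effect
    model (all experiments are assumed to estimate the same true effect).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


# Hypothetical log2 fold changes for one gene from three experiments.
log_fold_changes = [1.2, 0.8, 1.0]
standard_errors = [0.2, 0.4, 0.2]

pooled, pooled_se = fixed_effect_meta(log_fold_changes, standard_errors)
print(f"pooled log2FC = {pooled:.3f} +/- {pooled_se:.3f}")
```

A fixed-effect model ignores between-experiment heterogeneity such as batch effects, so in practice it would likely be compared against random-effects alternatives.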

## Verification of stress simulation model/software

| Project Title | Verification of stress simulation model/software |
| --- | --- |
| Contact Name | Artem Babayan |
| Contact Email | artem.babayan@silvaco.com |
| Company/Lab/Department | Silvaco Europe |
| Address | Silvaco Europe Ltd, Compass Point, St Ives PE27 3FJ |
| Period of the Project | 8 weeks, any time |
| Project Open to | Undergraduates, Master's (Part III) students |
| Initial Deadline to register interest | Friday 26th February 2021 |
| Background Information | |
| Brief Description of the Project | Silvaco develops Electronic Design Automation (EDA) and Technology CAD (TCAD) software. One of the modules is a tool for simulating stresses inside electronic devices (caused by bending, heating, internal stresses etc. during device manufacture). Your task would be to: (1) research the standard stress problems for which an analytical solution is available; (2) set up each problem within the Silvaco Stress Simulator; (3) compare the known analytical solution with the results obtained with the simulator; (4) optionally, set up a simple problem which can be solved numerically using alternative tools (e.g. Matlab) and compare these results with Silvaco. Depending on the results, the project may lead to an academic paper or conference publication. |
| Keywords | Stress simulation, Mathematical modelling, Model verification, Numerical analysis |
| References | |
| Prerequisite Skills | Mathematical physics, Numerical Analysis, PDEs |
| Other Skills Used in the Project | |
| Programming Languages | Python, MATLAB, C++ |
| Work Environment | The student will be placed in the Silvaco building in St Ives. There are ~15 people in the office. The student is expected to work independently, with advice available from the team. Communication with our US office may also be required. |
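
The verification workflow described above (pick a problem with a known analytical solution, solve it numerically, compare) can be sketched on a deliberately simple case. The example assumes a 1D elastic bar fixed at one end under a uniform axial load; it does not use Silvaco's simulator, and all names and parameter values are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal linear system."""
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


def bar_displacement_fd(length, stiffness, load, n):
    """Finite differences for stiffness * u'' = -load, u(0) = 0, u'(length) = 0."""
    h = length / n
    lower = [1.0] * n
    diag = [-2.0] * n
    upper = [1.0] * n
    rhs = [-load * h * h / stiffness] * n
    lower[0] = 0.0   # u(0) = 0 removes the coupling to node 0
    lower[-1] = 2.0  # ghost node u[n+1] = u[n-1] enforces u'(length) = 0
    upper[-1] = 0.0
    return [0.0] + solve_tridiagonal(lower, diag, upper, rhs)


def bar_displacement_exact(x, length, stiffness, load):
    """Analytical solution u(x) = load / stiffness * (length * x - x**2 / 2)."""
    return load / stiffness * (length * x - 0.5 * x * x)


# Hypothetical, unit-free test case.
length, stiffness, load, n = 2.0, 5.0, 3.0, 50
h = length / n
numeric = bar_displacement_fd(length, stiffness, load, n)
exact = [bar_displacement_exact(i * h, length, stiffness, load) for i in range(n + 1)]
max_error = max(abs(a - b) for a, b in zip(numeric, exact))
print(f"max |numeric - exact| = {max_error:.2e}")
```

Because the exact solution is quadratic, central differences reproduce it up to rounding error, which makes this a convenient sanity check before moving to problems where the discretisation error is non-trivial.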

## Low-rank matrix approximations within Kernel Methods

| Project Title | Low-rank matrix approximations within Kernel Methods |
| --- | --- |
| Contact Name | Zdravko Zhelev |
| Contact Email | application@dreams-ai.com |
| Company/Lab/Department | DreamsAI |
| Address | 30 Meade House, 2 Mill Park Rd, Cambridge CB1 2FG |
| Period of the Project | 8 weeks, flexible |
| Project Open to | Undergraduates, Master's (Part III) students |
| Initial Deadline to register interest | Friday 26th February 2021 |
| Background Information | In machine learning, we often employ kernel methods, rather than explicit projection, to learn more general relations in datasets while avoiding high computational cost. |
| Brief Description of the Project | The kernel trick often involves a matrix inversion or eigenvalue decomposition, whose cost is cubic in the number of training points. Due to the large storage and computational costs, this is impractical in large-scale learning problems. One approach to this problem is low-rank matrix approximation. The most popular examples are the Nyström method and random features. We would like the student to test the feasibility of these approximations on real data. |
| Keywords | machine learning, linear algebra, mathematical statistics |
| References | https://papers.nips.cc/paper/1866-using-the-nystrom-method-to-speed-up-k... https://people.eecs.berkeley.edu/~brecht/paper/07.rah.rec.nips.pdf https://stanford.edu/~jduchi/projects/SinhaDu16.pdf |
| Prerequisite Skills | Statistics, Numerical Analysis, Mathematical Analysis, Geometry/Topology, Predictive Modelling |
| Other Skills Used in the Project | Data Visualization |
| Programming Languages | Python, C++ |
| Work Environment | The project supervisor will provide 5 hours of supervision out of the 30 hours of working time at the office in Cambridge. A good student will also be offered a free trip to Hong Kong to take on more maths projects. |
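
To make the random-features idea concrete, the sketch below approximates a Gaussian (RBF) kernel value by an inner product of random cosine features. It is a minimal illustration, assuming the kernel k(x, y) = exp(-|x - y|^2 / (2 sigma^2)); the function names, the feature count and the test points are hypothetical choices, not taken from the listing.

```python
import math
import random


def rbf_kernel(x, y, sigma):
    """Exact Gaussian kernel exp(-|x - y|^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def make_random_features(dim, n_features, sigma, rng):
    """Draw frequencies w ~ N(0, I / sigma^2) and phases b ~ U[0, 2*pi)."""
    ws = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
          for _ in range(n_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(ws, bs)]

    return feature_map


rng = random.Random(0)
feature_map = make_random_features(dim=3, n_features=2000, sigma=1.0, rng=rng)

x = [0.2, -0.1, 0.4]
y = [0.5, 0.3, -0.2]
exact = rbf_kernel(x, y, sigma=1.0)
approx = sum(a * b for a, b in zip(feature_map(x), feature_map(y)))
print(f"exact = {exact:.4f}, random-feature estimate = {approx:.4f}")
```

Stacking the feature vectors of all training points into a matrix Z gives a kernel matrix approximation Z Zᵀ whose rank is at most the number of features, so linear-algebra costs scale with the feature count rather than cubically with the number of training points.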

## Prize pool and odds forecast

| Project Title | Prize pool and odds forecast |
| --- | --- |
| Contact Name | Zdravko Zhelev |
| Contact Email | application@dreams-ai.com |
| Company/Lab/Department | DreamsAI |
| Address | 30 Meade House, 2 Mill Park Rd, Cambridge CB1 2FG |
| Period of the Project | 8 weeks, flexible |
| Project Open to | Undergraduates, Master's (Part III) students |
| Initial Deadline to register interest | Friday 26th February 2021 |
| Background Information | |
| Brief Description of the Project | In a prize-pool-based betting game, the final odds returned on a bet are simply a function of the total amount bet by everybody divided by the total amount bet on the correct outcome. Therefore every time someone places a bet, the odds for every bet type change for everybody. Only after the deadline for bet placing can the odds be finalized. In theory, if you know the sizes of all the prize pools you can determine all the odds exactly, and vice versa. The challenge here is to consider the cases when we only know a subset of the odds or prize pool sizes: how much uncertainty would be introduced, and can we leverage the relationships between bet types to improve our predictions? |
| Keywords | combinatorics, probability, Markov chain Monte Carlo |
| References | |
| Prerequisite Skills | Statistics, Probability/Markov Chains, Numerical Analysis, Simulation, Predictive Modelling |
| Other Skills Used in the Project | Database Queries |
| Programming Languages | Python, C++ |
| Work Environment | There will be about 30 hours of work expected at our Cambridge office, 5 of which will be supervised. A strong candidate will be offered free trips to Hong Kong, potentially to pick up another project to do as an internship or part-time. |
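
The relationship between pool sizes and odds described above can be written down directly. The sketch assumes a single pool with no operator commission, which the listing does not specify; the outcome labels and stakes are invented.

```python
def pool_odds(stakes):
    """Decimal odds per outcome: total pool divided by the stake on that outcome."""
    total = sum(stakes.values())
    return {outcome: total / stake for outcome, stake in stakes.items()}


def pool_shares_from_odds(odds):
    """Invert the relation: each outcome's share of the pool is 1 / odds."""
    return {outcome: 1.0 / o for outcome, o in odds.items()}


# Hypothetical pool for a three-outcome event (no commission assumed).
stakes = {"A": 500.0, "B": 300.0, "C": 200.0}
odds = pool_odds(stakes)
shares = pool_shares_from_odds(odds)

for outcome in stakes:
    print(f"{outcome}: odds {odds[outcome]:.3f}, pool share {shares[outcome]:.3f}")
```

Under these assumptions the odds fix each outcome's share of the pool, while recovering absolute stakes also requires the total pool size. With only a subset of the odds known, the unknown shares are constrained to sum to whatever remains, which is one place the uncertainty question in the project enters.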

## Card Gaming AI

| Project Title | Card Gaming AI |
| --- | --- |
| Contact Name | Zdravko Zhelev |
| Contact Email | application@dreams-ai.com |
| Company/Lab/Department | DreamsAI |
| Address | 30 Meade House, 2 Mill Park Rd, Cambridge CB1 2FG |
| Period of the Project | 8 weeks, flexible |
| Project Open to | Undergraduates, Master's (Part III) students |
| Initial Deadline to register interest | Friday 26th February 2021 |
| Background Information | Building an AI that can compete with humans at a popular Chinese card game. |
| Brief Description of the Project | A popular Chinese card game requires 4 players, each with 13 of the 52 cards. The goal of the game is to arrange the 13 cards in 3 sets of 3, 5 and 5. Each set is then compared with the corresponding sets belonging to the other players, and the best set in each group wins. In this project, we want the student to investigate one or more of the following questions: (1) Performance vs computational complexity of an AI based on hard decision logic. (2) Performance vs computational complexity of an AI based on deep reinforcement learning. (3) How accurately can we predict our chances of winning based on the information that is already revealed? |
| Keywords | combinatorics, probability, neural network, simulations |
| References | |
| Prerequisite Skills | Probability/Markov Chains, Simulation |
| Other Skills Used in the Project | Statistics, Data Visualization |
| Programming Languages | Python, C++, Rust |
| Work Environment | The project supervisor will provide 5 hours of supervision out of the 30 hours of working time at the office in Cambridge. A good student will also be offered a free trip to Hong Kong to take on more maths projects. |
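
A first handle on the computational complexity mentioned in question 1 is the size of one player's decision space. The sketch below counts the ways to split a 13-card hand into ordered sets of 3, 5 and 5, both in closed form and by brute-force enumeration. It assumes the three sets are distinguishable positions and ignores any game-specific ordering rules, which the listing does not state.

```python
from itertools import combinations
from math import comb


def count_arrangements(hand_size=13, front=3, middle=5):
    """Closed-form count of splits into ordered sets of 3, 5 and 5 cards."""
    return comb(hand_size, front) * comb(hand_size - front, middle)


def enumerate_arrangements(hand):
    """Yield every (front, middle, back) split of the hand by brute force."""
    cards = set(hand)
    for front in combinations(sorted(cards), 3):
        rest = cards - set(front)
        for middle in combinations(sorted(rest), 5):
            back = tuple(sorted(rest - set(middle)))
            yield front, middle, back


hand = list(range(13))  # placeholder card identifiers 0..12
n_closed_form = count_arrangements()
n_enumerated = sum(1 for _ in enumerate_arrangements(hand))
print(n_closed_form, n_enumerated)  # prints "72072 72072"
```

With only 72072 candidate arrangements per hand, exhaustive scoring of one player's options is cheap; the harder part is evaluating each arrangement against the unknown hands of the other three players.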

## Analytical solutions for use of varistors in superconducting magnet quench protection

| Project Title | Analytical solutions for use of varistors in superconducting magnet quench protection |
| --- | --- |
| Contact Name | Dr Andrew Varney, Consultant Magnet Engineer |
| Contact Email | andrew.varney@oxinst.com |
| Company/Lab/Department | Oxford Instruments NanoScience |
| Address | Tubney Woods, Abingdon, Oxon OX13 5QX |
| Period of the Project | Up to 8 weeks between June and September |
| Project Open to | Undergraduates, Master's (Part III) students |
| Initial Deadline to register interest | Friday 26th February 2021 |
| Background Information | Typical high field superconducting magnets can have a magnetic stored energy of around 10 MJ or more, which is 5 times the kinetic energy of a 4 tonne HGV travelling at 70 mph. In the superconducting circuit, as there is no resistance, the current flows without energy loss. However, an extremely small disturbance, say 10 μJ, can lead to the superconducting magnet windings becoming resistive locally. This leads to a chain reaction, with the whole magnet becoming resistive and the stored energy dissipating as heat in the magnet windings over a time scale of a few seconds. This is known as a magnet quench. There are various schemes to protect a superconducting magnet against the effects of the rapid stored energy dissipation in a quench. The main goal is to prevent too much of the energy from being dumped locally, which can produce a hotspot within the magnet windings. Typically, a protection circuit will consist of a potential divider network with resistors and diodes to manage currents and voltages within individual parts of the magnet. It will often also include secondary heaters to spread the quench across other parts of the magnet faster than the passive quench propagation would proceed. Oxford Instruments has recently proposed a novel quench protection scheme involving varistors, which is the subject of a patent application (not yet published). A varistor is an electrical device which exhibits a non-linear voltage vs current relationship. Specifically, at low voltage a varistor has a relatively high electrical resistance, which decreases with increasing voltage. Modelling of the quench behaviour and some experimental work have shown that such components could be useful in a particular configuration to improve the quench protection for high field magnets. Analytical equations can be derived based on the underlying physics using reasonable approximations. However, even for the simplest case of a homogeneous coil divided into two sections and protected using conventional linear resistors, the result describing the propagation of a quench through the magnet coils is a second-order non-linear ODE for which only approximate solutions can be found with some further assumptions. It is not clear how to find solutions of the ODE representing the generalised case with variable resistance in the protection circuits. Although numerical simulations for this system could be developed, analytic descriptions of how varistors would respond in a quench protection circuit will be invaluable in providing insight into the behaviour of the system over a wide range of parameters. This will also enable additional functionality in existing in-house software without requiring a great deal of computational resource. |
| Brief Description of the Project | The primary goal of the project is to find approximate analytical solutions to the equation representing the magnet quench propagation in the simplest case of the protection circuit subdividing the magnet into two sections, but generalised to allow for the use of varistors. The varistor behaviour may be represented as a simplified equation, but it may be possible to extend the treatment to allow for a more accurate representation. The use of numerical solutions to guide and test the search for approximate analytical solutions would be appropriate, and such solutions would still be useful should analytical solutions prove to be too elusive. This project would advance Oxford Instruments' understanding and modelling of varistors for use in protecting superconducting magnets. It is intended that it would thus support development of their practical use in the manner described in our patent application by helping in the selection of materials parameters required for a real magnet. The implementation at Oxford Instruments is likely to be in two ways: via an analytical tool to make initial estimates and by using the equations in our in-house quench modelling code. If the project were particularly successful, an extension goal would be for the student to start working on these tools. An academic outcome would be at least one published paper, possibly in a mathematical physics journal, but more likely a magnet/physics one. |
| Keywords | Varistor, Superconducting magnet, Protection circuit, Quench, Analytical solution |
| References | https://www.oxinst.com/news/a-new-era-in-high-field-superconducting-magn... Martin Wilson, Superconducting Magnets (OUP, 1983), especially chapter 9 |
| Prerequisite Skills | Mathematical physics, Numerical Analysis, Mathematical Analysis |
| Other Skills Used in the Project | Simulation, Predictive Modelling |
| Programming Languages | No Preference, FORTRAN would be ideal |
| Work Environment | The student would be part of the R&D / technology development team, which consists mostly of doctoral-qualified physicists, for the duration of the project. An experienced mathematical physicist working in another group will also be available for consultation. The status of remote working depends on progress of the current pandemic, but there is likely to be at least an element of this. Ideally, the student would be able to work in the office/factory part of the time in order to meet people and to see the products to which the work relates. Oxford Instruments' normal working hours are 37 hours per week (including early Friday finish). |
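
The idea of using numerical solutions to test candidate analytical ones can be shown on a toy circuit. The sketch below is not the quench propagation equation from the listing, which is not given there. It assumes a single inductor discharging through a varistor with a power-law characteristic V = c I^β (0 < β < 1), for which separation of variables gives a closed form, and checks that form against a Runge-Kutta integration. All parameter values are invented.

```python
def varistor_voltage(current, coeff, beta):
    """Power-law varistor model V = coeff * I**beta (an assumed, simplified form)."""
    return coeff * max(current, 0.0) ** beta


def discharge_rk4(i0, inductance, coeff, beta, t_end, steps):
    """Integrate L dI/dt = -V(I) with classical fourth-order Runge-Kutta."""
    def rate(i):
        return -varistor_voltage(i, coeff, beta) / inductance

    dt = t_end / steps
    current = i0
    for _ in range(steps):
        k1 = rate(current)
        k2 = rate(current + 0.5 * dt * k1)
        k3 = rate(current + 0.5 * dt * k2)
        k4 = rate(current + dt * k3)
        current += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return current


def discharge_exact(i0, inductance, coeff, beta, t):
    """Closed form from separation of variables, valid for 0 < beta < 1."""
    base = i0 ** (1.0 - beta) - (1.0 - beta) * coeff * t / inductance
    return max(base, 0.0) ** (1.0 / (1.0 - beta))


# Hypothetical, unit-free parameters.
i0, inductance, coeff, beta, t_end = 4.0, 1.0, 1.0, 0.5, 2.0
numeric = discharge_rk4(i0, inductance, coeff, beta, t_end, steps=1000)
exact = discharge_exact(i0, inductance, coeff, beta, t_end)
print(f"numeric = {numeric:.8f}, closed form = {exact:.8f}")
```

With β = 0.5 the closed form reduces to I(t) = (√I₀ − ct/(2L))², so the current reaches zero in finite time rather than decaying exponentially as it would through a linear resistor.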

## Modelling and Numerical Simulation of Stress Dependent Oxidation of Silicon

| Project Title | Modelling and Numerical Simulation of Stress Dependent Oxidation of Silicon |
| --- | --- |
| Contact Name | Vasily Suvorov |
| Contact Email | vasily.suvorov@silvaco.com |
| Company/Lab/Department | Silvaco Europe, Technology Computer-Aided Design (TCAD) Department |
| Address | Compass Point, St Ives, Cambridgeshire, PE27 5JL |
| Period of the Project | 8 weeks between July and September |
| Project Open to | Undergraduates, Master's (Part III) students |
| Initial Deadline to register interest | Friday 26th February 2021 |
| Background Information | The fabrication of integrated circuit microelectronic structures and devices vitally depends on the thermal oxidation process of silicon. The project aims to analyse the mathematical models of this process and construct effective numerical algorithms to explore the effects of various modelling assumptions. A successful outcome of the project will become a part of the company's commercial software. |
| Brief Description of the Project | Thermal oxidation of silicon is a way to produce a thin layer of oxide on the surface of a wafer in the fabrication of microelectronic structures and devices. The technique forces oxygen to diffuse into the silicon wafer at high temperature and react with it to form a layer of silicon dioxide: Si + O2 -> SiO2. The oxide layers are used for the formation of gate dielectrics and device isolation regions. With decreasing device dimensions, precise control of oxide thickness becomes increasingly important. In 1965 Bruce Deal and Andrew Grove proposed an analytical model that satisfactorily describes the growth of an oxide layer on the plane surface of a silicon wafer [1]. Despite the successes of the model, it does not explain the retarded oxidation rate of non-planar, curved silicon surfaces. The real cause of the observed retardation behaviour is believed to be the effect of viscous stress on the oxidation rate [2-3]. In this project, we aim to explore the existing mathematical models of stress-dependent oxidation and propose a numerical scheme to obtain the solution. The approach that we will use is a combination of analytical and numerical analyses of a system of non-linear ordinary differential equations. The student is expected to implement the numerical algorithms in C++, although no previous experience of C++ coding is required. Silvaco's own software products may also be used as a tool in this project if required. |
| Keywords | Oxidation, TCAD, Mathematical modelling, Numerical Algorithms, C++ coding |
| References | [1] B.E.Deal, A.S.Grove (1965), General relationship for the thermal oxidation of silicon, Journal of Applied Physics, Vol.36, N12, 3770-3778. [2] D.B.Kao, J.P.McVittie, W.D.Nix, K.C.Saraswat (1988), Two-dimensional thermal oxidation of Silicon - I. Experiment, IEEE Transactions on Electron Devices, Vol. ED-34, N 5, 1008-1017. [3] D.B.Kao, J.P.McVittie, W.D.Nix, K.C.Saraswat (1988), Two-dimensional thermal oxidation of Silicon - II. Modeling stress Effects in Wet Oxides, IEEE Transactions on Electron Devices, Vol. ED-35, N 1, 1008-1017. |
| Prerequisite Skills | Mathematical physics, Numerical Analysis, Mathematical Analysis, Simulation |
| Other Skills Used in the Project | |
| Programming Languages | C++, None Required, Interest in C++ coding |
| Work Environment | The student will work independently, with support and guidance from the supervisor. |
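
For the planar case covered by the Deal-Grove model [1], the oxide thickness x satisfies x² + A x = B (t + τ), equivalently dx/dt = B / (2x + A). The sketch below evaluates the closed form and checks it against a simple numerical integration, as a baseline before any stress dependence is added. The parameter values are illustrative placeholders (thickness in micrometres, time in hours), not fitted data, and the function names are hypothetical. Python is used here for brevity even though the project itself targets C++.

```python
import math


def oxide_thickness(t, a, b, tau):
    """Closed-form Deal-Grove thickness solving x**2 + a*x = b*(t + tau)."""
    return 0.5 * a * (math.sqrt(1.0 + 4.0 * b * (t + tau) / a ** 2) - 1.0)


def oxide_thickness_numeric(t_end, a, b, tau, steps=10000):
    """Integrate dx/dt = b / (2*x + a) with the explicit midpoint rule."""
    x = oxide_thickness(0.0, a, b, tau)  # same initial thickness as the closed form
    dt = t_end / steps
    for _ in range(steps):
        x_mid = x + 0.5 * dt * b / (2.0 * x + a)
        x += dt * b / (2.0 * x_mid + a)
    return x


# Illustrative placeholder parameters (assumed, not taken from the listing).
a, b, tau, t_end = 0.165, 0.0117, 0.37, 2.0
x_exact = oxide_thickness(t_end, a, b, tau)
x_numeric = oxide_thickness_numeric(t_end, a, b, tau)
print(f"closed form = {x_exact:.6f}, numeric = {x_numeric:.6f}")
```

The two limits of the closed form are the familiar ones: linear growth x ≈ (B/A)(t + τ) for thin oxides and parabolic growth x ≈ √(B t) for thick ones. A stress-dependent model would typically make the rate coefficients functions of the local stress, at which point the closed form is lost and a numerical scheme is needed.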

## Deep representation learning for health records: identifying patients with similar interactions with health services

| Project Title | Deep representation learning for health records: identifying patients with similar interactions with health services |
| --- | --- |
| Contact Name | Steve Kiddle |
| Contact Email | steven.kiddle@astrazeneca.com |
| Company/Lab/Department | AstraZeneca, Biopharmaceuticals R&D, Data Science and AI |
| Address | Academy House, 136 Hills Road, Cambridge CB2 8PA |
| Period of the Project | 8 weeks |
| Project Open to | Master's (Part III) students |
| Initial Deadline to register interest | Friday 26th February 2021 |
| Background Information | Because people have been living healthier and longer lives, they are often living with more than one health condition, referred to in the scientific research setting as living with "multimorbidity." However, current NHS guidelines provided to doctors and nurses are organised around patients having only a single condition, ignoring the fact that many, especially the elderly, live with multimorbidity. It's important to better understand how to identify and group patients with multimorbidity in a meaningful way, so that doctors and nurses can provide the best possible personalised care. |
| Brief Description of the Project | The aim of the study is to use "deep learning" (a form of artificial intelligence) to determine whether patients who fall within a particular "multimorbidity" subgroup are in greater need of healthcare services in future (e.g., more frequent doctor visits, prescriptions, hospitalisations, etc). The MSc student will contribute to the creation of a "proof of concept" for the above study question that will be used to help inform future decision making and planning of next steps on the project. The student would have an opportunity to: learn about and apply deep learning and artificial intelligence; apply these techniques to a real-world database (e.g., MIMIC or CPRD); interpret the outputs of the analysis in a meaningful way to support scientific decision-making at AstraZeneca. The student would split their time between the above project and working on other "live" projects running within the Data Science and AI team, giving them an opportunity to work on the wide variety of tasks that a data scientist typically faces during a normal working day. |
| Keywords | Multimorbidity, deep learning, neural networks, artificial intelligence, healthcare data, health data science |
| References | Landi, I., Glicksberg, B. S., Lee, H. C., Cherng, S., Landi, G., Danieletto, M., Dudley, J. T., Furlanello, C., & Miotto, R. Deep representation learning of electronic health records to unlock patient stratification at scale. npj Digit. Med. 3, 96 (2020). |
| Prerequisite Skills | Statistics, Mathematical Analysis, Algebra/Number Theory |
| Other Skills Used in the Project | Image processing, Predictive Modelling, Database Queries |
| Programming Languages | Python, R |
| Work Environment | Virtual or face-to-face, depending on the Covid situation |

## Analytical Solution for Multi-Barrier Release, Mechanically Link Diffusion to In-vitro Release

| Project Title | Analytical Solution for Multi-Barrier Release, Mechanically Link Diffusion to In-vitro Release |
| --- | --- |
| Contact Name | Weimin Li |
| Contact Email | weimin.li1@astrazeneca.com |
| Company/Lab/Department | AstraZeneca |
| Address | The Pavilion, Granta Park, Great Abington, Cambridge CB21 6GP |
| Period of the Project | 8 weeks |
| Project Open to | Master's (Part III) students |
| Initial Deadline to register interest | Friday 26th February 2021 |
| Background Information | Extended release of drug molecules from their carriers is one of the approaches to improving patient compatibility, for example by reducing dose frequency. This project focuses on the case in which diffusion is believed to be the main mechanism of drug release. As formulations become more complex, with multi-layer designs, a high level of mathematical ability is required to assemble the differential equations arising from Fick's law of diffusion. |
| Brief Description of the Project | Weeks 1-2: Introduction to the background and reading papers. Practice writing simple, executable scripts. Weeks 3-4: Work on the analytical solutions from Elliot J. Carr and Giuseppe Pontrelli that solve release from multi-layer spheres. Weeks 5-8: Bring empty spheres into the calculation, and fit to existing data to estimate the diffusion coefficient and the impact of the amount of empty spheres. |
| Keywords | Fick's law of diffusion. Differential equations. Analytical and numerical solutions. |
| References | |
| Prerequisite Skills | Mathematical physics, Numerical Analysis, Mathematical Analysis, Algebra/Number Theory |
| Other Skills Used in the Project | Simulation, Predictive Modelling |
| Programming Languages | Python, MATLAB, C++ |
| Work Environment | Mostly work from home |
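
A simple script of the kind suggested for weeks 1-2 could evaluate the classical series solution of Fick's second law for a single homogeneous sphere releasing into a perfect sink. This is the textbook single-layer case, not the multi-layer solution of Carr and Pontrelli, and the non-dimensional parameter values and function name are hypothetical.

```python
import math


def fraction_released(t, diffusivity, radius, n_terms=200):
    """Fractional release from a uniformly loaded sphere into a perfect sink.

    Truncated series solution of Fick's second law:
    M_t / M_inf = 1 - (6 / pi^2) * sum_{n>=1} exp(-D n^2 pi^2 t / R^2) / n^2
    """
    decay = diffusivity * math.pi ** 2 * t / radius ** 2
    series = sum(math.exp(-decay * n * n) / (n * n) for n in range(1, n_terms + 1))
    return 1.0 - 6.0 / math.pi ** 2 * series


# Hypothetical non-dimensional case: D = 1, R = 1, so t is in units of R^2 / D.
diffusivity, radius = 1.0, 1.0
times = [0.0, 0.01, 0.05, 0.1, 0.5]
release = [fraction_released(t, diffusivity, radius) for t in times]

for t, frac in zip(times, release):
    print(f"t = {t:4.2f}  released fraction = {frac:.4f}")
```

The series converges slowly near t = 0, so the truncated sum leaves a small positive offset there. Fitting the diffusion coefficient to measured release curves would amount to choosing the value of D that minimises the mismatch between this curve and the data.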

## Multi-scale modeling to enable data-driven biomarker and target discovery

| Project Title | Multi-scale modeling to enable data-driven biomarker and target discovery |
| --- | --- |
| Contact Name | Dr Shameer Khader |
| Contact Email | shameer.khader@astrazeneca.com |
| Company/Lab/Department | AstraZeneca, Data Science and Artificial Intelligence |
| Address | Academy House, 136 Hills Road, Cambridge CB2 8PA |
| Period of the Project | 8 weeks |
| Project Open to | Undergraduates, Master's (Part III) students |
| Initial Deadline to register interest | Friday 26th February 2021 |
| Background Information | Metagenomic sequencing of clinical samples has improved our understanding of how dysbiosis of microbial flora influences various human diseases. Emerging studies have shown that several microbial signatures are explicitly altered in the setting of immunological, cardiovascular, or gastrointestinal disorders, etc. Microbiome signatures, identified in the context of a disease and integrated with other types of molecular profiling data (genome, microbiome, transcriptome, metabolome, etc., collectively called multi-omics data), are gaining relevance in drug discovery. Such a data set offers opportunities to understand the specific functional pathways and metabolic reactions mediated by host-pathogen interactions in various diseases. Multi-omics is an emerging theme in drug discovery. It provides an unprecedented view into the molecular players driving conditions and enables a path to discover new targets and therapies shortly. |
| Brief Description of the Project | AstraZeneca is investing in this exciting and vital area of drug development to generate unique multi-omics data sets to accelerate the development of novel therapies. Several projects are currently in progress to integrate microbiome data with heterogeneous data sets (imaging, multi-omics, clinical, in-vivo disease models, etc.) using quantitative approaches. Collectively, such a system could lead to new targets and unique signatures correlated with human diseases. The collaborative study of altered microbial taxa/species and the corresponding clinical phenotype, by compiling a large and diverse data set, will be an essential step toward understanding microbes' role in disease comorbidities. To achieve this goal, we collaborate with Microbial Sciences across a portfolio of projects that span multiple disease modalities. The student will develop multi-scale models capable of integrating multi-omics data with clinical and imaging data using modern machine intelligence methods. The incoming candidate will be part of the Special Projects and Research Team. The team is currently working on a portfolio of projects with a common goal of accelerating drug or target discovery using machine intelligence methods. We aim to cross-train the incoming student in drug discovery, precision medicine, multi-scale biology, and data science. We expect the student to leverage high-performance computing and biomedical informatics facilities in AZ to develop data-driven methods to analyze large multi-scale, multi-omics data sets. The student will be part of collaborative efforts across microbial science, artificial intelligence, and drug development. The unique collaborative nature of the project will improve hands-on skills in clinical data, biomedical data analytics, and data science. The incoming student will contribute to the design, development, and deployment of predictive models that help organize, analyze and interpret these data. The student can also gain experience by working closely with the Microbial Sciences clinical development team. |
| Keywords | Drug Discovery, Data Science, Machine Learning, Bioinformatics, Precision Medicine |
| References | https://pubmed.ncbi.nlm.nih.gov/28892060/ https://pubmed.ncbi.nlm.nih.gov/31126891/ |
| Prerequisite Skills | Statistics, Probability/Markov Chains, Image processing, Predictive Modelling, Database Queries |
| Other Skills Used in the Project | Probability/Markov Chains, Predictive Modelling, Data Visualization, App Building |
| Programming Languages | Python, R, No Preference |
| Work Environment | 9-5 at AZ campus or remote (depends on COVID restrictions) |

## TrialGraph: Machine Intelligence Enabled Insight from Graph Modeling of Clinical Trials

 Project Title TrialGraph: Machine Intelligence Enabled Insight from Graph Modeling of Clinical Trials Contact Name Dr Shameer Khader Contact Email shameer.khader@astrazeneca.com Company/Lab/Department AstraZeneca, Data Science & Artificial Intelligence, Special Projects & Research Address Academy House, 136 Hills Road, Cambridge CB2 8PA Period of the Project 8 weeks Project Open to Undergraduates, Master's (Part III) students Initial Deadline to register interest Friday 26th February 2021 Background Information One of the major impediments to successful drug development is the complexity, cost and scale of clinical trials, particularly large Phase III trials. Despite a wealth of historical data, clinical trial sponsors typically have a difficult time fully leveraging historical trial data to drive insight into optimal clinical trial design, reducing trial cost and scale. Many barriers exist to leveraging this data including drift in clinical terms and procedure over time, differences in trial structure and differences in data sampled. Recent advances in machine learning in areas such as Natural Language Processing (NLP) and graph modeling of complex data have enabled rapid advances in a number of domains. The TrialGraph project seeks to apply these methodologies to clinical trial data, creating a unified graph model to represent clinical trials across phases and therapeutic areas. Such a data modeling approach would enable novel and powerful analytics that enable efficiencies in drug development and benefit to our patients. Brief Description of the Project Multiple graph modeling initiatives are running in parallel and this project will leverage their infrastructure, graph modeling of external clinical and biomedical data as well as expertise. 
In collaboration with this wider community, the TrialGraph project will seek to leverage these resources while developing novel graph representations of historical AZ trials, methodologies to analyze these graph representations that provide meaningful insight and experiment with other machine learning methodologies that could yield both novel discoveries and operational efficiencies. Expected Outcomes:

- Prototype graph data model applied to multiple clinical trials
- Graph analytics aimed at providing insight into clinical trial operations and outcome
- Improve clinical trial enrollment lifecycle

Keywords Graph modeling, Data integration, Data Science, Clinical Trials, Machine Learning References Prerequisite Skills Statistics, Probability/Markov Chains, Geometry/Topology, Predictive Modelling Other Skills Used in the Project Database Queries, Data Visualization, App Building Programming Languages Python, R Work Environment AstraZeneca Campus/Remote (depending on COVID situation)

## Network reconstruction from single cell transcriptomic data

 Project Title Network reconstruction from single cell transcriptomic data Contact Name Nil Turan Contact Email nil.c.turan-jurdzinski@gsk.com Company/Lab/Department GSK, Human Genetics Computational Biology Address Gunnels Wood Road, Stevenage, SG1 2NY, United Kingdom Period of the Project 8-10 weeks, flexible Project Open to Master's (Part III) students Initial Deadline to register interest Friday 26th February 2021 Background Information Currently, molecular interaction networks in the field, and those used to support numerous target identification and validation efforts within GSK, present a generalized network comprising interactions that may not exist within a specific cell-type. The ability to reconstruct and analyse cell-specific molecular interaction networks has the potential to improve our cell-specific understanding of molecular processes and directly inform on relevant assays or mechanisms driving a disease. Recent advances in single cell RNA-seq technology allow the transcriptome of individual cells to be assessed [1]. This brings a great opportunity to reconstruct cell-specific molecular interaction networks. Several methods have been implemented to build such networks [2-3] but a systematic evaluation of such methods is yet to be conducted. Brief Description of the Project The student will explore available methods to reconstruct networks from single cell RNA-seq data [2-3]. A background in statistics and mathematics is critical for reviewing these methods. They will then evaluate and test the performances of these different methods. Knowledge of single cell data is not required although some preliminary understanding will be useful. Preferred programming language would be R. Keywords Network inference, single cell transcriptomics, computational biology, statistics, R References [1] Current best practices in single-cell RNA-seq analysis: a tutorial. 
Mol Syst Biol (2019) 15:e8746 https://doi.org/10.15252/msb.20188746 [2] Simon Cabello-Aguilar, Mélissa Alame, Fabien Kon-Sun-Tack, Caroline Fau, Matthieu Lacroix, Jacques Colinge, SingleCellSignalR: inference of intercellular networks from single-cell transcriptomics, Nucleic Acids Research, Volume 48, Issue 10, 04 June 2020, Page e55, https://doi.org/10.1093/nar/gkaa183 [3] Efremova, M., Vento-Tormo, M., Teichmann, S.A. et al. CellPhoneDB: inferring cell–cell communication from combined expression of multi-subunit ligand–receptor complexes. Nat Protoc 15, 1484–1506 (2020). https://doi.org/10.1038/s41596-020-0292-x Prerequisite Skills Statistics Other Skills Used in the Project Programming Languages Python, R Work Environment The student will work closely with Human Genetics Computational Biology, Functional Genomics Computational Biology and the stats group. The student will have the opportunity to interact and discuss with experts in single cell seq technology and also network approaches.

## Algorithm development and modelling for security applications

 Project Title Algorithm development and modelling for security applications. Contact Name Sam Pollock Contact Email careers@iconal.com Company/Lab/Department Iconal Technology Address St John's Innovation Centre, Cowley Road, Cambridge CB4 0WS Period of the Project At least 8 weeks, June start Project Open to Undergraduates, Master's (Part III) students Initial Deadline to register interest We will start interviewing as soon as we receive applications, but applications up to the end of March will be considered if we haven't already filled the position. Background Information We are a Cambridge-based consultancy carrying out research and development in new and emerging technologies for security, offering independent, impartial, science-based advice. This will be our fourth year offering CMP placements, and we are looking for keen, innovative, self-motivated individuals who are interested in the practical application of maths to solve real-world problems. You will be working in a small friendly (we like to think) team of scientists and engineers, and contributing directly to the output of current projects. Brief Description of the Project Right now we do not know exactly what the student project will entail as we work in a very rapidly evolving field. This year's projects are likely to be focused around one or more of developing algorithms and machine learning solutions to analyse complex sensor data, building event-based simulations of security processes (including data collection and analysis from field observations) or helping with tests and trials of technology. Previous students have been exposed to all stages of the data pipeline / data science process. Our work is highly varied and interesting and you will likely get stuck in with all aspects of the job! 
Keywords Security, machine learning, algorithms References http://www.iconal.com Prerequisite Skills Statistics, Numerical Analysis, Image processing, Simulation, Data Visualization Other Skills Used in the Project Predictive Modelling, App Building Programming Languages Python, R, C++, Python preferred (as it's our main one), but can consider other languages if relevant Work Environment We are a small friendly team of 8 people, all working on a range of interesting diverse projects. The student will be based in our main office (or lab for data gathering) working on one or more projects with us, with a mentor on each project to help with queries, reviewing work and assigning tasks. This is of course subject to change should we still be under lockdown! We had a remote summer student in 2020, who worked virtually with the team.

## Developing an approach for biotherapeutic purity quantitation from analytical instrument signals

 Project Title Developing an approach for biotherapeutic purity quantitation from analytical instrument signals. Contact Name David Hilton Contact Email david.w.hilton@gsk.com Company/Lab/Department GSK, Biopharm Process Research Group Address Gunnels Wood Road, Stevenage, SG1 2NY Period of the Project 8 weeks Project Open to Undergraduates, Master's (Part III) students Initial Deadline to register interest Friday 26th February 2021 Background Information The Biopharm Process Research group is the first step on the route from newly discovered biotherapeutic drugs to a commercial product which can be administered to a patient. It is the group's responsibility to screen candidate molecules for their developability and process fit, and to identify a suitable commercial cell line for their production. A key requirement of the group during process development and candidate molecule screening is to characterize the chemical, physical or biological attributes of the molecule to assess its purity. This is a critical attribute, as the purity of a biopharmaceutical product will influence both the efficacy and the safety of the drug. Brief Description of the Project The analytical techniques used to characterize the purity of a biopharmaceutical drug often output a signal that is a composite of peaks associated with the product of interest and product-related impurities, along with signal noise, baseline deviations and instrument-associated drift. As a part of GSK's standard biopharm drug development activities, thousands of these instrument signals are generated within the department each month, and the automated peak identification methods that are currently employed cannot adequately and consistently quantify drug purity. This often necessitates high levels of time-consuming manual data processing. 
The aim of this project is to develop an optimal procedure for peak identification and purity determination, using techniques ranging from simple deconvolution to CNN and LSTM machine learning methods, with model performance benchmarked against our large departmental datasets. Should a successful strategy be developed, this could be incorporated into a tool for deployment to our data processing pipelines, thereby enabling more rapid and robust development of GSK's biopharm drug portfolio. Keywords Modelling, Visualization, Signals, Scripting, Pharmaceuticals References Prerequisite Skills Statistics, Predictive Modelling, Data Visualization Other Skills Used in the Project Database Queries Programming Languages Python, R Work Environment The student will be supervised during the project and, though working individually, will be involved in all departmental activities. Support from the Statistical Sciences group and Data Science teams will be available should this be required. Standard office hours will apply and remote working opportunities are available.
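
To make the peak-identification task concrete, here is a minimal sketch on a synthetic trace: a straight-line baseline is removed, local maxima above a height threshold are taken as peaks, neighbouring peaks are split at the valley minimum, and purity is reported as the main-peak area over the total area. The trace, the threshold and every function name are assumptions for illustration; this is not GSK's pipeline, and real signals with noise and curved drift would need smoothing and a better baseline model.

```python
import math

def gaussian(x, amp, centre, sigma):
    return amp * math.exp(-0.5 * ((x - centre) / sigma) ** 2)

def subtract_linear_baseline(x, y):
    """Remove the straight line joining the first and last samples."""
    slope = (y[-1] - y[0]) / (x[-1] - x[0])
    return [yi - (y[0] + slope * (xi - x[0])) for xi, yi in zip(x, y)]

def local_maxima(y, min_height):
    """Indices of local maxima that are at least min_height tall."""
    return [i for i in range(1, len(y) - 1)
            if y[i] >= min_height and y[i - 1] < y[i] >= y[i + 1]]

def trapezoid(x, y, lo, hi):
    """Trapezoidal integral of y between sample indices lo and hi."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i]) for i in range(lo, hi))

def peak_areas(x, y, min_height):
    """Area of each detected peak, splitting neighbours at the valley minimum."""
    peaks = local_maxima(y, min_height)
    cuts = [0]
    for a, b in zip(peaks, peaks[1:]):
        cuts.append(min(range(a, b + 1), key=lambda i: y[i]))
    cuts.append(len(y) - 1)
    return [trapezoid(x, y, lo, hi) for lo, hi in zip(cuts, cuts[1:])]

# Synthetic trace: drifting baseline + product peak + one smaller impurity peak.
x = [0.01 * i for i in range(1001)]
raw = [0.5 + 0.02 * xi
       + gaussian(xi, 10.0, 4.0, 0.3)   # product of interest (assumed shape)
       + gaussian(xi, 1.0, 6.5, 0.3)    # impurity (assumed shape)
       for xi in x]

corrected = subtract_linear_baseline(x, raw)
areas = peak_areas(x, corrected, min_height=0.5)
purity = max(areas) / sum(areas)
print(f"peaks found: {len(areas)}, main-peak purity: {purity:.3f}")
```

Because both synthetic peaks share the same width, the true area ratio is 10:1, which gives a simple check on the estimate.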

## Is Quantum Machine Learning mature for clinical applications?

 Project Title Is Quantum Machine Learning mature for clinical applications? Contact Name Domingo Salazar Contact Email domingo.salazar@astrazeneca.com Company/Lab/Department AstraZeneca Address City House, 130 Hills Road, Cambridge CB2 1RE Period of the Project 8 weeks between late June and September Project Open to Undergraduates, Master's (Part III) students Initial Deadline to register interest Friday 26th February 2021 Background Information Quantum Computing (QC), in general, and Quantum Machine Learning (QML), in particular, have made considerable progress in the last few years. It is now possible to formulate typical clinical data science projects like uncovering associations of adverse effects with medicines or subgroup identification as QML problems. But what would be the benefits of doing this at this moment in time? Is QML ready to start to be used regularly in Pharma? And if so, for which kind of projects may it provide advantages over classical computing? Brief Description of the Project We would like to formulate an open-ended project made up of two parts: * A literature review * A practical example. The literature review should provide a feeling for the state of the art in this area. In particular, it should point us towards the most promising current applications of QML to Pharma. The practical example should be chosen based on the results of the literature review. It will be scoped according to the time and QC resources available. Data sources may include publicly available clinical datasets, text, images and/or genomic sequences depending on the selected application. There are a number of QC providers in the marketplace at the moment but for this purpose as well as for the literature review, it would be very interesting if we could set up a 3-way collaboration between the Cambridge Math Department, the QC group in Cambridge and AstraZeneca. This relationship could then continue beyond this student project. 
Keywords Quantum Computing, Quantum Machine Learning, Pharma, Clinical, AI References * Quantum Machine Learning, Peter Wittek, Elsevier Insights (book) * Amazon Braket (https://aws.amazon.com/blogs/aws/amazon-braket-get-started-with-quantum-...) * Introduction to Quantum Computing with Python (https://pythonspot.com/an-introduction-to-building-quantum-computing-app...) Prerequisite Skills Statistics, Simulation, Machine Learning Other Skills Used in the Project Image processing, Predictive Modelling, Data Visualization Programming Languages Python, R, some QC languages like Q#, if the corresponding Python packages prove to be too limited for our purposes. Work Environment We like to integrate our students within our team so they experience what it means to do Data Science in a Pharma company. So the student will be able to talk to a number of data science specialists in our team as well as clinicians, biologists, bio-informaticians, image analysts, etc. as appropriate.

## Aggregating embeddings in deep unsupervised graph learning

 Project Title Aggregating embeddings in deep unsupervised graph learning Contact Name Khan Baykaner Contact Email khan.baykaner@astrazeneca.com Company/Lab/Department AstraZeneca, Deep Learning, AI Engineering, R&D IT Address Cambridge Road, Melbourn, Royston SG8 6EH Period of the Project 8-12 weeks Project Open to Master's (Part III) students Initial Deadline to register interest Friday 26th February 2021 Background Information The application of AI to digital pathology for drug development is a burgeoning field which promises to radically replace and enhance the existing analysis workflows that lead to biological insight. One area of interest is the analysis of multiplex immunofluorescence (mIF) imaging for oncology; by using multiplexed tissue staining one can acquire a rich set of data for investigating the tumour microenvironment. However, efficient methods for analysing this rich data are still in their infancy. One method of investigation is to build a graph mapped to the cells within the tissue, and then use unsupervised learning techniques on the graph to capture the structure of the information in embeddings. Brief Description of the Project This project will explore how elements of the unsupervised learning technique (e.g. the corruption function in deep graph infomax) affect the downstream performance of the trained embeddings, as well as techniques for aggregating embeddings in a spatially-aware manner. Depending on the area of focus, success would involve alterations to the mIF graph pipeline that allow embeddings to be combined across multiple samples in a consistent, spatially-aware manner without loss of relevant information. This in turn would be expected to dramatically improve the predictive power of downstream patient survival models. 
Keywords graphs, unsupervised learning, deep learning, AI, pathology References https://arxiv.org/pdf/1809.10341.pdf Prerequisite Skills python, deep learning Other Skills Used in the Project Data Visualization Programming Languages Python Work Environment Will collaborate with a small team of machine learning engineers. Whether work will be remote depends on the situation regarding the pandemic.
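
For orientation, the sketch below shows one common corruption choice for deep graph infomax, row-wise shuffling of node features with the edges left untouched, together with a plain mean readout. The toy feature vectors and both function names are hypothetical, and nothing here reflects the actual mIF pipeline. The point it illustrates is that a mean readout is permutation-invariant, so it cannot tell the original graph from the corrupted one, which is one reason spatially-aware aggregation is of interest.

```python
import random

def corrupt_features(features, rng):
    """Row-shuffle corruption: reassign feature vectors to nodes at random.

    The graph's edges are assumed to stay fixed, so each node keeps its
    neighbours but may receive another node's features.
    """
    shuffled = list(features)  # copy so the input is not mutated
    rng.shuffle(shuffled)
    return shuffled

def mean_readout(vectors):
    """Aggregate per-node vectors into one summary vector by averaging."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Toy per-cell feature vectors (hypothetical marker intensities).
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 1.0]]
rng = random.Random(0)
corrupted = corrupt_features(features, rng)

print("original readout: ", mean_readout(features))
print("corrupted readout:", mean_readout(corrupted))
```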

## Predicting the pick-up weight of chocolate from real-time factory data

 Project Title Predicting the pick-up weight of chocolate from real-time factory data Contact Name Joe Donaldson Contact Email Joe.Donaldson@unilever.com Company/Lab/Department Unilever R&D Address Colworth Science Park, Sharnbrook, Bedford MK44 1LQ Period of the Project Flexible, minimum 8 weeks Project Open to Undergraduates, Master's (Part III) students Initial Deadline to register interest Friday 26th February 2021 Background Information Chocolate is an expensive ingredient that Unilever uses extensively within its ice cream portfolio in some of its most well-known brands like Magnum. To maintain the business viability of products, reduce waste, and maintain product quality and uniformity, the chocolate dosage, the so-called pick-up weight, needs to be well controlled. This parameter is a function not only of the properties of the chocolate variant and batch itself, but also the conditions under which it is processed in the factory. Therefore, an accurate adjustment of this parameter during product assembly is key, and the ability to predict and proactively manage possible deviations would offer significant quality improvements and savings. Brief Description of the Project This project will explore the feasibility of using sensor data to predict chocolate pick-up weight. The aim is to build upon our existing insights and harness the availability of this new data stream to construct a predictive hybrid model linking the data and the science of chocolate behaviour. Our end goal is a real time model suggesting simple adjustments to the operating parameters of the process line so factory operators can ensure the best possible chocolate-coated ice cream products make it into the hands of the consumer at a competitive price. 
Keywords Ice Cream, Chocolate, Modelling, Machine-Learning, Python References Prerequisite Skills Statistics, Predictive Modelling, Data Visualization Other Skills Used in the Project Statistics, Predictive Modelling, Data Visualization Programming Languages Python, MATLAB, R Work Environment Independent working but with regular support from the wider science and technology team. The student will work remotely and be expected to share progress/results with supervisor(s) in daily/bi-weekly calls.

## Solvers for Integer Quadratic Program ("IQP") problems related to allocating trades

 Project Title Solvers for Integer Quadratic Program ("IQP") problems related to allocating trades Contact Name Pierre Micottis Contact Email cambridge.recruitment@symmetryinvestments.com Company/Lab/Department Symmetry Investments, Quantitative Analytics Address 86 Jermyn Street, Fourth Floor, London SW1Y 6JD Period of the Project 8-12 weeks Project Open to Master's (Part III) students Initial Deadline to register interest Background Information We are looking for an intern to work in the Quantitative Analytics group at Symmetry Investments, a post-startup US \$7 billion alternative asset management company with around 220 people across multiple time zones and locations. Brief Description of the Project The main objective is to determine how the total quantity of a partially or fully-executed order should be allocated to a number of accounts of funds. It is typically preferable to do a single trade in the market and then allocate it, subject to a series of constraints. This type of problem has to be solved in such a way that each allocated trade "stands on its own", meaning that it could have been executed as such and satisfies constraints like minimum tradeable size, minimum position size, strategy-level implied ratios as close as possible to target ratios and so forth. So the solutions are typically expressed as a list of integers which minimize some objective function under constraints. Optimisations have to be done with respect to both the quantities allocated and the Volume Weighted Average Price (or "VWAP"). Keywords integer quadratic programming, algorithms, trade allocation References Prerequisite Skills Other Skills Used in the Project Programming, solvers, algorithms Programming Languages No Preference Work Environment The student will work in a team. There will be opportunities to talk about the project across several other teams.
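
A toy instance may help fix ideas. The sketch below allocates an executed quantity across three accounts by exhaustive search, minimising the sum of squared deviations from target quantities subject to a minimum tradeable size. The target weights, the minimum size and the function name are invented for illustration, VWAP is ignored, and brute force is only viable at this tiny scale; it stands in for a real IQP solver rather than describing the firm's method.

```python
from itertools import product

def allocate(total, targets, min_size):
    """Exhaustively search integer allocations that sum to `total`.

    Each account receives either nothing or at least `min_size` units.
    The objective is the sum of squared deviations from the target
    quantity `weight * total` of each account (a quadratic objective
    over integers, i.e. a very small IQP).
    """
    best, best_cost = None, float("inf")
    n = len(targets)
    # Choose the first n-1 quantities freely; the last one is forced by the sum.
    for head in product(range(total + 1), repeat=n - 1):
        last = total - sum(head)
        if last < 0:
            continue
        alloc = head + (last,)
        if any(0 < q < min_size for q in alloc):
            continue  # violates the minimum tradeable size
        cost = sum((q - w * total) ** 2 for q, w in zip(alloc, targets))
        if cost < best_cost:
            best, best_cost = alloc, cost
    return best, best_cost

# Hypothetical order: 10 units, target split 50/30/20, minimum size 3.
best, cost = allocate(10, (0.5, 0.3, 0.2), min_size=3)
print(best, cost)
```

The unconstrained ideal split (5, 3, 2) is infeasible here because 2 is below the minimum size, so the search has to trade off deviations between accounts.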

## Neural Network Model Calibration

 Project Title Neural Network Model Calibration Contact Name Nicolas Leprovost Contact Email nicolas.leprovost@bp.com Company/Lab/Department BP, Quantitative Analytics Address 20 Canada Square, London E14 5NJ Period of the Project 2 to 6 months starting in summer 2021 Project Open to Undergraduates, Master's (Part III) students Initial Deadline to register interest Wednesday 31st March 2021 Background Information To assist BP's ambition in the renewable space, it is essential to be able to model the joint evolution of power prices and renewables production. For example, modelling jointly the wind output (or equivalently the wind speed) and the electricity price is necessary to assess the cost of developing a wind farm project. This problem can be addressed by using Monte-Carlo simulations. In order to properly represent the dynamics of the underlying, one needs to have a robust calibration mechanism that mimics its statistical properties. Brief Description of the Project Recent developments in Machine Learning have shown that Deep Learning methods can be applied efficiently to calibrate an option pricing model. During this internship, we will focus on two approaches, namely the historical calibration [1] where model parameters are estimated from historical market data and the volatility surface calibration [2] where parameters are obtained by inverting the market implied volatility surface. Those two problems will involve the latest developments in machine learning, such as the use of the signature function [3] or the Swish activation function [4]. Keywords financial engineering, machine learning References [1] Stone H. Calibrating rough volatility models: a convolutional neural network approach. Quantitative Finance, 20(3):379–392, 2020 [2] Bayer C, Horvath B, Muguruza A, Stemper B, Tomas M. On deep calibration of (rough) stochastic volatility models. arXiv preprint arXiv:1908.08806, 2019. [3] Chevyrev I, Kormilitzin A. 
A primer on the signature method in machine learning. arXiv preprint arXiv:1603.03788, 2016 [4] Ramachandran P, Zoph B, Le QV. Swish: A self-gated activation function. arXiv preprint arXiv:1710.05941, 2017. Prerequisite Skills Statistics, Probability/Markov Chains Other Skills Used in the Project Simulation Programming Languages Python Work Environment Depending on regulations at the time, we hope you will be able to work in the office. You will be assigned a project supervisor and will take part in weekly team meetings.
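
The Swish activation cited as [4] is simple to state: the input multiplied by a sigmoid of the scaled input. The sketch below is a scalar, dependency-free version for orientation only; the `beta` default and the numerically stable `sigmoid` helper are choices made here, not details taken from the project description.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"swish({x:+.1f}) = {swish(x):+.4f}")
```

Unlike ReLU, the function is smooth and dips slightly below zero for moderately negative inputs before flattening out.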

## Segmenting duodenal biopsy images

 Project Title Segmenting duodenal biopsy images Contact Name Julian Gilbey Contact Email jdg18@cam.ac.uk Company/Lab/Department Lyzeum Ltd. / DAMTP Address jdg18@cam.ac.uk Period of the Project 8 weeks between late June and September Project Open to Undergraduates, Master's (Part III) students Initial Deadline to register interest Monday 29th March 2021 Background Information Coeliac disease is an autoimmune condition triggered by exposure to gluten (in wheat and other grains), and it can cause significant long-term harm if left untreated. Treatment is a lifelong gluten-free diet. This condition is estimated to affect about 1% of the UK population, but is very under-diagnosed; probably only 1 in 5 or 1 in 6 sufferers is aware that they have it. The gold standard for diagnosis is to perform a biopsy and to look for signs of the disease process on the tissue. This requires highly-trained pathologists to look at each biopsy and to assess it for disease. There is a shortage of pathologists in the UK, and there is often disagreement between pathologists on the diagnosis of individual tissue samples. The long-term aim of our work is to develop a method for obtaining a diagnosis from a tissue sample in an automated fashion, either to guide pathologists in their work or to save the need for a pathologist to look at every sample. Brief Description of the Project One of the challenging parts of this work is dealing with very large and varied microscope images and identifying the different small-scale and large-scale structures present. Some techniques have already been developed for this, but they are usually effective for only one scale. In our case, we need to use some large-scale information to inform the small-scale identification, and possibly vice-versa. The purpose of this summer project is to explore some of the existing state-of-the-art techniques and to see how they can be combined, adapted and/or developed for our needs. 
A successful outcome would be a tool for performing this identification. (Note that in the literature, this process is called "segmentation".) Keywords Deep learning, neural networks, image analysis, digital pathology, coeliac disease References - An introductory seminar on this work is available at: https://cambridgelectures.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=... and a related seminar on the biology of coeliac disease and a bioinformatics approach is here: https://cambridgelectures.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=... (These can be accessed from the cam.ac.uk domain or using a Raven account.) - For learning PyTorch, the fastai course (https://github.com/fastai/fastbook) is very helpful. - There are also many papers available on digital histopathology that are potentially relevant, and the Coeliac UK website gives more information about the condition. Prerequisite Skills Image processing, neural networks and deep learning; any other mathematical skills are also potentially useful. Other Skills Used in the Project Programming Languages Python, We are using PyTorch in our work; this can be learnt during the course of the project. Work Environment We are currently a small team (of 2 plus an MPhil student!) all working from home, and meet very regularly over Discord or Zoom. If the COVID-19 situation allows it, we might be able to meet in person in Cambridge or London on occasion as well, but there are no specified working hours or location for working. Note that you must have the right to work in the UK to be eligible for this project; you do not have to be currently based in the UK, though.