2021 CMP Academic Projects

List of all projects with keywords

Models and algorithms for DNA evolution and sequencing
Keywords: Molecular evolution, phylogenetics, sequencing, Markov chains, probability
Mathematical modelling of cancer immunotherapy
Keywords: Mathematical modelling, cancer immunotherapy, systems biology, Ordinary differential equation, numerical simulation
Optimisation algorithms, statistical models, and probabilistic methods on manifolds
Keywords: Differential geometry, optimisation, hidden Markov models, geometric learning, geometric statistics
Certifying real root isolation
Keywords: interactive theorem proving, computer algebra, root isolation, Isabelle, symbolic computing
Tomb mathematics: using probability and sets to count bodies in prehistoric burials
Keywords: archaeology, burials, megalithic tombs, sets, probability
3D Mean-Field Theory for Cytoskeleton Organisation in Plant Cells
Keywords: biology, modelling, coding, plants, cytoskeleton
Modelling virus evolution in age-structured populations
Keywords: COVID-19, evolution, dynamic models, epidemics
Deepmath
Keywords: deep learning, machine learning, interactive theorem proving, Isabelle/HOL, certified mathematics
Developing a method to identify highly similar short substrings of DNA sequence from a pool of longer, variably related DNA sequences
Keywords: Bioinformatics, DNA sequence, alignment, analytical pipeline
Optimal transport for light microscopy
Keywords: optimal transport, image processing, light microscopy

Models and algorithms for DNA evolution and sequencing

Project Title	Models and algorithms for DNA evolution and sequencing
Contact Name	Nicola De Maio
Contact Email	demaio@ebi.ac.uk
Company/Lab/Department	European Bioinformatics Institute (EMBL-EBI), Goldman group
Address	Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD
Period of the Project	About 10 weeks, but flexible
Project Open to	Undergraduates, Master's (Part III) students
Initial Deadline to register interest
Background Information	DNA is key to understanding life and evolution. In our group we use probabilistic models, often Markov Chains, to study DNA and its evolution. These models are useful for many applications, such as reconstructing the evolutionary histories of species or pathogens, and understanding the inner workings of life. We also work to improve how DNA information is collected, specifically, developing models and algorithms to improve the efficiency of, e.g., Oxford Nanopore Technology ("Nanopore") sequencing.
Brief Description of the Project	We can offer several projects, depending on the students' interests and skills. Other projects also may become available or may be discussed. 1) Pandemic Phylogenetics. Genomics has revolutionized epidemiology. DNA data from SARS-CoV-2 is used, for example, to track transmission of the virus, to develop vaccines, to understand the spread of the disease across human populations and between different animal hosts, and to understand how the virus is evolving and possibly adapting to humans. However, the vast genomic data availability for SARS-CoV-2 and future epidemics poses important challenges for current algorithms and mathematical models used in data analysis. For this reason, we want to scale up methods in phylogenetics, molecular evolution and genomic epidemiology to SARS-CoV-2 data and to future pandemics (see e.g. https://doi.org/10.1101/2020.09.26.314971). This project will require basic understanding of probabilistic models. Basic programming skills will also be important. 2) Detecting natural selection from DNA data using neural networks. Artificial intelligence is revolutionizing many aspects of bioinformatics, but so far applications to phylogenetics and molecular evolution have been of limited success. We are developing a new neural network approach (OmegaAI) to detect natural selection acting on genes which so far shows improved performance over traditional approaches based on maximum likelihood statistical inference. We want to extend the applications of our methods to include other evolutionary processes. To work on this project, you will need some experience programming neural networks and interest in working in molecular evolution. 3) Improving the efficiency of Nanopore sequencing. 
We recently developed an approach to improve the efficiency with which DNA data is gathered in Nanopore sequencing and we are currently experimenting further with this technology (https://www.biorxiv.org/content/10.1101/2020.02.07.938670v2). We would like to develop alternative versions of our model of DNA sequencing in Nanopore that would improve its long-term performance. The first part of this project would require an interest in mathematical modeling and some basic knowledge of probability theory; some very basic familiarity with concepts from information theory is a plus. In the second part of the project (depending on internship length), familiarity with Python programming would be a great advantage.
Keywords	Molecular evolution, phylogenetics, sequencing, Markov chains, probability
References	https://doi.org/10.1101/2020.09.26.314971 https://www.biorxiv.org/content/10.1101/2020.02.07.938670v2
Prerequisite Skills	Probability/Markov Chains, Most projects require familiarity with a programming language, often Python.
Other Skills Used in the Project	Statistics, Simulation, Predictive Modelling, Data Visualization
Programming Languages	Python, C++. Some projects may allow broader choice of programming language.
Work Environment	So far we expect remote working, but the situation might change. We expect normal working hours (about 40 per week) but we are flexible. The project will be in collaboration with the group leader (Nick Goldman), a senior postdoc in the group (Nicola De Maio), and, depending on the project, a PhD student and/or other interns in the group. Daily discussions are possible with the members of the group directly involved in the projects, and weekly discussions with the whole group.
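
As a rough illustration of the kind of Markov chain model of DNA evolution mentioned in the background above, the sketch below computes nucleotide substitution probabilities under the Jukes-Cantor (JC69) model. JC69 is chosen here only because it is the simplest textbook model; it is an assumption of this example and not necessarily what the group uses. The function name and the branch length value are illustrative.

```python
import math

NUCLEOTIDES = "ACGT"


def jc69_transition_matrix(d):
    """Return the 4x4 Jukes-Cantor transition matrix for a branch of length d.

    d is the expected number of substitutions per site along the branch.
    Entry [i][j] is the probability that nucleotide i is replaced by
    nucleotide j at the end of the branch.
    """
    decay = math.exp(-4.0 * d / 3.0)
    same = 0.25 + 0.75 * decay   # probability of observing the same base
    diff = 0.25 - 0.25 * decay   # probability of each specific other base
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


# Illustrative branch length: 0.1 expected substitutions per site.
P = jc69_transition_matrix(0.1)
for base, row in zip(NUCLEOTIDES, P):
    print(base, [round(x, 4) for x in row])
```

Each row sums to one, a branch of length zero gives the identity matrix, and very long branches approach the uniform distribution over the four nucleotides.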

Mathematical modelling of cancer immunotherapy

Project Title	Mathematical modelling of cancer immunotherapy
Contact Name	Rahuman Sheriff
Contact Email	sheriff@ebi.ac.uk
Company/Lab/Department	BioModels, European Bioinformatics Institute (EMBL-EBI)
Address	Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD
Period of the Project	8 weeks - 6 months
Project Open to	Undergraduates, Master's (Part III) students
Initial Deadline to register interest	Friday 26th February 2021
Background Information	As part of its normal function, the immune system detects and destroys abnormal cells and most likely prevents or curbs the growth of many cancers. Harnessing the immune system to treat chronic infectious diseases or cancer is a major goal of immunotherapy. Immunosuppression is the key mechanism through which cancer escapes immune targeting and involves a complex interplay between cancer and immune cells. Cancer immunotherapy involves targeting or using immune system components (including antibodies, cancer vaccines, T cells, etc.) to kill tumour cells and suppress cancer progression. The complex interactions between immune cells and cancer cells are mathematically modelled to study systems behaviour and develop new immune therapy strategies for cancer. Several models have been proposed in recent times and new models appear frequently (Arulraj and Barik, PLOS ONE 2018; Ruby Kim, SPORA 2018; Kaitao Li, JBC 2017). In this internship, you will encode the mathematical models (primarily Ordinary Differential Equation models) published in research articles, perform simulations to reproduce results and share them publicly through the BioModels repository (Malik-Sheriff et al. 2020, Nucleic Acids Research). Your code will potentially be reused by scientists across the world. You will also have an opportunity to contribute to and co-author the scientific manuscripts we write on this topic. Almost all of our interns in BioModels have been part of our scientific manuscripts (Malik-Sheriff et al. 2020, Nucleic Acids Research; Tiwari et al., bioRxiv). This internship will provide you with a strong background and skills in systems biology modelling. A couple of our interns who worked on this project earlier have secured PhD positions at Harvard and Oxford universities; the former was a Cambridge undergraduate.
Brief Description of the Project	BioModels (https://www.ebi.ac.uk/biomodels/) is one of the world's leading repositories of mathematical models of biological processes, hosted at EMBL-EBI. In the proposed curation internship, you will work with the curation team at BioModels and will learn to curate and annotate mathematical models from published scientific literature and deposit them to BioModels. You will further learn to use simulation software such as COPASI to reproduce the simulation results from the reference publication, and learn to encode models in standard modelling formats such as Systems Biology Markup Language (SBML). Specifically, you will work on targeted curation of literature-based models in immuno-oncology. We will focus on Ordinary Differential Equation (ODE) models. You will gain strong exposure to cancer immunology and cancer biology (in general) by participating in the targeted curation of these models. You will encode and perform numerical simulation of ODE models. You will learn to visualise simulation results. In the past, BioModels targeted curation of literature-based models on diabetes and neurodegenerative diseases has resulted in scientific publications: Lloret-Villas et al. 2017 and Ajmera et al. 2013, respectively. The cancer immunotherapy models you will curate under the supervision of the curation team will be published in the BioModels repository, disseminated to a broader scientific community and can potentially be summarised in a scientific publication. Thus, in this internship, in addition to professional growth, you will also contribute to the growth of the open access BioModels repository, which has around 20,000 distinct users per month.
Keywords	Mathematical modelling, cancer immunotherapy, systems biology, Ordinary differential equation, numerical simulation
References	[1] www.ebi.ac.uk/biomodels [2] https://academic.oup.com/nar/article/48/D1/D407/5614569 [3] https://www.biorxiv.org/content/10.1101/2020.08.07.239855v1 [4] https://www.cancer.gov/about-cancer/treatment/types/immunotherapy
Prerequisite Skills	Simulation, Predictive Modelling, Ordinary Differential Equations
Other Skills Used in the Project	Mathematical Analysis. A-level Biology knowledge is essential. Undergraduate-level Biology knowledge will be an advantage.
Programming Languages	No Preference, Knowledge in any of the above programming languages desirable
Work Environment	Although we work from 9:00 to 17:00, flexible working hours can be arranged. You will be part of the BioModels team, which includes curators and software engineers. You can interact with and learn from other team members and interns. You will have an opportunity to engage in broader activities at EMBL-EBI. Due to the current COVID-19 situation, our campus is currently closed and it is not clear how the situation will evolve, hence you might have to work remotely. We currently have an intern working remotely from a foreign country. So, despite working remotely, you will have the full support and supervision needed.

Optimisation algorithms, statistical models, and probabilistic methods on manifolds

Project Title	Optimisation algorithms, statistical models, and probabilistic methods on manifolds
Contact Name	Dr Cyrus Mostajeran
Contact Email	csm54@cam.ac.uk
Company/Lab/Department	Engineering Department
Address	Trumpington Street, Cambridge CB2 1PZ
Period of the Project	Any 8-week period from late June to early September.
Project Open to	Master's (Part III) students
Initial Deadline to register interest	Friday 26th February 2021
Background Information	Many scientific fields study data with an underlying structure that is non-Euclidean. Research interest in optimisation on manifolds and related geometric methods has gained considerable momentum in recent years, with successful applications to areas such as computer vision, robotics, medical imaging, and machine learning. Similarly, statistics and probabilistic methods on non-Euclidean spaces are of interest in numerous applications in information engineering where data is intrinsically manifold-valued. While many statistical concepts and techniques such as principal component analysis (PCA) and regression have been extended and widely applied to non-Euclidean data, more sophisticated techniques such as Markov Chain Monte Carlo (MCMC) and deep neural network models are in the relatively early days of their development within a non-Euclidean setting. The aim of the CMP will be to contribute to such efforts within the framework of one of a number of available projects on statistical models and probabilistic methods on manifolds or closely related optimisation algorithms.
Brief Description of the Project	A number of projects are available depending on the particular interests and strengths of the candidate. Many of the projects will focus on the space of symmetric positive definite (SPD) matrices, which is often viewed as a Riemannian symmetric space of nonpositive curvature. This particular manifold is of enormous practical interest due to the prevalence of covariance matrices as states or data points in numerous applications. A non-exhaustive list of potential projects is given below. 1. Conic geometric optimisation with extremal eigenvalues: Computing standard geometric objects such as distances, geodesics, matrix exponentials and logarithms in the standard Riemannian geometry on SPD matrices often amounts to the computation of the generalised eigenvalues of a pair of SPD matrices, which typically means a significant increase in computational complexity, particularly for larger matrices. An alternative and scalable approach to geometric computation on the cone of SPD matrices is based on the Thompson geometry of the space, which amounts to working with extremal eigenvalues. There are a number of high-impact open problems on the development of a geometric statistical framework rooted in extremal eigenvalue computations, which could form a self-contained project. 2. Proximal algorithms on manifolds: Proximal methods are standard tools for solving optimisation problems and are particularly effective for non-smooth, constrained, large-scale, or distributed problems involving large and high-dimensional datasets that have become a subject of widespread interest in recent years due to applications in machine learning and related fields. While there has been some research on developing proximal algorithms on manifolds, the current state of the theory is nowhere near as successful as it is in the traditional Euclidean setting. The development of an effective theory of proximal algorithms on manifolds is another possible CMP project. 3. 
Hidden Markov models and fields: A hidden Markov model is a statistical Markov model in which the system under consideration is assumed to be a Markov process with unobservable hidden states that drive another observable process. The objective is to learn about the hidden states from the available observations. Hidden Markov chains and fields play a major role in signal and image processing and are central to several approaches to image restoration and segmentation. While the literature on hidden Markov models with observations in Euclidean spaces is extensive, the corresponding theory with observations in non-Euclidean spaces is not well developed. The development of such a framework is important due to the role that data in nonlinear spaces often play in the study of complex signals and images. While there has been some recent promising early work on the theory of hidden Markov chains in Riemannian manifolds, the framework has yet to fully mature or be applied to real datasets, which is work that can be pursued in the context of estimating and tracking of hidden brain states in Brain-Computer interfaces (BCI) using SPD-valued observations generated from EEG data. The successful candidate would be expected to have (A) a very strong background in differential geometry and/or probability theory, OR (B) strong programming skills, an interest in differential geometry with a good background in differential geometry and/or probability theory and/or numerical analysis. Significant contributions to any of these projects may be expected to result in a scholarly paper in a leading journal in applied mathematics or information theory and participation in a leading international conference within a year or so.
Keywords	Differential geometry, optimisation, hidden Markov models, geometric learning, geometric statistics
References	1. Optimisation on manifolds: https://press.princeton.edu/absil 2. Optimisation on manifolds: https://www.jmlr.org/papers/volume15/boumal14a/boumal14a.pdf 3. Optimisation on manifolds: https://www.manopt.org 4. Geometric Statistics: https://arxiv.org/pdf/2004.04667.pdf 5. Geometric Statistics: https://github.com/geomstats/geomstats 6. Geometric Deep Learning: https://arxiv.org/pdf/1611.08097.pdf 7. Geometric Deep Learning: https://www.quantamagazine.org/an-idea-from-physics-helps-ai-see-in-high... 8. Geometric Deep Learning: https://www.quantamagazine.org/new-machine-learning-system-decodes-how-p... 9. Geometric Science of Information, GSI 2021: https://www.see.asso.fr/en/GSI2021 10. SPD manifolds (sample paper): https://arxiv.org/pdf/1605.06182.pdf
Prerequisite Skills	Geometry/Topology, Simulation
Other Skills Used in the Project	Statistics, Probability/Markov Chains, Numerical Analysis, Geometry/Topology, Simulation, Data Visualization
Programming Languages	Python, MATLAB, R, C++, Mathematica
Work Environment	The student will work closely with Dr Cyrus Mostajeran and at least one other academic with expertise on the topic. Professor Rodolphe Sepulchre will also provide some guidance and will keep an eye on the project from time to time. If more than one candidate is accepted, the CMPs will be coordinated closely. The student will be introduced to members of the Control Group and provided access to seminars within the Information Engineering Division of the Department of Engineering. The specifics of the work environment will be dictated by healthcare measures in place at the time. Remote work and flexible working hours are acceptable options.

Certifying real root isolation

Project Title	Certifying real root isolation
Contact Name	Wenda Li
Contact Email	wl302@cam.ac.uk
Company/Lab/Department	Department of Computer Science and Technology
Address	15 JJ Thomson Avenue, Cambridge CB3 0FD
Period of the Project	8 weeks during the summer of 2021
Project Open to	Undergraduates, Master's (Part III) students
Initial Deadline to register interest	Friday 26th February 2021
Background Information	Modern proof assistants allow us to carry out symbolic computation in a certified way -- we can utilise the expressive logic of the proof assistant to encode the properties of the results and eventually have a (highly trust-worthy) mechanised proof of their correctness. Real root isolation is a classic and ubiquitous problem in computer algebra and a certified implementation of it will hugely facilitate certifying other procedures in the field. We are gradually building a computer algebra platform that computes with proofs.
Brief Description of the Project	This is a project to build an efficient real root isolation procedure based on Descartes' rule of signs, extending a previous formalisation in Isabelle/HOL [1]. The ultimate goal might be to formalise the state-of-the-art version [2], but that's probably too much; it's possible to omit advanced features like partial Taylor shift and approximate arithmetic. The starting point of this project is rather moderate: to certify the termination of the procedure. This requires the theorem of three circles, which is already available in another proof assistant, Coq [3].
Keywords	interactive theorem proving, computer algebra, root isolation, Isabelle, symbolic computing
References	[1] https://www.cl.cam.ac.uk/~wl302/publications/cpp19.pdf [2] https://arxiv.org/pdf/1605.00410.pdf [3] https://arxiv.org/pdf/1306.0783.pdf
Prerequisite Skills	Algebra/Number Theory
Other Skills Used in the Project
Programming Languages	The applicant will need to write formal proofs in Isabelle/HOL. Previous experience with a proof assistant such as Coq, Lean or Isabelle is a plus but not mandatory.
Work Environment	The student will work mainly with me, but can also talk to other members of the group about writing formal proofs in Isabelle. Due to the pandemic, the student is likely to work remotely.

Tomb mathematics: using probability and sets to count bodies in prehistoric burials

Project Title	Tomb mathematics: using probability and sets to count bodies in prehistoric burials
Contact Name	John Robb
Contact Email	jer39@cam.ac.uk
Company/Lab/Department	Department of Archaeology, Cambridge
Address	Downing Street, Cambridge CB2 3DY
Period of the Project	July 1 - August 31
Project Open to	Master's (Part III) students
Initial Deadline to register interest	Friday 26th February 2021
Background Information	This project arises from a practical, surprisingly challenging problem in archaeology: estimating how many people were originally buried in a prehistoric tomb. There are archaeological methods for answering it, none very good; it is worth exploring whether applying mathematics may yield a much better solution.
Brief Description of the Project	Suppose there are things that come in repetitive sets of n objects, perhaps of different types (bones in a skeleton, for instance; or footprints made by an animal crossing a water hole, or radio blips from a space object). Imagine that somebody takes p sets, each of which is incomplete and has lost some of its objects so that it now contains between 1 and n objects. Now they mix them all together, and give the whole assemblage to you. You know the aggregate characteristics of the total assemblage (for instance, how many of each type of object there are). But you no longer know which set a given object came from or how many original sets there were (p). Is there a way to estimate the most likely value of p, i.e. the number of sets which were mixed together to create the assemblage (e.g. the number of people originally buried in the tomb, the number of animals crossing the water hole, the number of space objects causing your radio blips)? Or, better still, come up with a probability distribution for potential values of p? This is a real problem archaeologists often face in analysing skeletal assemblages from mixed deposits such as collective tombs. I suspect that there are mathematical ways to solve it, using probability theory, set theory, some combination of these or something else. One challenge is figuring out the mathematics. A second challenge is integrating real archaeological parameters into the solution; this is likely to be where a fair bit of discussion with archaeologists will be needed. Our research project (based in the Department of Archaeology, and focusing upon analysing prehistoric burials) would like a mathematical colleague to research an appropriate solution.
Outputs would include (i) educating us about it, (ii) collaboratively writing a research paper presenting the solution for archaeological publication, and (iii) making up some low-tech programme, perhaps a spreadsheet or simple calculator, into which archaeologists could enter the parameters of a particular assemblage to calculate the number of individuals it is most likely to contain.
Keywords	archaeology, burials, megalithic tombs, sets, probability
References
Prerequisite Skills
Other Skills Used in the Project
Programming Languages	No Preference
Work Environment	The overarching project consists of a team of 7 archaeologists, geneticists and chemists in the UK, Italy and Estonia. The student will work with the Cambridge people (John Robb (PI) and Jess Thompson (postdoc)) and is welcome to participate in the general project meetings. They will work remotely, though, Covid safety permitting, we may be able to have in-person meetings.
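
To make the counting problem in the project description concrete, here is one possible toy formulation, offered only as a sketch and not as the solution the project is looking for. It assumes (and these are assumptions of this example, not of the project) that each set contributes at most one object of each type, that every object survives independently with the same known probability q, and that the prior on p is flat up to a cap p_max. Under those assumptions the count of each object type is Binomial(p, q), which gives a probability distribution over p directly. The function name, the counts and the value of q are all hypothetical.

```python
from math import comb


def posterior_over_p(counts, q, p_max):
    """Posterior distribution over the number of original sets p.

    counts: number of surviving objects of each type in the assemblage.
    q:      assumed (known) survival probability of any single object.
    p_max:  largest value of p considered (flat prior on max(counts)..p_max).
    """
    p_min = max(counts)  # at least as many sets as the most frequent object type
    weights = {}
    for p in range(p_min, p_max + 1):
        likelihood = 1.0
        for c in counts:
            # Binomial probability of seeing c survivors out of p originals
            likelihood *= comb(p, c) * q**c * (1 - q) ** (p - c)
        weights[p] = likelihood
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


# Hypothetical assemblage: six object types with these surviving counts.
counts = [7, 5, 6, 4, 8, 6]
post = posterior_over_p(counts, q=0.5, p_max=40)
best = max(post, key=post.get)
print("minimum possible p:", max(counts))
print("most probable p:", best)
```

The smallest value in the support, max(counts), corresponds to a simple "minimum number of individuals" count, while the posterior also puts weight on larger values of p. Real assemblages would need type-specific survival probabilities and an unknown q, which is where the archaeological parameters come in.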

3D Mean-Field Theory for Cytoskeleton Organisation in Plant Cells

Project Title	3D Mean-Field Theory for Cytoskeleton Organisation in Plant Cells
Contact Name	Tamsin A. Spelman
Contact Email	tas46@cam.ac.uk
Company/Lab/Department	Sainsbury Laboratory
Address	47 Bateman Street, Cambridge CB2 1LR
Period of the Project	8 weeks
Project Open to	Master's (Part III) students
Initial Deadline to register interest
Background Information	The cytoskeleton is a dynamic fibre network internal to the cell, consisting of microtubules and actin in plants, and essential for many cell functions. The aim of this project is to use a continuum theory to analyse microtubule organisation in 3D and compare the results to output from our detailed computational simulations (which model all fibres individually).
Brief Description of the Project	The main analytic tool for studying microtubule organisation is Mean-Field Theory (MFT). Rather than modelling each microtubule fibre separately, MFT models the densities of growing/shrinking microtubules, of each length and direction (angle relative to reference direction) at each time, using differential equations. While there are many results using MFT in 2D, only a small amount of work considers MFT in 3D. This project will focus on using MFT to analyse microtubule organisation in 3D. This project can include: using/extending the existing 3D MFT framework to look at cytoskeleton organisation in a plant root hair cell and other cell geometries; and/or considering more mathematically fundamental questions about MFT such as analysing the bifurcation point for the onset of cytoskeleton organisation, identified in 2D but not derived in 3D. These results can be compared to output from our existing computational model (and experimental data if available). This project can be geared more towards mathematical analysis or computational modelling dependent on student interest.
Keywords	biology, modelling, coding, plants, cytoskeleton
References	[1] Panagiotis Foteinopoulos. Models of spatial organization of microtubules and cell polarization. PhD Thesis. (The introduction is a nice summary of MFT, and chapter 3 would be our starting point for this project.) [2] M. Dogterom and S. Leibler (1993) Physical Aspects of the Growth and Regulation of Microtubule Structures, Phys. Rev. Lett., 70 (9). (One of the earliest papers on MFT.) [3] P. Durand-Smet et al. (2020) Cytoskeleton organization in isolated plant cells under geometry control. Proc. Natl. Acad. Sci. 202003184. (Example application of our computational cytoskeleton tool.) [4] https://gitlab.com/slcu/teamHJ/tubulaton (GitLab page for our computational model, Tubulaton.)
Prerequisite Skills
Other Skills Used in the Project
Programming Languages	Python, MATLAB, C++
Work Environment	The expectation is that you would need to work remotely due to Covid restrictions. You will be part of the Jonsson group, consisting of 2 PhD students and multiple postdocs, and are welcome at all group activities such as our group meetings and morning coffee meetings, as well as wider lab activities. Supervision will be at least weekly, and more frequent to begin with, and I will be available for further ad hoc discussions beyond that as needed.

Modelling virus evolution in age-structured populations

Project Title	Modelling virus evolution in age-structured populations
Contact Name	Olivier Restif
Contact Email	or226@cam.ac.uk
Company/Lab/Department	Department of Veterinary Medicine
Address	Madingley Road, Cambridge CB3 0ES
Period of the Project	8 weeks between July and September
Project Open to	Undergraduates, Master's (Part III) students
Initial Deadline to register interest	Friday 26th February 2021
Background Information	Like many diseases, COVID-19 varies in its infectivity and severity among age groups; however, it is not clear how this may shape viral evolution. Traditional evolutionary models have shown that a trade-off between infectivity and lethality can lead to the selection of strains with intermediate levels of virulence (similar to a Nash equilibrium in game theory). This prediction may not hold when, for example, transmission is greater in younger age groups who experience milder symptoms.
Brief Description of the Project	The project will adapt existing epidemic models of COVID-19 in age-structured populations to include competing strains with different levels of infectivity and lethality, and predict their relative fitness in different epidemiological contexts. The project will combine analysis of ODE systems, game theory and numerical simulations.
Keywords	COVID-19, evolution, dynamic models, epidemics
References	Sébastien Lion, Johan A.J. Metz. 2018. Beyond R0 Maximisation: On Pathogen Evolution and Environmental Dimensions. Trends in Ecology & Evolution, 33:458-473. https://doi.org/10.1016/j.tree.2018.02.004.
Prerequisite Skills	Numerical Analysis, Mathematical Analysis, Simulation, Mathematical biology
Other Skills Used in the Project
Programming Languages	No Preference
Work Environment	The student will be embedded in the Disease Dynamics Unit, a multi-disciplinary group working on infectious diseases, and will take part in regular meetings. If restrictions on in-person meetings remain in place over the summer, the project can be conducted and supervised remotely.
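
As an illustration of the type of model described above, the following sketch integrates a toy two-strain SIR system with two age groups using a forward Euler scheme. It is a minimal example under stated assumptions, not the model the project will use: the compartment structure, the contact matrix, and all parameter values (beta, mu, gamma) are hypothetical, and no cross-immunity or co-infection subtleties are modelled.

```python
def simulate(beta, mu, contact, gamma=0.2, dt=0.05, days=200):
    """Forward-Euler integration of a toy age-structured, two-strain SIR model.

    beta[k]       : transmission rate of strain k (hypothetical values)
    mu[a][k]      : disease-induced death rate of strain k in age group a
    contact[a][b] : relative contact rate between age groups a and b
    gamma         : recovery rate, shared by all strains and age groups
    Returns final S, I, R (as population fractions) and the cumulative
    fraction of the population infected by each strain.
    """
    n_age, n_strain = len(contact), len(beta)
    S = [1.0 / n_age - 0.0005 * n_strain] * n_age
    I = [[0.0005] * n_strain for _ in range(n_age)]
    R = [0.0] * n_age
    cumulative = [0.0] * n_strain
    for _ in range(int(days / dt)):
        # Force of infection on age group a from strain k
        lam = [[beta[k] * sum(contact[a][b] * I[b][k] for b in range(n_age))
                for k in range(n_strain)] for a in range(n_age)]
        new_S, new_I, new_R = S[:], [row[:] for row in I], R[:]
        for a in range(n_age):
            for k in range(n_strain):
                infections = S[a] * lam[a][k]
                new_S[a] -= dt * infections
                new_I[a][k] += dt * (infections - (gamma + mu[a][k]) * I[a][k])
                new_R[a] += dt * gamma * I[a][k]
                cumulative[k] += dt * infections
        S, I, R = new_S, new_I, new_R
    return S, I, R, cumulative


# Hypothetical parameters: strain 1 is more infectious and more lethal,
# and lethality is higher in the older age group (index 1).
beta = [0.5, 0.7]
mu = [[0.001, 0.002], [0.01, 0.03]]
contact = [[3.0, 1.0], [1.0, 1.5]]
S, I, R, cumulative = simulate(beta, mu, contact)
print("cumulative infections per strain:", [round(c, 3) for c in cumulative])
```

Comparing the cumulative infections of the two strains across different contact matrices and mortality profiles is one crude way to explore relative fitness numerically; an analytical treatment would instead look at invasion conditions of the ODE system.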

Deepmath

Project Title	Deepmath
Contact Name	Dr. Anthony Bordg
Contact Email	apdb3@cam.ac.uk
Company/Lab/Department	Department of Computer Science and Technology
Address	15 JJ Thomson Avenue, Cambridge CB3 0FD
Period of the Project	about 8 weeks (flexible start and end dates)
Project Open to	Undergraduates, Master's (Part III) students
Initial Deadline to register interest
Background Information	Do you want to explore the current frontier of AI techniques? Deep learning for theorem proving is the future. Don't miss out!
Brief Description of the Project	A first goal would be to create a data set that could be used as a benchmark in a learning environment for the Isabelle/HOL theorem prover.
Keywords	deep learning, machine learning, interactive theorem proving, Isabelle/HOL, certified mathematics
References	HOList: An Environment for Machine Learning of Higher-Order Theorem Proving, Bansal et al.
Prerequisite Skills
Other Skills Used in the Project
Programming Languages	Python, C++, No Preference, Standard ML, Scala, OCaml
Work Environment	The student will interact with the ALEXANDRIA team and its collaborators.

Developing a method to identify highly similar short substrings of DNA sequence from a pool of longer, variably related DNA sequences

Project Title	Developing a method to identify highly similar short substrings of DNA sequence from a pool of longer, variably related DNA sequences
Contact Name	Dr Elizabeth Soilleux
Contact Email	ejs17@cam.ac.uk
Company/Lab/Department	Department of Pathology, University of Cambridge
Address	Division of Cellular and Molecular Pathology, Department of Pathology, Cambridge Biomedical Campus
Period of the Project	8 weeks
Project Open to	Undergraduates, Master's (Part III) students
Initial Deadline to register interest	Friday 26th February 2021
Background Information	All cells in the body normally contain identical genomic DNA sequences. B-cells and T-cells are types of lymphocyte and are unusual among cells in that each carries a short unique sequence within its genome. This sequence is produced by rearranging defined short blocks of sequence in a combinatorial fashion, with an added stochastic element that further changes the DNA sequence at the junctions between these blocks. This provides an extraordinary opportunity in biology to use the unique sequence (essentially a "bar code") to track individual B-cells and T-cells. In this project, we are interested in using such unique sequences to determine whether a sample contains cancerous lymphocytes, known as lymphoma. Because a lymphoma develops from one original cell, all of its cells contain the same unique DNA sequence, rather than each having a different one.
Brief Description of the Project	We have sequence data from lymphoma samples, which will also have contained varying percentages (c. 10-90%) of benign lymphocytes. We aim to define the unique bar code region for each individual sequence, but as there are many thousands of sequences, this needs to be done in an automated manner. Because of the novel sequencing method we have employed, the DNA sequences are of different lengths, which makes aligning them to reference sequences more complicated than with the standard methods used in our laboratory. These sequences also contain parts of the blocks of sequence that were rearranged to make the unique sequences mentioned above, but we need to ignore most of the building block sequence and focus just on the unique bar code sequence. Ultimately, we seek to build an automated, high-throughput pipeline that can tell us the percentage of sequences containing identical "bar codes" in any given sample, in order to determine whether or not it contains lymphoma.
Keywords	Bioinformatics, DNA sequence, alignment, analytical pipeline
References	https://www.repository.cam.ac.uk/handle/1810/309837 10.1016/j.xcrm.2021.100192 10.1002/path.5592 https://jcp.bmj.com/content/71/3/195
Prerequisite Skills	Statistics, Mathematical Analysis, Ideally computational skills in R or Python or a willingness to learn. No particular background is necessary. A creative approach and enthusiasm to get to grips with DNA sequences (not very challenging) and to test out different methods would be most valuable to us.
Other Skills Used in the Project	Statistics, Numerical Analysis, Mathematical Analysis, Simulation, Database Queries, Data Visualization, App Building
Programming Languages	Python, R
Work Environment	We are a multidisciplinary research group ranging from "wet lab" to bioinformatics, with individuals with a maths, engineering, bioengineering, biology and medical background and can provide plenty of support for this project, which can be undertaken remotely in any time zone. We meet weekly by Teams and have ad hoc meetings in between. We have successfully hosted two mathematicians for summer projects and Computational Biology MPhil projects in the past, both achieving a publication from the work.
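
The final step described above (the percentage of sequences sharing an identical "bar code") can be sketched in Python as follows. This is only a minimal illustration under strong assumptions, not the group's method: it assumes the bar code can be located as the substring between two known flanking motifs, and it matches those motifs exactly, whereas real reads of varying length would need alignment that tolerates mismatches. The function names and the flank arguments are hypothetical.

```python
from collections import Counter


def extract_barcode(read, left_flank, right_flank):
    """Return the substring between the two flanks, or None if a flank is missing.

    Assumption: the flanks are short, known pieces of the building-block
    sequence on either side of the bar code, and they match exactly.
    """
    start = read.find(left_flank)
    if start == -1:
        return None
    start += len(left_flank)
    end = read.find(right_flank, start)
    if end == -1:
        return None
    return read[start:end]


def dominant_clone_fraction(reads, left_flank, right_flank):
    """Return (most common bar code, its fraction among reads with a bar code)."""
    barcodes = []
    for read in reads:
        barcode = extract_barcode(read, left_flank, right_flank)
        if barcode is not None:
            barcodes.append(barcode)
    if not barcodes:
        return None, 0.0
    barcode, count = Counter(barcodes).most_common(1)[0]
    return barcode, count / len(barcodes)


# Toy reads of different lengths; the flanks "AAA" and "GGG" are made up.
reads = ["AAACCGTTGGG", "TTAAACCGTTGGGA", "AAATTTTGGG", "CCCC"]
print(dominant_clone_fraction(reads, "AAA", "GGG"))
```

A high fraction for a single bar code would point towards a clonal population; any threshold for calling a sample lymphoma would have to be chosen and validated on real data.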

Optimal transport for light microscopy

Project Title	Optimal transport for light microscopy
Contact Name	Leila Muresan, Yury Korolev, Jerome Boulanger
Contact Email	lam94@cam.ac.uk
Company/Lab/Department	Cambridge Advanced Imaging Centre / PDN, DAMTP and MRC-LMB
Address	PDN, Downing Site, Cambridge CB2 3DY
Period of the Project	8 weeks
Project Open to	Undergraduates, Master's (Part III) students
Initial Deadline to register interest	Friday 26th February 2021
Background Information	Biological tissue can undergo significant morphological changes over time. Changes occur over multiple scales, for example the establishment of the whole-body pattern during development. At another scale, fluxes of labelled proteins involved in processes such as membrane trafficking between organelles provide useful insights into underlying biological mechanisms.
Brief Description of the Project	A promising approach to modelling fluxes in biological processes is optimal transport [1]. Optimal transport studies couplings of probability measures that redistribute mass from one measure to another in a way that minimises a given cost. Optimal transport has found many applications in imaging and data science in the past decade. The aim of this project is to explore the potential of optimal transport approaches for videomicroscopy data. The project will benefit from the support of groups in the MRC Laboratory of Molecular Biology, DAMTP and the Cambridge Advanced Imaging Centre.
Keywords	optimal transport, image processing, light microscopy
References	[1] G. Peyre, M. Cuturi. Computational Optimal Transport. Foundations and Trends in Machine Learning: Vol. 11: No. 5-6, pp 355-607. Available in open access: https://optimaltransport.github.io/book/
Prerequisite Skills
Other Skills Used in the Project
Programming Languages	No Preference
Work Environment	The student will be part of a multi-disciplinary team (Jerome Boulanger, Nick Barry LMB, Yury Korolev, Carola Schoenlieb, DAMTP, Leila Muresan, CAIC). Given the Covid situation, remote working is possible.
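
To make the idea of an optimal coupling concrete, here is a minimal Python sketch for the one-dimensional case only. It relies on a standard fact: for two discrete measures on sorted points of a line, with cost |x - y|^p and p >= 1, the monotone coupling (fill mass from left to right) is optimal. Microscopy images are two- or three-dimensional, so a real analysis would need a general solver such as those discussed in [1]; the function names and the toy intensity profiles below are hypothetical.

```python
def monotone_coupling(a, b, tol=1e-12):
    """Couple two mass vectors given on sorted 1D positions, left to right.

    a, b: non-negative masses with equal totals.
    Returns a list of (i, j, mass) entries of the transport plan.
    """
    a, b = list(a), list(b)  # work on copies so the inputs are not modified
    i = j = 0
    plan = []
    while i < len(a) and j < len(b):
        moved = min(a[i], b[j])
        if moved > tol:
            plan.append((i, j, moved))
        a[i] -= moved
        b[j] -= moved
        # Advance past whichever side has been exhausted (possibly both).
        source_empty = a[i] <= tol
        target_full = b[j] <= tol
        if source_empty:
            i += 1
        if target_full:
            j += 1
    return plan


def transport_cost(plan, x, y, p=1):
    """Total cost of a plan for ground cost |x_i - y_j|**p."""
    return sum(mass * abs(x[i] - y[j]) ** p for i, j, mass in plan)


# Toy example: an intensity profile along a line shifts one pixel to the right.
positions = [0, 1, 2]
frame_t0 = [0.5, 0.5, 0.0]
frame_t1 = [0.0, 0.5, 0.5]
plan = monotone_coupling(frame_t0, frame_t1)
print(plan, transport_cost(plan, positions, positions))
```

Each plan entry (i, j, mass) can be read as a flux of that much mass from position i in the first frame to position j in the second, which is the kind of quantity the project description relates to protein fluxes.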
