## List of all projects with keywords (click link for full listing)

- Models and algorithms for DNA evolution and sequencing

Keywords: Molecular evolution, phylogenetics, sequencing, Markov chains, probability

- Mathematical modelling of cancer immunotherapy

Keywords: Mathematical modelling, cancer immunotherapy, systems biology, Ordinary differential equation, numerical simulation

- Optimisation algorithms, statistical models, and probabilistic methods on manifolds

Keywords: Differential geometry, optimisation, hidden Markov models, geometric learning, geometric statistics

- Certifying real root isolation

Keywords: interactive theorem proving, computer algebra, root isolation, Isabelle, symbolic computing

- Tomb mathematics: using probability and sets to count bodies in prehistoric burials

Keywords: archaeology, burials, megalithic tombs, sets, probability

- 3D Mean-Field Theory for Cytoskeleton Organisation in Plant Cells

Keywords: biology, modelling, coding, plants, cytoskeleton

- Modelling virus evolution in age-structured populations

Keywords: COVID-19, evolution, dynamic models, epidemics

- Deepmath

Keywords: deep learning, machine learning, interactive theorem proving, Isabelle/HOL, certified mathematics

- Developing a method to identify highly similar short substrings of DNA sequence from a pool of longer, variably related DNA sequences

Keywords: Bioinformatics, DNA sequence, alignment, analytical pipeline

- Optimal transport for light microscopy

Keywords: optimal transport, image processing, light microscopy

## Models and algorithms for DNA evolution and sequencing

Project Title | Models and algorithms for DNA evolution and sequencing |

Contact Name | Nicola De Maio |

Contact Email | demaio@ebi.ac.uk |

Company/Lab/Department | European Bioinformatics Institute (EMBL-EBI), Goldman group |

Address | Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD |

Period of the Project | About 10 weeks, but flexible |

Project Open to | Undergraduates, Master's (Part III) students |

Initial Deadline to register interest | |

Background Information | DNA is key to understanding life and evolution. In our group we use probabilistic models, often Markov Chains, to study DNA and its evolution. These models are useful for many applications, such as reconstructing the evolutionary histories of species or pathogens, and understanding the inner workings of life. We also work to improve how DNA information is collected, specifically, developing models and algorithms to improve the efficiency of, e.g., Oxford Nanopore Technology ("Nanopore") sequencing. |

Brief Description of the Project |
We can offer several projects, depending on the students' interests and skills. Other projects may also become available or may be discussed.

1) Pandemic Phylogenetics. Genomics has revolutionized epidemiology. Genome sequence data from SARS-CoV-2 is used, for example, to track transmission of the virus, to develop vaccines, to understand the spread of the disease across human populations and between different animal hosts, and to understand how the virus is evolving and possibly adapting to humans. However, the vast amount of genomic data available for SARS-CoV-2, and expected for future epidemics, poses important challenges for the algorithms and mathematical models currently used in data analysis. For this reason, we want to scale up methods in phylogenetics, molecular evolution and genomic epidemiology to SARS-CoV-2 data and to future pandemics (see e.g. https://doi.org/10.1101/2020.09.26.314971). This project will require a basic understanding of probabilistic models. Basic programming skills will also be important.

2) Detecting natural selection from DNA data using neural networks. Artificial intelligence is revolutionizing many aspects of bioinformatics, but so far applications to phylogenetics and molecular evolution have had limited success. We are developing a new neural network approach (OmegaAI) to detect natural selection acting on genes, which so far shows improved performance over traditional approaches based on maximum likelihood statistical inference. We want to extend our methods to other evolutionary processes. To work on this project, you will need some experience programming neural networks and an interest in molecular evolution.

3) Improving the efficiency of Nanopore sequencing. We recently developed an approach to improve the efficiency with which DNA data is gathered in Nanopore sequencing, and we are currently experimenting further with this technology (https://www.biorxiv.org/content/10.1101/2020.02.07.938670v2). We would like to develop alternative versions of our model of DNA sequencing in Nanopore that would improve its long-term performance. The first part of this project requires an interest in mathematical modeling and some basic knowledge of probability theory; some very basic familiarity with concepts from information theory is a plus. In the second part of the project (depending on internship length), familiarity with Python programming would be a great advantage. |
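As a minimal illustration of the kind of Markov chain model mentioned in the background information, the sketch below uses the Jukes-Cantor (JC69) substitution model, in which a nucleotide changes to each of the other three at the same rate. JC69 is used here only because it is the simplest standard model. It is an assumption of this example rather than a description of the group's methods, and the function names are hypothetical.

```python
import math


def jc69_transition_probability(same: bool, branch_length: float) -> float:
    """Probability of ending in the same (or one specific different) nucleotide.

    branch_length is the expected number of substitutions per site.
    """
    decay = math.exp(-4.0 * branch_length / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay


def jc69_distance(seq_a: str, seq_b: str) -> float:
    """Evolutionary distance between two aligned sequences under JC69."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned and non-empty")
    # Observed proportion of sites at which the two sequences differ.
    p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    if p >= 0.75:
        raise ValueError("sequences are saturated; distance is undefined")
    # The correction accounts for multiple substitutions at the same site.
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    print(jc69_transition_probability(True, 0.1))
    print(jc69_distance("ACGTACGT", "ACGTACGA"))
```

The corrected distance is always at least as large as the raw proportion of differing sites, because some substitutions are hidden by later changes at the same position.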

Keywords | Molecular evolution, phylogenetics, sequencing, Markov chains, probability |

References | https://doi.org/10.1101/2020.09.26.314971 https://www.biorxiv.org/content/10.1101/2020.02.07.938670v2 |

Prerequisite Skills | Probability/Markov Chains, Most projects require familiarity with a programming language, often Python. |

Other Skills Used in the Project | Statistics, Simulation, Predictive Modelling, Data Visualization |

Programming Languages | Python, C++. Some projects may allow broader choice of programming language. |

Work Environment | So far we expect remote working, but the situation might change. We expect normal working hours (about 40 per week) but we are flexible. The project will be in collaboration with the group leader (Nick Goldman), a senior postdoc in the group (Nicola De Maio), and, depending on the project, a PhD student and/or other interns in the group. Daily discussions are possible with the members of the group directly involved in the projects, and weekly discussions with the whole group. |

## Mathematical modelling of cancer immunotherapy

Project Title | Mathematical modelling of cancer immunotherapy |

Contact Name | Rahuman Sheriff |

Contact Email | sheriff@ebi.ac.uk |

Company/Lab/Department | BioModels, European Bioinformatics Institute (EMBL-EBI) |

Address | Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD |

Period of the Project | 8 weeks - 6 months |

Project Open to | Undergraduates, Master's (Part III) students |

Initial Deadline to register interest | Friday 26th February 2021 |

Background Information |
As part of its normal function, the immune system detects and destroys abnormal cells and most likely prevents or curbs the growth of many cancers. Harnessing the immune system to treat chronic infectious diseases or cancer is a major goal of immunotherapy. Immunosuppression is the key mechanism through which cancer escapes immune targeting, and it involves a complex interplay between cancer and immune cells. Cancer immunotherapy involves targeting or using immune system components (including antibodies, cancer vaccines, T cells, etc.) to kill tumour cells and suppress cancer progression. The complex interactions between immune cells and cancer cells are modelled mathematically to study system behaviour and develop new immunotherapy strategies for cancer. Several models have been proposed recently and new models appear frequently (Arulraj and Barik, PLOS ONE 2018; Ruby Kim, SPORA 2018; Kaitao Li, JBC 2017). In this internship, you will encode the mathematical models (primarily ordinary differential equation models) published in research articles, perform simulations to reproduce their results and share them publicly through the BioModels repository (Malik-Sheriff et al. 2020, Nucleic Acids Research). Your code will potentially be reused by scientists across the world. You will also have an opportunity to contribute to and co-author the scientific manuscripts we write on this topic. Almost all of our interns in BioModels have been part of our scientific manuscripts (Malik-Sheriff et al. 2020, Nucleic Acids Research; Tiwari et al., bioRxiv). This internship will give you a strong background and skills in systems biology modelling. A couple of our interns who worked on this project earlier have secured PhD positions at Harvard and Oxford universities; the former was a Cambridge undergraduate. |

Brief Description of the Project | BioModels (https://www.ebi.ac.uk/biomodels/), hosted at EMBL-EBI, is one of the world's leading repositories of mathematical models of biological processes. In the proposed curation internship, you will work with the curation team at BioModels and will learn to curate and annotate mathematical models from published scientific literature and deposit them in BioModels. You will further learn to use simulation software such as COPASI to reproduce the simulation results from the reference publication, and learn to encode models in standard modelling formats such as the Systems Biology Markup Language (SBML). Specifically, you will work on targeted curation of literature-based models in immuno-oncology. We will focus on ordinary differential equation (ODE) models. You will gain strong exposure to cancer immunology and cancer biology (in general) by participating in the targeted curation of these models. You will encode ODE models and perform numerical simulations of them. You will learn to visualise simulation results. In the past, BioModels targeted curation of literature-based models on diabetes and neurodegenerative diseases has resulted in scientific publications: Lloret-Villas et al. 2017 and Ajmera et al. 2013, respectively. The cancer immunotherapy models you will curate under the supervision of the curation team will be published in the BioModels repository, disseminated to a broader scientific community and can potentially be summarised in a scientific publication. Thus, in this internship, in addition to professional growth, you will also contribute to the growth of the open access BioModels repository, which has around 20,000 distinct users per month. |
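To give a flavour of what encoding and simulating an ODE model involves, here is a minimal sketch in plain Python. The two-variable tumour and effector-cell model, its parameter values and all function names are hypothetical and chosen only for illustration. They are not taken from any of the cited publications, and the internship itself would use tools such as COPASI and SBML rather than hand-written integrators.

```python
def tumour_immune_rhs(state, growth=0.5, capacity=1.0, kill=1.0,
                      supply=0.1, death=0.5, recruit=0.2):
    """Right-hand side of a toy tumour (T) and effector cell (E) model."""
    tumour, effector = state
    # Logistic tumour growth minus killing by effector cells.
    d_tumour = growth * tumour * (1.0 - tumour / capacity) - kill * effector * tumour
    # Constant supply, natural death, and saturating tumour-driven recruitment.
    d_effector = (supply - death * effector
                  + recruit * effector * tumour / (1.0 + tumour))
    return [d_tumour, d_effector]


def rk4_step(rhs, state, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = rhs(state)
    k2 = rhs([x + 0.5 * dt * k for x, k in zip(state, k1)])
    k3 = rhs([x + 0.5 * dt * k for x, k in zip(state, k2)])
    k4 = rhs([x + dt * k for x, k in zip(state, k3)])
    return [x + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for x, a, b, c, d in zip(state, k1, k2, k3, k4)]


def simulate(rhs, initial_state, dt, n_steps):
    """Return the list of states visited, including the initial one."""
    trajectory = [list(initial_state)]
    for _ in range(n_steps):
        trajectory.append(rk4_step(rhs, trajectory[-1], dt))
    return trajectory


if __name__ == "__main__":
    final_tumour, final_effector = simulate(tumour_immune_rhs, [0.1, 0.2], 0.01, 5000)[-1]
    print(final_tumour, final_effector)
```

A useful sanity check when reproducing a published model is to compare the simulation with a case that can be solved by hand. In this toy model, with no tumour present the effector population should settle at supply / death.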

Keywords | Mathematical modelling, cancer immunotherapy, systems biology, Ordinary differential equation, numerical simulation |

References | [1] www.ebi.ac.uk/biomodels [2] https://academic.oup.com/nar/article/48/D1/D407/5614569 [3] https://www.biorxiv.org/content/10.1101/2020.08.07.239855v1 [4] https://www.cancer.gov/about-cancer/treatment/types/immunotherapy |

Prerequisite Skills | Simulation, Predictive Modelling, Ordinary Differential Equations |

Other Skills Used in the Project | Mathematical Analysis. A-level Biology knowledge is essential. Undergraduate-level Biology knowledge will be an advantage. |

Programming Languages | No Preference, Knowledge in any of the above programming languages desirable |

Work Environment | Although we work from 9:00 to 17:00, flexible working hours can be arranged. You will be part of the BioModels team, which includes curators and software engineers. You can interact with and learn from other team members and interns. You will have an opportunity to engage in broader activities at EMBL-EBI. Due to the current COVID-19 situation, our campus is closed and it is not clear how the situation will evolve, so you might have to work remotely. We currently have an intern working remotely from another country, so even when working remotely you will have the full support and supervision you need. |

## Optimisation algorithms, statistical models, and probabilistic methods on manifolds

Project Title | Optimisation algorithms, statistical models, and probabilistic methods on manifolds |

Contact Name | Dr Cyrus Mostajeran |

Contact Email | csm54@cam.ac.uk |

Company/Lab/Department | Engineering Department |

Address | Trumpington Street, Cambridge CB2 1PZ |

Period of the Project | Any 8-week period from late June to early September. |

Project Open to | Master's (Part III) students |

Initial Deadline to register interest | Friday 26th February 2021 |

Background Information | Many scientific fields study data with an underlying structure that is non-Euclidean. Research interest in optimisation on manifolds and related geometric methods has gained considerable momentum in recent years, with successful applications to areas such as computer vision, robotics, medical imaging, and machine learning. Similarly, statistics and probabilistic methods on non-Euclidean spaces are of interest in numerous applications in information engineering where data is intrinsically manifold-valued. While many statistical concepts and techniques such as principal component analysis (PCA) and regression have been extended and widely applied to non-Euclidean data, more sophisticated techniques such as Markov Chain Monte Carlo (MCMC) and deep neural network models are in the relatively early days of their development within a non-Euclidean setting. The aim of the CMP will be to contribute to such efforts within the framework of one of a number of available projects on statistical models and probabilistic methods on manifolds or closely related optimisation algorithms. |

Brief Description of the Project |
A number of projects are available depending on the particular interests and strengths of the candidate. Many of the projects will focus on the space of symmetric positive definite (SPD) matrices, which is often viewed as a Riemannian symmetric space of nonpositive curvature. This particular manifold is of enormous practical interest due to the prevalence of covariance matrices as states or data points in numerous applications. A non-exhaustive list of potential projects is given below.

1. Conic geometric optimisation with extremal eigenvalues: Computing standard geometric objects such as distances, geodesics, matrix exponentials and logarithms in the standard Riemannian geometry on SPD matrices often amounts to the computation of the generalised eigenvalues of a pair of SPD matrices, which typically means a significant increase in computational complexity, particularly for larger matrices. An alternative and scalable approach to geometric computation on the cone of SPD matrices is based on the Thompson geometry of the space, which amounts to working with extremal eigenvalues. There are a number of high-impact open problems on the development of a geometric statistical framework rooted in extremal eigenvalue computations, which could form a self-contained project.

2. Proximal algorithms on manifolds: Proximal methods are standard tools for solving optimisation problems and are particularly effective for non-smooth, constrained, large-scale, or distributed problems involving large and high-dimensional datasets, which have become a subject of widespread interest in recent years due to applications in machine learning and related fields. While there has been some research on developing proximal algorithms on manifolds, the current state of the theory is nowhere near as successful as it is in the traditional Euclidean setting. The development of an effective theory of proximal algorithms on manifolds is another possible CMP project.

3. Hidden Markov models and fields: A hidden Markov model is a statistical Markov model in which the system under consideration is assumed to be a Markov process with unobservable hidden states that drive another observable process. The objective is to learn about the hidden states from the available observations. Hidden Markov chains and fields play a major role in signal and image processing and are central to several approaches to image restoration and segmentation. While the literature on hidden Markov models with observations in Euclidean spaces is extensive, the corresponding theory with observations in non-Euclidean spaces is not well developed. The development of such a framework is important due to the role that data in nonlinear spaces often play in the study of complex signals and images. While there has been some promising early work on the theory of hidden Markov chains in Riemannian manifolds, the framework has yet to fully mature or be applied to real datasets. This work can be pursued in the context of estimating and tracking hidden brain states in Brain-Computer Interfaces (BCI) using SPD-valued observations generated from EEG data.

The successful candidate would be expected to have (A) a very strong background in differential geometry and/or probability theory, OR (B) strong programming skills and an interest in differential geometry, with a good background in differential geometry and/or probability theory and/or numerical analysis. Significant contributions to any of these projects may be expected to result in a scholarly paper in a leading journal in applied mathematics or information theory and participation in a leading international conference within a year or so. |
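The contrast drawn in project 1 between the standard Riemannian geometry and the Thompson geometry can be made concrete with a small sketch. For SPD matrices A and B with generalised eigenvalues λ_i (the eigenvalues of A⁻¹B), the affine-invariant Riemannian distance is sqrt(Σ log² λ_i), whereas the Thompson distance is max(|log λ_max|, |log λ_min|) and so needs only the two extremal eigenvalues. The code below is restricted to 2×2 matrices so that it can use closed-form eigenvalues and the standard library only. In that case all eigenvalues are extremal, so it illustrates the definitions rather than the scalability benefit. The function names are hypothetical.

```python
import math


def generalised_eigenvalues_2x2(a, b):
    """Eigenvalues of inv(A) @ B for 2x2 SPD matrices.

    Each matrix is given as ((m11, m12), (m12, m22)).
    Returns (smallest, largest).
    """
    (a11, a12), (_, a22) = a
    (b11, b12), (_, b22) = b
    det_a = a11 * a22 - a12 * a12
    det_b = b11 * b22 - b12 * b12
    if a11 <= 0 or det_a <= 0 or b11 <= 0 or det_b <= 0:
        raise ValueError("both matrices must be symmetric positive definite")
    # Trace and determinant of inv(A) @ B, written out for the 2x2 case.
    trace = (a22 * b11 - 2.0 * a12 * b12 + a11 * b22) / det_a
    det = det_b / det_a
    disc = max(trace * trace - 4.0 * det, 0.0)  # guard against rounding error
    largest = 0.5 * (trace + math.sqrt(disc))
    smallest = det / largest  # avoids cancellation in (trace - sqrt(disc))
    return smallest, largest


def affine_invariant_distance(a, b):
    """Riemannian distance: uses every generalised eigenvalue."""
    return math.sqrt(sum(math.log(lam) ** 2
                         for lam in generalised_eigenvalues_2x2(a, b)))


def thompson_distance(a, b):
    """Thompson distance: uses only the extremal generalised eigenvalues."""
    smallest, largest = generalised_eigenvalues_2x2(a, b)
    return max(abs(math.log(largest)), abs(math.log(smallest)))


if __name__ == "__main__":
    A = ((2.0, 0.5), (0.5, 1.0))
    B = ((1.0, 0.2), (0.2, 3.0))
    print(affine_invariant_distance(A, B), thompson_distance(A, B))
```

For larger matrices, the extremal eigenvalues can be approximated with iterative methods that avoid a full eigendecomposition, which is where the computational saving comes from.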

Keywords | Differential geometry, optimisation, hidden Markov models, geometric learning, geometric statistics |

References |
1. Optimisation on manifolds: https://press.princeton.edu/absil |

Prerequisite Skills | Geometry/Topology, Simulation |

Other Skills Used in the Project | Statistics, Probability/Markov Chains, Numerical Analysis, Geometry/Topology, Simulation, Data Visualization |

Programming Languages | Python, MATLAB, R, C++, Mathematica |

Work Environment | The student will work closely with Dr Cyrus Mostajeran and at least one other academic with expertise on the topic. Professor Rodolphe Sepulchre will also provide some guidance and will keep an eye on the project from time to time. If more than one candidate is accepted, the CMPs will be coordinated closely. The student will be introduced to members of the Control Group and provided access to seminars within the Information Engineering Division of the Department of Engineering. The specifics of the work environment will be dictated by the public health measures in place at the time. Remote work and flexible working hours are acceptable options. |

## Certifying real root isolation

Project Title | Certifying real root isolation |

Contact Name | Wenda Li |

Contact Email | wl302@cam.ac.uk |

Company/Lab/Department | Department of Computer Science and Technology |

Address | 15 JJ Thomson Avenue, Cambridge CB3 0FD |

Period of the Project | 8 weeks during the summer of 2021 |

Project Open to | Undergraduates, Master's (Part III) students |

Initial Deadline to register interest | Friday 26th February 2021 |

Background Information | Modern proof assistants allow us to carry out symbolic computation in a certified way -- we can utilise the expressive logic of the proof assistant to encode the properties of the results and eventually have a (highly trust-worthy) mechanised proof of their correctness. Real root isolation is a classic and ubiquitous problem in computer algebra and a certified implementation of it will hugely facilitate certifying other procedures in the field. We are gradually building a computer algebra platform that computes with proofs. |

Brief Description of the Project | This is a project to build an efficient real root isolation procedure based on Descartes' rule of signs, extending a previous formalisation in Isabelle/HOL [1]. The ultimate goal might be to formalise the state-of-the-art version [2], but that is probably too much; it is possible to omit advanced features like partial Taylor shift and approximate arithmetic. The starting point of this project is rather moderate: to certify the termination of the procedure. This requires the theorem of three circles, which is already available in another proof assistant, Coq [3]. |
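For readers unfamiliar with Descartes' rule of signs, the following Python sketch shows the counting step on which such procedures are built. The number of positive real roots of a polynomial, counted with multiplicity, is at most the number of sign changes in its coefficient sequence and differs from it by an even number. Applying the same count to (x + 1)^n p(1 / (x + 1)) bounds the number of roots in (0, 1), which is the test typically used when bisecting intervals. This is an uncertified illustration with hypothetical function names. It is not the Isabelle/HOL formalisation of [1], and it assumes exact (integer or rational) coefficients, listed from the constant term upwards, with a non-zero constant term.

```python
def sign_variations(coeffs):
    """Number of sign changes in a coefficient sequence, ignoring zeros."""
    signs = [c > 0 for c in coeffs if c != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def taylor_shift_by_one(coeffs):
    """Coefficients (constant term first) of p(x + 1)."""
    shifted = list(coeffs)
    n = len(shifted)
    # Repeated synthetic division by (x - (-1)), done in place.
    for i in range(n - 1):
        for j in range(n - 2, i - 1, -1):
            shifted[j] += shifted[j + 1]
    return shifted


def descartes_bound_unit_interval(coeffs):
    """Upper bound on the number of roots of p in the open interval (0, 1).

    Reversing the coefficients gives x^n p(1/x); shifting by one then
    gives (x + 1)^n p(1 / (x + 1)), whose positive roots correspond to
    the roots of p in (0, 1).
    """
    return sign_variations(taylor_shift_by_one(coeffs[::-1]))


if __name__ == "__main__":
    # p(x) = 6x^2 - 5x + 1 = (2x - 1)(3x - 1), roots 1/2 and 1/3.
    print(sign_variations([1, -5, 6]))
    print(descartes_bound_unit_interval([1, -5, 6]))
```

When the bound is 0 the interval contains no root, and when it is 1 the interval contains exactly one. Otherwise the interval is split and the test repeated, and it is the termination of this splitting that the project sets out to certify.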

Keywords | interactive theorem proving, computer algebra, root isolation, Isabelle, symbolic computing |

References | [1] https://www.cl.cam.ac.uk/~wl302/publications/cpp19.pdf [2] https://arxiv.org/pdf/1605.00410.pdf [3] https://arxiv.org/pdf/1306.0783.pdf |

Prerequisite Skills | Algebra/Number Theory |

Other Skills Used in the Project | |

Programming Languages | The applicant will need to write formal proofs in Isabelle/HOL. Previous experience with a proof assistant such as Coq, Lean or Isabelle is a plus but not mandatory. |

Work Environment | The student will work mainly with me, but can also talk to other members of the group about writing formal proofs in Isabelle. Due to the pandemic, the student is likely to work remotely. |

## Tomb mathematics: using probability and sets to count bodies in prehistoric burials

Project Title | Tomb mathematics: using probability and sets to count bodies in prehistoric burials |

Contact Name | John Robb |

Contact Email | jer39@cam.ac.uk |

Company/Lab/Department | Department of Archaeology, Cambridge |

Address | Downing Street, Cambridge CB2 3DY |

Period of the Project | July 1 - August 31 |

Project Open to | Master's (Part III) students |

Initial Deadline to register interest | Friday 26th February 2021 |

Background Information | This project arises from a practical, surprisingly challenging problem in archaeology: estimating how many people were originally buried in a prehistoric tomb. There are archaeological methods for answering it, none very good; it is worth exploring whether applying mathematics may yield a much better solution. |

Brief Description of the Project |
Suppose you have things that come in repetitive sets of n objects, perhaps of different types (bones in a skeleton, for instance; or footprints made by an animal crossing a water hole, or radio blips from a space object). Imagine that somebody takes p sets, each of which is incomplete and has lost some of its objects so that it now contains between 1 and n objects. Now they mix them all together, and give the whole assemblage to you. You know the aggregate characteristics of the total assemblage (for instance, how many of each type of object there are). But you no longer know which set a given object came from or how many original sets there were (p). Is there a way to estimate the most likely value of p, i.e. the number of sets which were mixed together to create the assemblage (e.g. the number of people originally buried in the tomb, the number of animals crossing the water hole, the number of space objects causing your radio blips)? Or, better still, to come up with a probability distribution for potential values of p? This is a real problem archaeologists often face in analysing skeletal assemblages from mixed deposits such as collective tombs. I suspect that there are mathematical ways to solve it, using probability theory, set theory, some combination of these or something else. One challenge is figuring out the mathematics. A second challenge is integrating real archaeological parameters into the solution; this is likely to be where a fair bit of discussion with archaeologists will be needed. Our research project (based in the Department of Archaeology, and focusing upon analysing prehistoric burials) would like a mathematical colleague to research an appropriate solution.
Outputs would include (i) educating us about it, (ii) collaboratively writing a research paper presenting the solution for archaeological publication, and (iii) making up some low-tech programme, perhaps a spreadsheet or simple calculator, into which archaeologists could enter the parameters of a particular assemblage to calculate the number of individuals it is most likely to contain. |
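One deliberately simplified way to formalise the question is sketched below, purely as a starting point for discussion. It assumes that each of the n object types survives independently with the same known probability q in every set, so that the count of each type in the assemblage is Binomial(p, q). It places a flat prior on p between the largest observed count and a chosen upper limit, and it ignores the condition that every set retains at least one object. All of these assumptions, and the function names, are hypothetical. Finding realistic replacements for them is the substance of the project.

```python
import math


def log_likelihood(p, counts, q):
    """Log-probability of the observed type counts if p sets were mixed."""
    if p < max(counts):
        return float("-inf")  # cannot see more of one type than there were sets
    return sum(
        math.log(math.comb(p, c)) + c * math.log(q) + (p - c) * math.log(1.0 - q)
        for c in counts
    )


def posterior_over_p(counts, q, p_max):
    """Posterior distribution of p under a flat prior on max(counts)..p_max."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    logs = {p: log_likelihood(p, counts, q)
            for p in range(max(counts), p_max + 1)}
    peak = max(logs.values())
    # Subtract the peak before exponentiating to avoid underflow.
    weights = {p: math.exp(value - peak) for p, value in logs.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


if __name__ == "__main__":
    # Counts of three object types in a mixed assemblage (made-up numbers).
    posterior = posterior_over_p([5, 5, 5], q=0.4, p_max=40)
    print(max(posterior, key=posterior.get))
```

The smallest value of p with non-zero probability is the largest single type count, the minimum number of sets consistent with the assemblage. The posterior shows how much larger p could plausibly be once loss is taken into account.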

Keywords | archaeology, burials, megalithic tombs, sets, probability |

References | |

Prerequisite Skills | |

Other Skills Used in the Project | |

Programming Languages | No Preference |

Work Environment | The overarching project consists of a team of 7 archaeologists, geneticists and chemists in the UK, Italy and Estonia. The student will work with the Cambridge people (John Robb (PI) and Jess Thompson (postdoc)) and is welcome to participate in the general project meetings. The student will work remotely, though, Covid safety permitting, we may be able to have in-person meetings. |

## 3D Mean-Field Theory for Cytoskeleton Organisation in Plant Cells

Project Title | 3D Mean-Field Theory for Cytoskeleton Organisation in Plant Cells |

Contact Name | Tamsin A. Spelman |

Contact Email | tas46@cam.ac.uk |

Company/Lab/Department | Sainsbury Laboratory |

Address | 47 Bateman Street, Cambridge CB2 1LR |

Period of the Project | 8 weeks |

Project Open to | Master's (Part III) students |

Initial Deadline to register interest | |

Background Information | The cytoskeleton is a dynamic fibre network internal to the cell, consisting of microtubules and actin in plants, and essential for many cell functions. The aim of this project is to use a continuum theory to analyse microtubule organisation in 3D and compare the results to output from our detailed computational simulations (which model all fibres individually). |

Brief Description of the Project | The main analytic tool for studying microtubule organisation is Mean-Field Theory (MFT). Rather than modelling each microtubule fibre separately, MFT models the densities of growing/shrinking microtubules, of each length and direction (angle relative to reference direction) at each time, using differential equations. While there are many results using MFT in 2D, only a small amount of work considers MFT in 3D. This project will focus on using MFT to analyse microtubule organisation in 3D. This project can include: using/extending the existing 3D MFT framework to look at cytoskeleton organisation in a plant root hair cell and other cell geometries; and/or considering more mathematically fundamental questions about MFT such as analysing the bifurcation point for the onset of cytoskeleton organisation, identified in 2D but not derived in 3D. These results can be compared to output from our existing computational model (and experimental data if available). This project can be geared more towards mathematical analysis or computational modelling dependent on student interest. |

Keywords | biology, modelling, coding, plants, cytoskeleton |

References |
Introduction is a nice summary of MFT and chapter 3 would be our starting point for this project: PhD Thesis; One of the earliest papers on MFT; Example application of our computational cytoskeleton tool; GitLab page for our computational model Tubulaton |

Prerequisite Skills | |

Other Skills Used in the Project | |

Programming Languages | Python, MATLAB, C++ |

Work Environment | The expectation is that you would need to work remotely due to COVID-19 restrictions. You will be part of the Jonsson group, consisting of 2 PhD students and multiple postdocs, and are welcome at all group activities such as our group meetings and morning coffee meetings, as well as wider lab activities. Supervision will be at least weekly, and more frequent to begin with, and I will be available for further ad hoc discussions beyond that as needed. |

## Modelling virus evolution in age-structured populations

Project Title | Modelling virus evolution in age-structured populations |

Contact Name | Olivier Restif |

Contact Email | or226@cam.ac.uk |

Company/Lab/Department | Department of Veterinary Medicine |

Address | Madingley Road, Cambridge CB3 0ES |

Period of the Project | 8 weeks between July and September |

Project Open to | Undergraduates, Master's (Part III) students |

Initial Deadline to register interest | Friday 26th February 2021 |

Background Information | Like many diseases, COVID-19 varies in its infectivity and severity among age groups; however, it is not clear how this may shape viral evolution. Traditional evolutionary models have shown that a trade-off between infectivity and lethality can lead to the selection of strains with intermediate levels of virulence (similar to a Nash equilibrium in game theory). This prediction may not hold when, for example, transmission is greater in younger age groups who experience milder symptoms. |
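The classical trade-off argument can be illustrated with a short calculation. In a simple well-mixed model without age structure, a strain with transmission rate β, recovery rate γ and disease-induced death rate α (virulence) has basic reproduction number R0 = β / (γ + α). If transmission increases with virulence but with diminishing returns, R0 is maximised at an intermediate virulence. The sketch below assumes the hypothetical trade-off β(α) = b·sqrt(α), for which the maximum lies at α = γ. The functional form, parameter values and function names are illustrative assumptions only, and the age-structured case that is the subject of this project may behave differently.

```python
import math


def basic_reproduction_number(virulence, recovery_rate=0.2, scale=1.0):
    """R0 of a strain under the assumed trade-off beta = scale * sqrt(virulence)."""
    transmission = scale * math.sqrt(virulence)
    # Infections end either by recovery or by death of the host.
    return transmission / (recovery_rate + virulence)


def best_virulence(candidates, recovery_rate=0.2):
    """Virulence value among the candidates that maximises R0."""
    return max(candidates,
               key=lambda a: basic_reproduction_number(a, recovery_rate))


if __name__ == "__main__":
    grid = [i / 100 for i in range(1, 101)]
    print(best_virulence(grid))
```

Strains that are too mild transmit poorly under this assumption, while strains that are too lethal remove their hosts before much transmission can occur, so the optimum sits between the two extremes.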

Brief Description of the Project | The project will adapt existing epidemic models of COVID-19 in age-structured populations to include competing strains with different levels of infectivity and lethality, and predict their relative fitness in different epidemiological contexts. The project will combine analysis of ODE systems, game theory and numerical simulations. |

Keywords | COVID-19, evolution, dynamic models, epidemics |

References | Sébastien Lion, Johan A.J. Metz. 2018. Beyond R0 Maximisation: On Pathogen Evolution and Environmental Dimensions. Trends in Ecology & Evolution, 33:458-473. https://doi.org/10.1016/j.tree.2018.02.004. |

Prerequisite Skills | Numerical Analysis, Mathematical Analysis, Simulation, Mathematical biology |

Other Skills Used in the Project | |

Programming Languages | No Preference |

Work Environment | The student will be embedded in the Disease Dynamics Unit, a multi-disciplinary group working on infectious diseases, and will take part in regular meetings. If restrictions on in-person meetings remain in place over the summer, the project can be conducted and supervised remotely. |

## Deepmath

Project Title | Deepmath |

Contact Name | Dr. Anthony Bordg |

Contact Email | apdb3@cam.ac.uk |

Company/Lab/Department | Department of Computer Science and Technology |

Address | 15 JJ Thomson Avenue, Cambridge CB3 0FD |

Period of the Project | about 8 weeks (flexible start and end dates) |

Project Open to | Undergraduates, Master's (Part III) students |

Initial Deadline to register interest | |

Background Information | Do you want to explore the current frontier of AI techniques? Deep learning for theorem proving is the future. Don't miss out! |

Brief Description of the Project | A first goal would be to create a data set that could be used as a benchmark in a learning environment for the Isabelle/HOL theorem prover. |

Keywords | deep learning, machine learning, interactive theorem proving, Isabelle/HOL, certified mathematics |

References | HOList: An Environment for Machine Learning of Higher-Order Theorem Proving, Bansal et al. |

Prerequisite Skills | |

Other Skills Used in the Project | |

Programming Languages | Python, C++, No Preference, Standard ML, Scala, OCaml |

Work Environment | The student will interact with the ALEXANDRIA team and its collaborators. |

## Developing a method to identify highly similar short substrings of DNA sequence from a pool of longer, variably related DNA sequences

Project Title | Developing a method to identify highly similar short substrings of DNA sequence from a pool of longer, variably related DNA sequences |

Contact Name | Dr Elizabeth Soilleux |

Contact Email | ejs17@cam.ac.uk |

Company/Lab/Department | Department of Pathology, University of Cambridge |

Address | Division of Cellular and Molecular Pathology, Department of Pathology, Cambridge Biomedical Campus |

Period of the Project | 8 weeks |

Project Open to | Undergraduates, Master's (Part III) students |

Initial Deadline to register interest | Friday 26th February 2021 |

Background Information | All cells in the body normally contain identical genomic DNA sequences. B and T-cells are types of lymphocytes and are unique among cells in each having short unique sequences within their genome, produced by rearranging defined short blocks of sequence in a combinatorial fashion, with an added stochastic element of changing the DNA sequences further at the junction points between these blocks of sequence. This provides an extraordinary opportunity in biology to use this unique sequence (essentially a "bar code") to track individual B and T-cells. In this project, we are interested in using such unique sequences to determine whether a sample contains cancerous lymphocytes, known as lymphoma. Because lymphomas develop from one original cell, they all contain the same unique DNA sequences, rather than all being different. |

Brief Description of the Project | We have sequence data from lymphoma samples, in which there will also have been varying percentages (c. 10-90%) of benign lymphocytes. We aim to define the unique bar code region for each individual sequence, but as there are many thousands of sequences, this needs to be done in an automated manner. Because of the novel sequencing method we have employed, the DNA sequences are of different lengths, making aligning them to reference sequences more complicated than with the standard methods used in our laboratory. These sequences also contain parts of the blocks of sequence that were rearranged to make the unique sequences mentioned above, but we need to ignore most of the building block sequence and focus just on the unique bar code sequence. Ultimately, we seek to make an automated, high-throughput pipeline that can tell us the percentage of sequences containing identical "bar codes" in any given sample, in order to determine whether or not it contains lymphoma. |
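The final step of such a pipeline, reporting the share of sequences that carry the same bar code, can be sketched as follows. The example assumes that the bar code region has already been extracted from each read, which is the difficult part of the project and is not attempted here. The optional mismatch tolerance, intended to absorb sequencing errors, and the function names are hypothetical choices rather than part of the laboratory's method.

```python
from collections import Counter


def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def dominant_clone_fraction(barcodes, max_mismatches=0):
    """Fraction of bar codes matching the most common one.

    A bar code counts as a match if it has the same length as the most
    common bar code and differs from it in at most max_mismatches positions.
    """
    if not barcodes:
        raise ValueError("at least one bar code is required")
    reference, _ = Counter(barcodes).most_common(1)[0]
    matching = sum(
        1 for code in barcodes
        if len(code) == len(reference)
        and hamming_distance(code, reference) <= max_mismatches
    )
    return matching / len(barcodes)


if __name__ == "__main__":
    sample = ["ACGT"] * 6 + ["ACGA", "TTTT", "GGGG", "CCCC"]
    print(dominant_clone_fraction(sample))
    print(dominant_clone_fraction(sample, max_mismatches=1))
```

A high dominant fraction suggests a clonal population. Where to place the threshold, given that samples contain a varying proportion of benign lymphocytes, would need to be calibrated on real data.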

Keywords | Bioinformatics, DNA sequence, alignment, analytical pipeline |

References | https://www.repository.cam.ac.uk/handle/1810/309837; doi:10.1016/j.xcrm.2021.100192; doi:10.1002/path.5592; https://jcp.bmj.com/content/71/3/195 |

Prerequisite Skills | Statistics, Mathematical Analysis; ideally computational skills in R or Python, or a willingness to learn. No particular background is necessary. A creative approach, enthusiasm to get to grips with DNA sequences (not very challenging), and a readiness to test out different methods would be most valuable to us. |

Other Skills Used in the Project | Statistics, Numerical Analysis, Mathematical Analysis, Simulation, Database Queries, Data Visualization, App Building |

Programming Languages | Python, R |

Work Environment | We are a multidisciplinary research group ranging from "wet lab" to bioinformatics, with members from maths, engineering, bioengineering, biology and medical backgrounds, and we can provide plenty of support for this project, which can be undertaken remotely in any time zone. We meet weekly on Teams and hold ad hoc meetings in between. We have successfully hosted two mathematicians for summer projects and Computational Biology MPhil projects in the past, both of whom achieved a publication from the work. |
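
The final step described above, estimating the percentage of sequences in a sample that share an identical bar code, can be illustrated with a minimal Python sketch. It assumes the bar code region has already been extracted from each read, which is the hard part of the project and is not attempted here. The function name `dominant_clone_fraction` and the toy sequences are hypothetical and are not part of the group's pipeline or data.

```python
from collections import Counter


def dominant_clone_fraction(barcodes):
    """Return (most common bar code, fraction of reads that carry it).

    `barcodes` is a list of already-extracted bar code strings, one per read.
    """
    if not barcodes:
        raise ValueError("no bar codes supplied")
    counts = Counter(barcodes)
    barcode, n_reads = counts.most_common(1)[0]
    return barcode, n_reads / len(barcodes)


# Toy data (made up for illustration): six reads share one bar code,
# the remaining four are all different, as benign lymphocytes would be.
reads = ["ACGTTGCA"] * 6 + ["ACGTAGCA", "TTGACCGT", "GGCATACG", "CATGCATG"]
barcode, fraction = dominant_clone_fraction(reads)
print(f"dominant bar code {barcode} in {fraction:.0%} of reads")
```

Exact string matching is the simplest possible choice. A real pipeline would probably need to tolerate sequencing errors, for example by clustering bar codes within a small edit distance.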

## Optimal transport for light microscopy

Project Title | Optimal transport for light microscopy |

Contact Name | Leila Muresan, Yury Korolev, Jerome Boulanger |

Contact Email | lam94@cam.ac.uk |

Company/Lab/Department | Cambridge Advanced Imaging Centre / PDN, DAMTP and MRC-LMB |

Address | PDN, Downing Site, Cambridge CB2 3DY |

Period of the Project | 8 weeks |

Project Open to | Undergraduates, Master's (Part III) students |

Initial Deadline to register interest | Friday 26th February 2021 |

Background Information | Biological tissue can undergo significant morphological changes over time. These changes occur over multiple scales. At the largest scale, they include the establishment of the whole-body pattern during development. At a smaller scale, fluxes of labelled proteins involved in processes such as membrane trafficking between organelles provide useful insights into the underlying biological mechanisms. |

Brief Description of the Project | A promising approach to modelling fluxes in biological processes is optimal transport [1]. Optimal transport studies couplings of probability measures that redistribute the mass of one measure onto another in a way that minimises a given cost. It has found many applications in imaging and data science in the past decade. The aim of this project is to explore the potential of optimal transport approaches for video microscopy data. The project will benefit from the support of groups in the MRC Laboratory of Molecular Biology, DAMTP and the Cambridge Advanced Imaging Centre. |

Keywords | optimal transport, image processing, light microscopy |

References | [1] G. Peyré, M. Cuturi. Computational Optimal Transport. Foundations and Trends in Machine Learning, Vol. 11, No. 5-6, pp. 355-607. Available in open access: https://optimaltransport.github.io/book/ |

Prerequisite Skills | |

Other Skills Used in the Project | |

Programming Languages | No Preference |

Work Environment | The student will be part of a multi-disciplinary team (Jerome Boulanger and Nick Barry, LMB; Yury Korolev and Carola Schoenlieb, DAMTP; Leila Muresan, CAIC). Given the Covid situation, remote working is possible. |
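
To make the idea of an optimal coupling concrete, here is a minimal pure-Python sketch of entropy-regularised optimal transport computed with Sinkhorn iterations, one of the standard numerical schemes covered in [1]. The project description does not specify a method, so this choice is an assumption. The two four-pixel intensity profiles, the squared-distance cost, and the parameters `eps` and `n_iter` are also made up for illustration.

```python
import math


def sinkhorn(a, b, cost, eps=0.2, n_iter=2000):
    """Approximate the optimal transport plan between histograms a and b.

    a, b  : lists of positive weights, each summing to 1
    cost  : cost[i][j] is the cost of moving mass from bin i to bin j
    eps   : entropic regularisation strength (smaller = sharper plan)
    """
    n, m = len(a), len(b)
    # Gibbs kernel built from the cost matrix
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the two marginals
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: intensity concentrated on the left moves to the right.
x = [0.0, 1 / 3, 2 / 3, 1.0]                  # pixel positions
a = [0.4, 0.4, 0.1, 0.1]                      # profile at time t
b = [0.1, 0.1, 0.4, 0.4]                      # profile at time t + 1
cost = [[(xi - xj) ** 2 for xj in x] for xi in x]

plan = sinkhorn(a, b, cost)
total_cost = sum(plan[i][j] * cost[i][j] for i in range(4) for j in range(4))
print("regularised transport cost:", total_cost)
```

Entry `plan[i][j]` is the amount of mass moved from pixel `i` to pixel `j`; its row sums reproduce `a` and its column sums reproduce `b`. For real image data one would more likely use a dedicated library than hand-written loops.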