
Summer Research Programmes

 

Academic CMP Project Proposals from Summer 2024

 

Pandemic-scale genomic epidemiology

Project Title Pandemic-scale genomic epidemiology
Keywords Genomics, bioinformatics, phylogenetics, mathematical modelling, data analysis.
Project Listed 8 January 2024
Project Status Filled
Contact Name Nicola De Maio
Contact Email demaio@ebi.ac.uk
Company/Lab/Department Goldman group at EMBL-EBI
Address EMBL's European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD 
Project Duration We are flexible, but we prefer longer internships.
Project Open to Undergraduates, Master's (Part III) students
Background Information The COVID-19 pandemic has shown us the importance of genomic epidemiology for tracking the spread of transmissible pathogens, reconstructing their origin, evaluating possible containment measures, and identifying new variants of concern. Millions of SARS-CoV-2 genomes have been sequenced and shared in the last few years, and phylogenetic tools to make sense of datasets of this size have been developed, including by the Goldman group at EMBL-EBI. Despite this, fundamental phylogenetic analyses, such as those in viral phylogeography (the phylogeny-based reconstruction of transmission histories and patterns between geographic locations or social groups) and phylodynamics (the phylogeny-based study of pathogen prevalence and reproductive fitness over time, location, and genotype), are still limited to small datasets (typically below 1,000 genomes) and therefore cannot exploit the large collections of genomic data that will become ever more prevalent in genomic epidemiology. Additionally, genomic epidemiological data analyses are plagued by sequence errors and contamination: given the typically low divergence of the genomes considered, a small number of consensus sequence errors can have a high impact on downstream analyses, and we currently have no straightforward way to address this problem.
Project Description

With this project, we will consider one or more of the following questions and tasks:

1) Which approach is better for viral and bacterial phylogeography and phylodynamics when large numbers of genomes are available? Is it better to downsample a given dataset and use accurate Bayesian methods, or is it better to analyse large collections of genomes with more approximate approaches?
2) How can we best deal with errors and contamination in genome sequences: is it better to filter out presumably unreliable genomes and genome positions, or is it better to explicitly account for these within phylogenetic methods?
3) Can we apply methods developed for SARS-CoV-2 phylogenetics (such as UShER, RIPPLES and MAPLE) to large collections of genomes from a broad set of pathogens (for example bacterial ones)? That is, are we ready for the next pandemic, and if not, what issues will our field need to consider?

The project will deliver important recommendations to the genomic epidemiological field on the best way to analyse large collections of genomes.

Useful skills, depending on the specific project chosen, include the ability to analyse large collections of genome data, an interest in method development (in particular in Python), and the ability to understand complex statistical and computational methods such as those involved in Bayesian phylodynamics and phylogeography.

Work Environment

Work will be carried out under the daily supervision of Nicola De Maio (Scientist), and the weekly supervision of Nick Goldman (Group Leader). We only accept in-person internships, but it is possible to work from home 2 days a week. Supervision can be more or less involved, depending on the student's preferences: in the past we have had students who preferred to discuss matters daily, and students who preferred weekly meetings. Usually more frequent meetings are useful at the beginning of the internship, and less so towards the end. Supervisors will be on hand in the office most of the time (usually about 4 days a week) for any questions, or reachable via Slack, Zoom, and email.

Working times are very flexible, but typically most of the group is in the office between 10:00 and 16:00.

References

https://www.nature.com/articles/s41588-023-01368-0
https://www.nature.com/articles/s41586-022-05189-9
https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005421
https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1009175

Prerequisite Skills Statistics, Probability/Markov Chains, Simulation, Data Visualization
Other Skills Used in the Project Statistics, Probability/Markov Chains, Simulation, Data Visualization, programming and data analysis skills, use of the terminal and LaTeX. Ability to deal efficiently with large collections of data, in particular genomic data. Algorithmics.
Acceptable Programming Languages Python, R, C++
Additional Information Please note that the maximum internship bursary that EMBL-EBI is able to offer is £1,280 per month (£2,560 for a typical 2-month internship).

 

Model comparison and robust estimators for equilibria in supramolecular chemistry

Project Title Model comparison and robust estimators for equilibria in supramolecular chemistry
Keywords Model Comparison, Model Structural Error, Nonlinear Regression, Bayesian Inference, Maximum Likelihood Estimation
Project Listed 8 January 2024
Project Status Filled
Contact Name Daniil Soloviev
Contact Email dos23@cam.ac.uk
Company/Lab/Department Department of Chemistry
Address Yusuf Hamied Department of Chemistry, Lensfield Road, Cambridge, CB2 1EW
Project Duration Any 8-10 week period during the summer
Project Open to Undergraduates, Master's (Part III) students
Background Information

In the Hunter Group, we are developing models that can predict the strength of interactions between molecules, with applications ranging from designing self-replicating molecules [1] to diagnosing Alzheimer's and Parkinson's [2].

Every interaction can be described by an equilibrium constant [3], which is one of the most important properties in supramolecular chemistry, as it predicts when and how strongly molecules will bind to each other. An equilibrium constant can be measured using a titration experiment: by letting molecules gradually bind to each other to form a supramolecular complex, we can observe changes in properties such as light absorbance. By measuring these properties, and fitting the observed data to a model using non-linear regression, we can determine the value of the equilibrium constant. A more detailed introduction can be found in reference [4].
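
As a concrete illustration of the kind of non-linear regression involved, the sketch below fits a simplified 1:1 binding isotherm to simulated titration data with SciPy. It is an illustrative example only, not the Hunter group's actual code, and it assumes the guest is in excess so that the signal follows a simple hyperbolic isotherm.

    # A minimal sketch (illustrative assumptions only): fitting a simplified 1:1
    # binding isotherm to simulated titration data by non-linear least squares.
    import numpy as np
    from scipy.optimize import curve_fit

    def isotherm(guest_conc, K, d_eps):
        # Simplified 1:1 model: the observed signal change is proportional to the
        # fraction of host bound, assuming the guest is in large excess.
        return d_eps * K * guest_conc / (1.0 + K * guest_conc)

    rng = np.random.default_rng(0)
    guest = np.linspace(0.0, 5e-3, 15)              # guest concentration / M
    true_K, true_d_eps = 2.0e3, 0.8                 # "true" values used to simulate data
    signal = isotherm(guest, true_K, true_d_eps) + rng.normal(0.0, 0.01, guest.size)

    popt, pcov = curve_fit(isotherm, guest, signal, p0=[1e3, 1.0])
    K_est, K_err = popt[0], np.sqrt(pcov[0, 0])     # estimate and linearised standard error
    print(f"K = {K_est:.3g} +/- {K_err:.2g} M^-1")

Whether least-squares estimates and linearised error bars like these remain trustworthy once experimental error and model incompleteness are taken seriously is exactly the kind of question this project addresses.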

However, most of the techniques chemists use to estimate the value of an equilibrium constant rely on a number of mathematical assumptions, which may not always be valid and can lead to inaccurate results. In addition, when different models can be used to analyse the data (e.g. does a protein bind one or two drug molecules?), it is difficult to prove which model provides a better description. We have recently created a new program to analyse titration data called Musketeer [5], and are looking for a mathematician who can help us develop a method which accounts for these issues.

Project Description

The aim of the project is to work on two key questions, which will help us measure equilibrium constants more accurately and with greater confidence:

1. Given a model, how do we determine the best estimate for an equilibrium constant from a titration experiment? Can we take into account the various non-linear sources of experimental error, and the fact that every model we use is always incomplete? And can we obtain meaningful error bounds on the estimated parameters? The best-case outcome of the project is to derive a formula for a robust estimator, or any estimator which can be shown to work better than the currently used least-squares method.

2. Given two or more models, which describe alternative hypotheses for what is happening on a molecular level, how do we choose which one to use? Adding an extra parameter to a model will always improve the fit, but it's not trivial to determine whether that's because the parameter describes a real process, or is just overfitting noise. The best-case outcome of the project is to find a way to quantify how likely a parameter is to be meaningful, and output it in a way that an experimental chemist can easily understand and publish.
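
To make the model-comparison question concrete, the sketch below compares two candidate models of the same titration data with the Akaike Information Criterion, one standard (but not the only, and not necessarily the best) way of penalising extra parameters. The residual sums of squares are hypothetical numbers, not real results.

    # Illustration only: comparing a simpler and a more complex binding model
    # fitted to the same data, via AIC computed from residual sums of squares.
    import numpy as np

    def aic_from_rss(rss, n_points, n_params):
        # Gaussian-error AIC, up to an additive constant shared by both models.
        return n_points * np.log(rss / n_points) + 2 * n_params

    n = 15                                   # number of titration points
    rss_1to1, k_1to1 = 2.3e-3, 2             # hypothetical fit of a 1:1 model
    rss_1to2, k_1to2 = 1.9e-3, 4             # hypothetical fit of a 1:2 model
    delta = aic_from_rss(rss_1to2, n, k_1to2) - aic_from_rss(rss_1to1, n, k_1to1)
    print("extra parameters look justified" if delta < 0 else "prefer the simpler model")

Part of the project would be to judge whether criteria of this kind are adequate here, or whether something more tailored (for example Bayesian model evidence or resampling-based tests) is needed.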

During the project, you will be able to work with simulated datasets to investigate how different regression methods compare, and use the best techniques you identify on real experimental data from our lab. Some work has been done by other research groups on the two questions outlined above [6, 7]; you can choose to explore these methods further, or decide to try a completely different approach. Any equations and techniques which you discover can be included in our software, so that they can be used both by our lab and by other researchers across the world.

Work Environment The student will join the Hunter Group in the Chemistry Department, working closely with a PhD student (Daniil Soloviev).
References [1] Núñez-Villanueva, D. and Hunter, C.A. (2023) 'Replication of synthetic recognition-encoded oligomers by ligation of trimer building blocks', Organic Chemistry Frontiers, 10(23), pp. 5950-5957. Available at: https://doi.org/10.1039/D3QO01717F
[2] Chisholm, T.S. and Hunter, C.A. (2024) 'A closer look at amyloid ligands, and what they tell us about protein aggregates', Chemical Society Reviews [Preprint]. Available at: https://doi.org/10.1039/D3CS00518F
[3] Clark, J. (2015) 'Equilibrium constants - Kc'. Available at: https://www.chemguide.co.uk/physical/equilibria/kc.html
[4] Thordarson, P. (2012) 'Binding Constants and Their Measurement', in Supramolecular Chemistry. John Wiley & Sons, Ltd. Available at: https://doi.org/10.1002/9780470661345.smc018
[5] Soloviev, D.O. (2023) 'daniilS/Musketeer'. Available at: https://github.com/daniilS/Musketeer
[6] Kazmierczak, N.P., Chew, J.A. and Vander Griend, D.A. (2022) 'Bootstrap methods for quantifying the uncertainty of binding constants in the hard modeling of spectrophotometric titration data', Analytica Chimica Acta, 1227, p. 339834. Available at: https://doi.org/10.1016/j.aca.2022.339834
[7] Barrans Jr., R.E. and Dougherty, D.A. (1994) 'An improved method for determining bimolecular association constants from NMR titration experiments', Supramolecular Chemistry, 4(2), pp. 121-130. Available at: https://doi.org/10.1080/10610279408029871
Prerequisite Skills Statistics
Other Skills Used in the Project App Building, Model comparison/selection
Acceptable Programming Languages Python, Any language that interfaces with Python

 

How do plant cells develop, grow and communicate?

Project Title How do plant cells develop, grow and communicate?
Keywords Plant growth, Mathematical biology, mechanical modelling, reaction-diffusion systems, geometry
Project Listed 8 January 2024
Project Status Filled
Contact Name Euan Smithers
Contact Email euan.smithers@slcu.cam.ac.uk
Company/Lab/Department Sainsbury Laboratory
Address Sainsbury Laboratory, Bateman Street, Cambridge CB2 1LR
Project Duration 8 weeks, full time
Project Open to Undergraduates, Master's (Part III) students
Background Information

Plants are fundamental to our world, and research into their mechanisms can allow us to develop more resilient crops, an essential need for the problems the future will bring. Plants also offer many exciting mysteries ripe for mathematical modelling and insight.

Our goal is to understand how mechanics, cell division and chemical signalling interact to allow plants to develop and grow. Plant cells are rigidly connected, so where they decide to divide and grow needs to be heavily coordinated to determine their overall growth and tissue shape, which is of vital importance for plant functionality and efficiency. To investigate plant development, we apply experimental and modelling approaches.

Project Description

Join us for a project and you can help develop tools to understand plant development. There is flexibility in what you choose to do, but we apply a mix of image analysis, mechanical modelling and reaction-diffusion modelling. For this project, you will have access to actual data and experience working directly with experimentalists in an interdisciplinary environment.

Some possible projects include: the consequences of cell and tissue topology/geometry for plant tissues; the effect of different cell arrangements/networks on cell-cell communication; and how mechanical stress can affect plant tissues.
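
To give a flavour of the reaction-diffusion side of this work, here is a deliberately minimal 1D example (the Fisher-KPP equation) solved with an explicit finite-difference scheme. It is a toy illustration only; real projects in the group use tissue geometries and cell-level models rather than this simple setting.

    # Minimal illustrative reaction-diffusion simulation: a travelling wave of a
    # signal u governed by du/dt = D * d2u/dx2 + r * u * (1 - u).
    import numpy as np

    D, r = 1.0, 1.0                     # diffusion coefficient and reaction rate
    nx, dx = 200, 0.5                   # grid size and spacing
    dt = 0.4 * dx**2 / (2 * D)          # time step kept below the explicit stability limit
    u = np.zeros(nx)
    u[:10] = 1.0                        # initial patch of signal at one end

    for _ in range(200):
        u_ext = np.pad(u, 1, mode="edge")                     # zero-flux boundaries
        lap = (u_ext[2:] - 2 * u + u_ext[:-2]) / dx**2        # discrete Laplacian
        u += dt * (D * lap + r * u * (1 - u))

    print(f"travelling front is currently near x = {np.argmax(u < 0.5) * dx:.1f}")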

Work Environment

The student will work with the Robinson lab group as a team and will be primarily supervised by a post-doc, available to talk and provide support at any time. They will also have weekly meetings with the group leader.

There are no strict hours, but the post-doc supervisor will be available during regular work hours. The student will get a desk and a computer at the Sainsbury Laboratory so they can do the work.

The Sainsbury Laboratory is a great working environment, with various sports groups and organised social events, including some just for the summer students.

References https://www.slcu.cam.ac.uk/research/robinson-group
Prerequisite Skills PDEs, Simulation, Predictive Modelling
Other Skills Used in the Project Numerical Analysis, Simulation, Predictive Modelling
Acceptable Programming Languages Python, MATLAB, C++

 

Parameter inference from time-lapse images

Project Title Parameter inference from time-lapse images
Keywords modelling, parameter inference, image analysis
Project Listed 8 January 2024
Project Status Filled
Contact Name Elise Laruelle
Contact Email elise.laruelle@slcu.cam.ac.uk
Company/Lab/Department Sainsbury Laboratory
Address Sainsbury Laboratory, Bateman Street, Cambridge CB2 1LR
Project Duration 8 weeks, full time
Project Open to Undergraduates, Master's (Part III) students
Background Information

During their lifetime, cells follow a programme to grow and divide, called the cell cycle. Each cell can follow a different programme depending on factors such as its function or its environment. Understanding how cells modulate this cycle in response to these factors is challenging because of the time the process can take and the variability within the studied tissue. Tracking the phases of the cell cycle with enough observations to reach statistical significance requires many time points.

Moreover, plant tissue is sensitive to the light used for imaging, which adds stress and could perturb the recorded process. To extract the duration of the different cell cycle phases, researchers currently make compromises that could be avoided by using modelling methods to infer the measurement instead. Such an analysis would give results faster, improve their statistical significance, and extend the scope of the analysis.

Project Description

The current method needs to track cells for several hours to obtain perhaps one or two measurements, so the aim of this project is to infer a biological parameter from microscopy images. The project includes:
- implementing the method (preferably in Python)
- extending the set of data needed as input to the method (image processing)
- applying the method and analysing the results, possibly proposing a new image acquisition protocol

To extract the measurement, a set of images has been generated in our group, and some of them have been pre-analysed to prepare for this development.

The student will be in charge of adapting a parameter estimation method based on a Markov chain model. They will then apply the method to answer the question: how long are the cell cycle phases?
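
As a minimal illustration of the kind of Markov-chain reasoning involved (with made-up data and an assumed geometric dwell-time model, not the group's actual method), mean phase durations can be estimated from per-frame phase labels:

    # Cells are imaged every dt hours and labelled with a cell-cycle phase.
    # If each phase is left with a fixed probability per frame, dwell times are
    # geometric and the mean phase duration is dt / (1 - p_stay).
    import numpy as np

    dt_hours = 0.5
    # Hypothetical phase tracks for two cells: 0 = G1, 1 = S, 2 = G2/M.
    tracks = [
        np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 0]),
        np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 0]),
    ]

    n_phases = 3
    stay = np.zeros(n_phases)      # transitions that remain in the same phase
    leave = np.zeros(n_phases)     # transitions that move on to another phase
    for track in tracks:
        for a, b in zip(track[:-1], track[1:]):
            (stay if a == b else leave)[a] += 1

    p_stay = stay / (stay + leave)
    print("estimated mean phase durations (hours):", dt_hours / (1 - p_stay))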

Work Environment

The student will be part of Sarah Robinson’s group and supervised by a computational biologist (postdoc). The main activity will be method development (programming).

The development of the method will be a project in its own right, but it is based on a bigger project within the group. To understand the data and the biology, the student will be able to discuss with the experimentalist (postdoc) developing the main project.

Sarah Robinson’s group is multidisciplinary, so the student will interact with biologists as well as mathematicians and biochemists. The student will manage their own working hours and their presence in the lab as needed to complete the project.

The student is encouraged to take part in the life of the group and of the wider lab.

References

Grewal, Jasleen K., Martin Krzywinski, and Naomi Altman. "Markov models—Markov chains." Nat. Methods 16.8 (2019): 663-664. https://doi.org/10.1038/s41592-019-0476-x
Desvoyes, B., Arana-Echarri, A., Barea, M.D. et al. A comprehensive fluorescent sensor for spatiotemporal cell cycle analysis in Arabidopsis. Nat. Plants 6, 1330–1334 (2020). https://doi.org/10.1038/s41477-020-00770-4
Kumud Saini, Aditi Dwivedi, Aashish Ranjan, High temperature restricts cell division and leaf size by coordination of PIF4 and TCP4 transcription factors, Plant Physiology, Volume 190, Issue 4, December 2022, Pages 2380–2397, https://doi.org/10.1093/plphys/kiac345

Prerequisite Skills Probability/Markov Chains
Other Skills Used in the Project Image processing, Predictive Modelling
Acceptable Programming Languages Python

 

Microtubule organisation in plant cells

Project Title Microtubule organisation in plant cells
Keywords Plants; Mathematical biology; computational; cytoskeleton; organisation
Project Listed 9 January 2024
Project Status Filled
Contact Name Tamsin A Spelman
Contact Email Tamsin.Spelman@slcu.cam.ac.uk
Company/Lab/Department Sainsbury Laboratory, University of Cambridge
Address Sainsbury Laboratory, University of Cambridge, 47 Bateman Street, Cambridge, CB2 1LR
Project Duration 8 weeks between June and September
Project Open to Undergraduates, Master's (Part III) students
Background Information Cortical microtubules (MTs) are long, thin active fibres found on the surface of plant cells. MT organisation within growing cells contributes to asymmetric growth generating a myriad of different cell shapes. In the growing plant stem, MTs align transversely before realigning more longitudinally when mature. This supports the anisotropic cell growth. However, without any additional cues, for example from an external stress, we would expect MTs to align along the long axis of the cell [1]. The effect of gradients of external stress on the microtubule organisation has started to be considered in the literature [4].
Project Description

There are two possible projects. The first involves exploring the feasibility of utilising images generated by our microtubule modelling software to train a convolutional neural network for the purpose of detecting microtubules in microscopy images. The second project will use our modelling software to investigate the impact of locally applied external stress on the overall organisation of microtubules within a plant cell. Specifically, the focus will be on determining the size and intensity of external stress required to induce a reorientation of the microtubule array.

We have open-source software implemented in C++ [2] for simulating microtubules within different cell shapes, which has been developed over multiple years [1,3]. This project will involve running (and, if there is interest, also developing) this microtubule modelling software and analysing the results. We normally perform data analysis in Python, but you can use your preferred software. Using high-performance computing (HPC) resources, either our internal cluster or the university cluster, may be necessary to obtain sufficient data for statistical averaging.
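
As an example of the style of downstream analysis (illustrative only; the angles and weighting below are assumptions, not the tubulaton output format), a 2D nematic order parameter can be computed from the angles of simulated microtubule segments to quantify how aligned the array is:

    # Order parameter S: 1 for perfectly aligned segments, near 0 for an isotropic array.
    import numpy as np

    def order_parameter(angles, lengths=None):
        w = np.ones_like(angles) if lengths is None else lengths
        c = np.average(np.cos(2 * angles), weights=w)
        s = np.average(np.sin(2 * angles), weights=w)
        return np.hypot(c, s)

    rng = np.random.default_rng(1)
    aligned = rng.normal(0.0, 0.1, 1000)           # angles clustered around 0 rad
    isotropic = rng.uniform(0.0, np.pi, 1000)      # angles spread uniformly
    print(order_parameter(aligned), order_parameter(isotropic))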

Work Environment You will be part of the Jonsson group, currently consisting of 10 group members and led by Professor Henrik Jonsson, the director of the Sainsbury Laboratory. You are welcome at all group activities, such as our group meetings (held weekly, though we expect a few weeks' break during the summer), and wider lab activities such as twice-weekly coffee gatherings and other events organised by the lab social societies. Supervision will be at least weekly, and more frequent to begin with, and I will be available for further conversations as needed. Meetings with Henrik will be more occasional.
References

[1] P. Durand-Smet et al. (2020) Cytoskeleton organization in isolated plant cells under geometry control. Proc. Natl. Acad. Sci. 202003184
[2] https://gitlab.com/slcu/teamHJ/tubulaton
[3] V. Mirabet et al. (2018) The self-organization of plant microtubules inside the cell volume yields cortical localization, stable alignment, and sensitivity to external cues. PLoS Comp Biol.
[4] J. Li, D. B. Szymanski and T. Kim (2023) Probing stress-regulated ordering of the plant cortical microtubule array via a computational approach. BMC Plant Biol 23, 308

Prerequisite Skills  
Other Skills Used in the Project  
Acceptable Programming Languages No Preference

 

Embryo segmentation in microscopy images using AI

Project Title Embryo segmentation in microscopy images using AI
Keywords deep learning, AI, image segmentation, microscopy images, biological samples
Project Listed 19 January 2024
Project Status Filled
Contact Name Anita Karsa
Contact Email ak2557@cam.ac.uk
Company/Lab/Department Cambridge Advanced Imaging Centre, Department of Physiology, Development and Neuroscience
Address Anatomy Building, Downing Site, CB2 3DY
Project Duration full time, 8 weeks between June and mid-August
Project Open to Undergraduates, Master's (Part III) students
Background Information Biologists can obtain critical information on fertility by studying the preimplantation development of mouse embryos. Light sheet microscopy enables them to acquire high-resolution, 4D (3D + time) images to investigate cell division and cellular arrangement across time. 3D cell segmentation is a crucial first step in the quantitative analysis of these images. However, given the enormous size of these rich datasets, manual segmentation is extremely time-consuming. Automatic 3D cell segmentation using AI could be a powerful technique to identify each individual cell [1][2]. However, many of the available AI tools do not generalise well to a wide range of data making them challenging to tailor for specific applications. Therefore, there is a need for developing bespoke solutions to these image processing problems.
Project Description The broad theme of this project is improving and tailoring 3D cell segmentation methods to light sheet microscopy images of mouse embryos, primarily using AI tools. For example, one of the aims could be to develop and test automated image processing methods to filter non-specific labelling and/or delineate the embryonic boundary. You will have access to annotated, 4D (3D + time) mouse embryo images and various computational resources, including the CSD3 cluster. You will work as part of a specialised team of computer/data scientists, biologists, and research software engineers who will support you with this challenging, highly interdisciplinary project.
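
Whatever segmentation tool ends up being used, a recurring step is scoring predictions against the manual annotations. The sketch below (with hypothetical toy volumes standing in for real data) computes a per-cell Dice coefficient between a predicted and an annotated 3D label image:

    # Evaluate a 3D segmentation against ground truth with per-cell Dice scores.
    import numpy as np

    def dice(pred_mask, true_mask):
        inter = np.logical_and(pred_mask, true_mask).sum()
        return 2.0 * inter / (pred_mask.sum() + true_mask.sum())

    def per_cell_dice(pred_labels, true_labels):
        # Score each annotated cell against the predicted label that overlaps it most.
        scores = {}
        for cell_id in np.unique(true_labels):
            if cell_id == 0:                      # 0 = background
                continue
            true_mask = true_labels == cell_id
            overlapping = pred_labels[true_mask]
            has_overlap = (overlapping > 0).any()
            best = np.bincount(overlapping[overlapping > 0]).argmax() if has_overlap else 0
            scores[int(cell_id)] = dice(pred_labels == best, true_mask) if best else 0.0
        return scores

    # Toy 3D volumes with two labelled "cells" and a slightly shifted prediction.
    true = np.zeros((4, 8, 8), dtype=int)
    true[:, :4, :4] = 1
    true[:, 4:, 4:] = 2
    pred = np.roll(true, 1, axis=1)
    print(per_cell_dice(pred, true))
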
Work Environment You will be based in the Cambridge Advanced Imaging Centre (CAIC) office, Anatomy Building, Downing Site, CB2 3DY. You are likely to work individually most of the time under the daily supervision of Anita Karsa (postdoctoral researcher), and have regular meetings with Leila Muresan (group leader) and Jerome Boulanger (scientist). Supervision can be more or less involved, depending on your preferences and needs. Flexible working is possible but, typically, most of the team is in the office between 10 am and 4 pm. Some of the work can be performed remotely, but support is more readily available at the office.
References [1] Schmidt, Uwe, et al. "Cell detection with star-convex polygons." Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part II 11. Springer International Publishing, 2018.
[2] Weigert, Martin, et al. "Star-convex polyhedra for 3D object detection and segmentation in microscopy." Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2020.
Prerequisite Skills  
Other Skills Used in the Project Image processing, Data Visualization, Machine learning
Acceptable Programming Languages Python

 

Shaping electrical stimulation in hearing implants

Project Title Shaping electrical stimulation in hearing implants
Keywords hearing, neuroprosthetics
Project Listed 1 February 2024
Project Status Filled
Contact Name Dorothée Arzounian
Contact Email dorothee.arzounian@mrc-cbu.cam.ac.uk
Company/Lab/Department MRC Cognition and Brain Sciences Unit
Address 15 Chaucer Road, Cambridge CB2 7EF
Project Duration Ideally 8 weeks minimum, full time
Project Open to Undergraduates, Master's (Part III) students
Background Information Some people with profound deafness receive so-called cochlear implants that restore a sensation of sound by stimulating the auditory nerve with electrical currents. The electrical stimulation of the implant bypasses the outer, middle and inner ear, and this shortcut represents both a challenge and an opportunity. It is a challenge because artificially reproducing the natural neural stimulation pattern of a functioning ear requires a good knowledge of the mechanical and physiological functioning of these organs, and is limited by technological constraints. It is an opportunity because it allows the exploration of our perception of neural excitation patterns of the auditory nerve that are impossible to generate acoustically.
Project Description

The mathematician will be given the opportunity to address different types of problems according to their personal interests.

One problem concerns the optimization of inputs for system identification. The generation of neural activity in the auditory nerve as a result of electrical stimulation follows complex physical and physiological mechanisms that have been modelled to some extent and can be simulated computationally. Some of the parameters of these mechanisms are known to vary between different implant recipients and between electrode contacts of a given implant (there are between 16 and 22 electrodes depending on the device). Having patient- and contact-specific estimations of these parameters would be very valuable because it would allow one to optimize the parameters of the cochlear implant for each recipient. One question is, if we had access to the continuous neural firing pattern of the auditory nerve, how should we shape the electrical stimulation in order to obtain maximal information for the estimation of our parameters of interest, and how should we compute this estimate?

The mathematician could alternatively work on a technique that aims to improve the so-far often-imperfect perception of speech with a cochlear implant. Their task would be to suggest ways to modify the stimulation patterns that are associated with different phonemes of speech with the aim to exaggerate the differences between these phonemes (e.g. /d/ versus /g/, /t/ versus /k/, /a/ versus /e/, etc.), with some constraints to guarantee the naturalness of the induced sound sensation. This problem could be addressed incrementally, first working independently on specific pairs of phonemes, and later addressing jointly a set of multiple pairs within an ensemble of phonemes.

Another option would be to work on proposing computationally efficient ways to invert existing models of neural excitation by cochlear implants. An inverse model that can determine what patterns of electrical stimulation produce target neural excitation patterns is very valuable because it could potentially be used to recreate an excitation pattern that is more similar to that obtained in a healthy human ear in response to sound.
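
To make the idea of inverting an excitation model concrete, here is a toy linear sketch: if a matrix maps electrode currents to a neural excitation pattern, a target pattern can be approximately inverted by regularised least squares. The spread function, target shape and penalty are all illustrative assumptions; the real electrode-to-excitation models are non-linear, which is exactly why efficient inversion is an open problem.

    # Toy linear "inverse model": choose electrode currents so that the resulting
    # excitation pattern approximates a target, via ridge-regularised least squares.
    import numpy as np

    n_sites, n_electrodes = 200, 16
    sites = np.linspace(0.0, 1.0, n_sites)[:, None]         # neural sites along the cochlea
    centres = np.linspace(0.0, 1.0, n_electrodes)[None, :]  # electrode positions
    A = np.exp(-((sites - centres) ** 2) / (2 * 0.05**2))   # assumed current-spread matrix

    target = np.exp(-((sites[:, 0] - 0.4) ** 2) / (2 * 0.03**2))  # desired excitation pattern

    lam = 1e-2                                               # ridge penalty keeps currents bounded
    currents = np.linalg.solve(A.T @ A + lam * np.eye(n_electrodes), A.T @ target)
    achieved = A @ currents
    print("relative error:", np.linalg.norm(achieved - target) / np.linalg.norm(target))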

Work Environment The mathematician will get supervision from and regular meetings with Dorothée Arzounian and Lidea Shahidi. They will also be invited to join the weekly meetings of our research group comprising 9 researchers. They will be hosted in the MRC Cognition and Brain Sciences Unit, with opportunities to interact with researchers working in different fields of neuroscience and cognitive science.
References

Interactive demo introduction to cochlear implants: https://deephearinglab.mrc-cbu.cam.ac.uk/ci-fi/

Computational model of cochlear implant hearing:
Brochier, T., Schlittenlacher, J., Roberts, I., Goehring, T., Jiang, C., Vickers, D., & Bance, M. (2022). From Microphone to Phoneme: An End-to-End Computational Neural Model for Predicting Speech Perception With Cochlear Implants. IEEE Transactions on Biomedical Engineering, 69(11), 3300–3312. https://doi.org/10.1109/TBME.2022.3167113

Prerequisite Skills  
Other Skills Used in the Project Numerical Analysis, Simulation
Acceptable Programming Languages Ideally Matlab and Python

 

Modelling tree growth responses to climate change

Project Title Modelling tree growth responses to climate change
Keywords tree growth climate carbon photosynthesis
Project Listed 1 February 2024
Project Status Filled
Contact Name Professor Andrew D. Friend
Contact Email adf10@cam.ac.uk
Company/Lab/Department Department of Geography
Address The David Attenborough Building, Pembroke Street, Cambridge, CB2 3QZ
Project Duration 8 weeks in the summer, full time
Project Open to Undergraduates, Master's (Part III) students
Background Information Tree growth is typically considered a direct function of photosynthesis, and there are many models that use this relationship. We have a joint experimental-modelling project to reassess this approach by focussing more on the process of growth itself. We are growing small trees in growth chambers and in the field, measuring them, and developing and testing model approaches to simulating their growth dynamics. The project is more fully described here: https://gtr.ukri.org/projects?ref=NE%2FW000199%2F1
Project Description We would like the student to help develop and test new approaches to simulating the coupled dynamics of carbon in trees with their growth. We are working with a number of ideas regarding internal metabolic feedback that require systematic evaluation, mathematical exploration, and efficient coding. A successful outcome would be a series of analytical solutions to the coupled behaviour of the tree metabolic system.
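
As a deliberately over-simplified, purely illustrative example of the style of coupled carbon-growth model involved, the sketch below integrates a two-variable system in which photosynthesis supplies a carbon pool and growth draws on it; the actual model structures under study are the group's own and differ from this toy system.

    # Toy coupled carbon/growth ODE system, integrated numerically with SciPy.
    from scipy.integrate import solve_ivp

    def tree(t, y, photo=1.0, k_growth=0.5, k_resp=0.1):
        carbon, biomass = y
        growth = k_growth * carbon * biomass / (1.0 + carbon)   # sink limited by substrate
        dcarbon = photo * biomass - growth - k_resp * carbon    # source minus sinks
        dbiomass = 0.3 * growth                                 # fraction of sink becomes structure
        return [dcarbon, dbiomass]

    sol = solve_ivp(tree, (0.0, 20.0), [0.1, 0.1])
    print("final carbon pool and biomass:", sol.y[:, -1])
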
Work Environment Part of a small team: 2 PhD students and 2 post-docs. Flexible hours. The student can work remotely for some, but not all, of the time.
References Friend, A.D., Eckes-Shephard, A.H. and Tupker, Q., 2022. Wood structure explained by complex spatial source-sink interactions. Nature Communications, v. 13, p.7824-. doi:10.1038/s41467-022-35451-7.
Friend, A.D., Eckes-Shephard, A., Fonti, P., Rademacher, T.T., Rathgeber, C., Richardson, A.D. and Turton, R.H., 2019. On the need to consider wood formation processes in global vegetation models and a suggested approach. Annals of Forest Science, doi:10.1007/s13595-019-0819-x.
Eckes-Shephard, A.H., Ljungqvist, F.C., Drew, D.M., Rathgeber, C.B.K. and Friend, A.D., 2022. Wood Formation Modeling – A Research Review and Future Perspectives. Frontiers in Plant Science, v. 13, p.837648-. doi:10.3389/fpls.2022.837648
Prerequisite Skills Mathematical Analysis, Simulation, Predictive Modelling
Other Skills Used in the Project Data Visualization
Acceptable Programming Languages Python, MATLAB, C++, Fortran

 

Time series models for forecasting health service demand driven by weather (temperature, precipitation) extremes and variations.

Project Title Time series models for forecasting health service demand driven by weather (temperature, precipitation) extremes and variations.
Keywords Time series models, state space modelling, Kalman filter, forecasting, leading indicators, weather, health
Project Listed 1 February 2024
Project Status Filled
Contact Name Paul Kattuman
Contact Email p.kattuman@jbs.cam.ac.uk
Company/Lab/Department Judge Business School
Address Trumpington Street
Project Duration 8 weeks: July, August 2024
Project Open to Undergraduates, Master's (Part III) students
Background Information The CMP project is to be nested within a larger project which is focussed on developing a dynamic statistical model that can accurately forecast the impact of extreme weather events on health service demand. Extreme weather is increasingly observed across the world. The objective of the larger project is the development of an early warning system that enables timely mitigation measures to smooth over weather-induced stresses upon the health care system.
Project Description The leading indicator model variant that serves the larger project objective is an extension of the dynamic growth curve model (Harvey and Kattuman, 2020). The use of leading indicators is aimed at improving forecasts, particularly for daily frequency data, by employing dynamic lag structures between the leading (weather) and lagging (healthcare demand) series. The programme of work for the student will involve coding model refinements building on existing R code, examining model adequacy by running the model to generate forecasts for hold-out samples, and finally using the model for out-of-sample forecasts.
Work Environment Part of a team. Others involved are Prof. Andrew Harvey, Prof. Stefan Scholtes, Dr. Michael Ashby and Dr. Hua Lu.
References i) Harvey, A. and Kattuman, P. (2020). ‘Time series models based on growth curves with applications to forecasting coronavirus’, Harvard Data Science Review, Special issue 1— COVID-19. https://doi.org/10.1162/99608f92.828f40de
ii) Harvey, A., Kattuman, P. and Thamotheram, C. (2021). ‘Tracking the mutant: Forecasting and nowcasting COVID-19 in the UK in 2021’, National Institute Economic Review, 256, pp. 110–26. https://doi.org/10.1017/nie.2021.12
iii) Harvey, A. and Kattuman, P. (2021). ‘Farewell to R: time-series models for tracking and forecasting epidemics’, Journal of the Royal Society Interface, 18, 210179. https://doi.org/10.1098/rsif.2021.0179
Prerequisite Skills Statistics, Time series models
Other Skills Used in the Project Predictive Modelling
Acceptable Programming Languages R

 

Statistical Properties of Speech: Guiding the Generation of Synthetic Speech Corpora

Project Title Statistical Properties of Speech: Guiding the Generation of Synthetic Speech Corpora
Keywords speech, hearing, acoustics, phonetics, signal processing
Project Listed 8 February 2024
Project Status Filled
Contact Name Lidea Shahidi
Contact Email lidea.shahidi@mrc-cbu.cam.ac.uk
Company/Lab/Department MRC Cognition and Brain Sciences Unit, University of Cambridge
Address 15 Chaucer Road, Cambridge, CB2 7EF
Project Duration July 8 to August 30, but we are flexible if an alternative is preferable
Project Open to Undergraduates, Master's (Part III) students
Background Information

Scientists use recordings of speech for many purposes, such as improving hearing aids, diagnosing hearing loss, and understanding how the brain processes speech more generally. Although several collections of speech material, referred to as speech corpora, are available, they are limited in the type, diversity and number of utterances available, leading to issues with test familiarisation, and limiting the type of tests and the number of conditions that can be tested. Further, the existing speech corpora are limited in their diversity of speakers and semantic content.

To address these limitations, we are developing a pipeline to rapidly generate, calibrate, and evaluate speech corpora using modern technologies, such as text generation with large language models, text-to-speech algorithms, and large-scale online testing.

Project Description

To facilitate the development of a pipeline for speech corpus development, this project will pursue the characterisation of speech material from various speech corpora. The tools developed during this project will be used during the design, validation, and implementation of the to-be-generated speech corpus. The student will develop tools for characterising the acoustic, phonetic, and semantic content of existing speech corpora using techniques from signal processing, such as envelope extraction and modulation frequency analyses, and machine learning, such as topic modelling, automatic speech recognition, and speech segmentation. The prevalence of the speech features will then be compared within and across speech corpora using statistical methods. A project outcome will be considered successful if tools are developed to successfully extract features from at least two levels of linguistic processing (e.g. acoustic, phonetic, or semantic).
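
As a small, illustrative example of one of these signal-processing steps (the waveform here is synthetic and the cut-off frequency an arbitrary choice), the temporal envelope of a speech signal can be extracted via the Hilbert transform and then low-pass filtered to keep the slow amplitude modulations:

    # Envelope extraction sketch: analytic-signal magnitude followed by smoothing.
    import numpy as np
    from scipy.signal import hilbert, butter, filtfilt

    def temporal_envelope(waveform, fs, cutoff_hz=30.0):
        envelope = np.abs(hilbert(waveform))                  # analytic-signal magnitude
        b, a = butter(4, cutoff_hz / (fs / 2), btype="low")   # low-pass at cutoff_hz
        return filtfilt(b, a, envelope)

    fs = 16_000
    t = np.arange(0, 1.0, 1 / fs)
    toy_speech = np.sin(2 * np.pi * 200 * t) * (1 + 0.8 * np.sin(2 * np.pi * 4 * t))
    env = temporal_envelope(toy_speech, fs)    # for real corpora, load recordings instead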

The student will then have the option of implementing a sampling algorithm to select a unique subset of sentences from a large database of sentences, where the statistics of the subset match the statistics of the overall database. This algorithm will facilitate the composition of sentences into dynamically generated lists, enabling the reuse of speech material for which thousands of sentences are available.

Work Environment The student will be part of the Deep Hearing Lab based at the MRC Cognition and Brain Sciences Unit and will work with primary supervisor Dr. Lidea Shahidi and secondary supervisor Dr. Tobias Goehring. The student will benefit from working within the Cambridge Hearing Group, a world-leading and vibrant research network in Cambridge. They will have regular meetings with Dr Lidea Shahidi, as well as participate in weekly group meetings. To receive the most benefit from the experience, the student would ideally complete the majority of their work in-person at the MRC Cognition and Brain Sciences Unit, although some remote work will be possible. Work hours are flexible, but are expected to overlap with meeting times and normal working hours.
Prerequisite Skills Statistics
Other Skills Used in the Project Probability/Markov Chains, Data Visualization, Machine Learning
Acceptable Programming Languages Python, MATLAB, C++

 

Communicating mathematics – developing comprehensive and targeted coverage on Plus (plus.maths.org)

Project Title Communicating mathematics – developing comprehensive and targeted coverage on Plus (plus.maths.org)
Keywords Communication, public engagement, mathematics, statistics, data science
Project Listed 12 February 2024
Project Status Filled
Contact Name Rachel Thomas and Marianne Freiberger
Contact Email rgt24@cam.ac.uk, mf344@cam.ac.uk
Company/Lab/Department Plus (plus.maths.org), Millennium Mathematics Project, DAMTP
Address Centre for Mathematical Sciences, Wilberforce Road, Cambridge, CB3 0WA
Project Duration 8 weeks full time
Project Open to Master's (Part III) students
Background Information

The mathematical sciences are becoming ever more visible as key tools in understanding the world we live in and addressing societal and individual challenges — from artificial intelligence and other advances in technology, to climate change and public and individual health. At the same time, mathematics remains one of the hardest fields to access for people who are outside the mathematics community. Many different audiences might want or need to engage with mathematics and related subjects: researchers from other fields, teachers and potential students, policy makers, the mainstream media, "users" of mathematics in industry and science, and the interested public. To enable these wider audiences to engage with the mathematical sciences, in particular with current research, translations are needed that are unbiased and clear, while retaining mathematical and scientific accuracy and rigour.

Plus provides a gateway to mathematics and related sciences, in particular to current research in these fields, for non-expert audiences through articles, podcasts and videos that are freely accessible on our website. The content ranges from basic explainers of particular concepts to in-depth explorations of particular areas or applications, and is produced by the Editors in direct collaboration with researchers.

Plus is part of the Millennium Mathematics Project (mmp.maths.org), directed by Professor Julia Gog. As well as providing communications expertise to the University of Cambridge's Maths Faculty, we also work directly with different research groups and organisations, including ongoing collaborations with the Isaac Newton Institute, the JUNIPER network and the Maths4DL research project. Previous collaborations have included the Stephen Hawking Centre for Theoretical Cosmology and Discovery+, the Cantab Capital Institute for the Mathematics of Information, the Cambridge Mathematics of Information in Healthcare Hub, among others.

Project Description

This project would help develop a change in direction for content development and dissemination on Plus. The output would be a collection of resources about a specific topic, providing a more accessible, comprehensive and tailored experience for our users.

The project could be tailored to the mathematical and communications interests and strengths of the student, but would focus on developing and curating coverage of a specific topic, targeting one of our key audiences (students and teachers, media and policy makers, curious public) for publication on Plus. Drawing on their own knowledge, interest and perspective on the area of maths, the student would bring together existing content from plus.maths.org, as well as have the opportunity to produce new content (including articles, podcasts and videos) to enhance the coverage to bring greater accessibility, depth or breadth. They would work with the editors to develop this topic coverage, and with the editors and website developer to present the content on plus.maths.org. We would also encourage any work on wider dissemination of the material through social media channels.

The aim would be for the student to develop a collection of resources published on Plus that they could then point to as evidence of their work with us. They would develop their skills in communicating with different audiences by curating and developing material at a range of levels. They will gain experience of working with editors, planning content, writing content, online publishing, website presentation and user experience.

Work Environment

The student would be working with the Plus editors, Marianne Freiberger and Rachel Thomas. They will be part of the broader team of the Millennium Mathematics Project, meeting and interacting with the team when they are at the CMS.

The student will work in a hybrid pattern, with 1-2 days a week at the CMS and the rest of their time remote, with regular communication with the Plus editors throughout the week.

References

Here are some examples of collections of material on plus.maths.org covering different areas of mathematics.

Fermat's last theorem (https://plus.maths.org/content/fermat) – This collection of articles, podcasts and videos was developed to celebrate the 30th anniversary of the announcement of this proof. The content explores this theorem at a variety of levels, from personal accounts and general descriptions of the work suitable for a non-mathematical audience, to new mathematical research by a Fields medallist.

Disease modelling for beginners (https://plus.maths.org/content/epidemiology-beginners) – This collection of articles looks at some basic concepts in epidemiology to help wider audiences understand this important field, and provides links to further explore this area. This collection is aimed at any non-expert audience, including journalists, policy makers, and secondary school students and their teachers.

Telescope topology (https://plus.maths.org/content/telescope-topology) – This collection explores a recent breakthrough in the field of topology, announced in 2023. The articles range from brief explainers of basic concepts to an in depth exploration of the result that would be of interest to mathematicians from outside the area. It is accompanied by a podcast with the mathematicians involved.

Prerequisite Skills Enthusiasm for communicating science to non-expert audiences in any format;
Excellent written communication skills;
Specialist interest in any area of mathematics (any is suitable, but within pure maths is particularly welcome)
Other Skills Used in the Project  
Acceptable Programming Languages None Required

 

Comparative Analysis of AI accelerators with standardised Machine Learning benchmarks

Project Title Comparative Analysis of AI accelerators with standardised Machine Learning benchmarks
Keywords High Performance Computing, Application-based Benchmarking, Comparative Systems Analysis
Project Listed 15 February 2024
Project Status Filled
Contact Name Dr Yiannos Stathopoulos
Contact Email yas23@cam.ac.uk
Company/Lab/Department Cambridge Open Zettascale Lab, University Information Services
Address Roger Needham Building, 7 J J Thomson Avenue, Cambridge, CB3 0RB
Project Duration 9 weeks, 1st July - 30th August, 2024
Project Open to Undergraduates, Master's (Part III) students
Background Information

Designers and administrators of High Performance Computing (HPC) systems rely on quantifiable measures of performance to evaluate the evolution of deployed systems and to identify requirements for next-generation facilities and system installations. Performance metrics for computational power, such as floating-point operations per second (FLOPs) and efficiency (performance per Watt) are key variables that guide the decision-making process.

Quantifying these variables partially relies on the availability of standardised benchmark tools, which simulate the operational setting of HPC systems using industry-standard computational workloads and data-sets. Wide adoption of these standardised benchmarks allows reliable performance comparisons between different system configurations across established and emerging AI workloads. As a result, standardised benchmarks are useful for informing decision makers about the operational risks (e.g., relating to energy requirements) of HPC systems and their ability to meet emerging workloads with respect to computational capacity.

MLPerf [1][2][3] is a benchmark suite specifically designed for AI by a consortium of AI researchers, industry practitioners and hardware vendors. MLPerf is able to quantify the performance across many facets of an AI system, including compute and I/O, through a set of standardised AI-related data-sets and compute tasks implemented in PyTorch and Tensorflow.

Announced in November 2023 by Prime Minister Rishi Sunak and built through a collaboration between Cambridge's Research Computing Services, Intel and Dell Technologies, Dawn is the UK's fastest AI supercomputer. A key component of Dawn's AI processing power is Intel's state-of-the-art Sapphire Rapids CPUs and Intel Data Center GPU Max 1550 AI accelerators. The objective of this project is to investigate the performance of these state-of-the-art components using MLPerf. Outcomes of this project will be key input to decision-making at the Zettascale Lab.

Project Description

This project is being proposed and run by the Cambridge Open Zettascale Lab (COZL) who represent the research and development function of the University of Cambridge Research Computing Services (RCS). The lab's role is to explore the broad technological and regulatory horizon in the HPC & AI sector to enable systems and services at Cambridge (and the UK) to remain at the cutting edge of scientific computing.

The student will undertake a short literature review surrounding evaluative performance benchmarking in HPC with emphasis on MLPerf. The student will also receive access, orientation and time to become familiar with the systems to be benchmarked.

First, the student will gain an understanding of the inner workings of MLPerf and carry out a baseline benchmark run on NVIDIA A100 machine learning GPUs. Subsequently, the student will transfer the experience and knowledge of MLPerf gained from the baseline run to quantify the performance of the state-of-the-art Intel Sapphire Rapids CPUs and Max 1550 GPUs hosted at the Zettascale Lab. This will lead into co-designing experiments for comparative analysis of the lab's AI systems. The student will have significant input into the proposed work and will be supported by specialised staff at the Open Zettascale Lab. The student will also be given the opportunity to develop the project in interesting directions, such as deploying their benchmarking methodology at scale using automated and semi-automated tools.
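
To illustrate the comparative-analysis step (with a made-up table layout and made-up numbers, not MLPerf's actual output format), results collected from benchmark runs on different systems can be summarised along both raw-throughput and efficiency axes:

    # Illustration only: summarising benchmark results across systems and workloads.
    import pandas as pd

    results = pd.DataFrame({
        "system":   ["A100", "A100", "Max 1550", "Max 1550"],
        "workload": ["image_classification", "language_model",
                     "image_classification", "language_model"],
        "samples_per_second": [2500.0, 380.0, 2300.0, 400.0],   # placeholder numbers
        "power_watts":        [400.0, 400.0, 450.0, 450.0],     # placeholder numbers
    })
    results["samples_per_joule"] = results["samples_per_second"] / results["power_watts"]
    print(results.pivot(index="workload", columns="system",
                        values=["samples_per_second", "samples_per_joule"]))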

Expected outputs from this project are (i) a short technical report and presentation with a target audience the members of the Cambridge Open Zettascale Lab (including stakeholders from Intel and Dell EMC), (ii) digital assets such as, but not limited to, the experimental output dataset, scripting and code to allow further and/or independent analysis and reproducibility.

There is scope for extra-project activities such as touring data hall 1 at the West Cambridge Data Centre (WCDC) which houses all parts of CSD3 as well as engaging with our internal seminar series and knowledge sharing opportunities.

Work Environment The project will be conducted individually with regular supervision and access to specialists upon request. The workplace is a flexible hybrid environment; office space is available 5 days a week at the Roger Needham Building on the West Cambridge site, but there is no expectation to attend every day (unless this is the student's preference). Supervisions will take place both in person and remotely. The student will be expected to work a 37.5-hour week, but not strictly 7.5 hours each weekday; precise working arrangements will be agreed at the start of the internship.
References [1] MLPerf: https://mlcommons.org/benchmarks/
[2] Mattson et al. MLPerf Training Benchmark. Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. https://arxiv.org/abs/1910.01500
[3] Reddi et al. MLPerf Inference Benchmark. ISCA '20: Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture. https://arxiv.org/abs/1911.02549
Prerequisite Skills Statistics, Basic Unix skills including scripting, Python programming
Other Skills Used in the Project Data Visualization, Some familiarity with PyTorch and/or Tensorflow
Acceptable Programming Languages Python, Bash

 

A data-driven approach to carbon footprint reporting for High Performance Computing (HPC) at the University of Cambridge

Project Title A data-driven approach to carbon footprint reporting for High Performance Computing (HPC) at the University of Cambridge
Keywords High Performance Computing and AI, Net-zero, Data Science, Data Visualization, Carbon Footprint
Project Listed 15 February 2024
Project Status Filled
Contact Name Dominic Friend
Contact Email dmf38@cam.ac.uk
Company/Lab/Department Cambridge Open Zettascale Lab, University Information Services
Address Roger Needham Building, 7 J J Thomson Avenue, Cambridge, CB3 0RB
Project Duration Full time (37.5 hours per week), 9 weeks: 1st July - 30th August
Project Open to Undergraduates, Master's (Part III) students
Background Information

Annual estimates of the global carbon emissions created by High Performance Computing (HPC) stand at around 2 million metric tonnes [1]. This is likely to be a conservative estimate, at best, as it relies solely on data relating to publicly declared systems as listed in the Top500 [2]. But there are considerably more systems in this space, many hundreds, if not thousands, of smaller systems which do not make the list, alongside those held privately or by government agencies. It may therefore not be unreasonable to double, or even triple, this estimate to get closer to the real impact, and even that accounts only for the carbon emissions due to electricity usage; no estimates of embodied carbon (the emissions associated with manufacturing, transportation and disposal of equipment) are included.

Closer to home, funders of UK-based HPC facilities are closing in on their net-zero emissions pledges. UK Research and Innovation (UKRI) is one such funder and is a primary source of direct and indirect funding for the systems hosted at the University of Cambridge. UKRI have set their net-zero target at 2040[3] and it is therefore important that the services at Cambridge, such as Dawn and the Cambridge Service for Data Driven Discovery (CSD3), both of which are hosted by Research Computing Services (RCS), are able to measure and report their carbon footprint. Although 2040 appears far off into the future, there is a lot of work to be done and carbon footprint measurement and reporting is only the first step towards a net-zero target.

Furthermore, there is a growing expectation within the community of HPC systems and services providers that demonstrating a carbon footprint monitoring and reporting capability will become an essential part of being eligible for future funding from UKRI and its partners. Therefore developing this capability is essential for the long-term future of facilities such as those provided at Cambridge. The Cambridge Open Zettascale Lab (COZL)[4] has recently deployed software infrastructure enabling power consumption monitoring across HPC systems at Cambridge. It is on top of this infrastructure that the capability to analyse, optimise, predict, account and communicate the carbon footprint of Cambridge HPC systems will be developed.

Project Description

This project is being proposed and run by the Cambridge Open Zettascale Lab (COZL) who represent the research and development function of the University of Cambridge Research Computing Services (RCS). The lab's role is to explore the broad technological and regulatory horizon in the HPC & AI sector to enable systems and services at Cambridge (and the UK) to remain at the cutting edge of scientific computing.

The student will undertake a short literature review to gain a foothold in the broad discussion around carbon footprint monitoring and, in particular, what is currently (or not) being done in the context of HPC in the UK and abroad. They will receive access, orientation and time to become familiar with the available HPC and infrastructure monitoring systems hosted at Cambridge.

This will lead into co-designing a data-driven approach to exploiting the existing HPC systems power consumption data collected by the infrastructure monitoring system with the goal of exploring one or more ways to analyse, optimise, predict, account and communicate the carbon footprint of Cambridge HPC systems. There is scope for the student to propose new or improved methods as well as the inclusion of relevant external datasets to improve the accuracy of carbon footprint estimates, such as an approach handling embodied carbon. The student will have significant input and ownership of their contribution to the wider work being conducted in this area, and it is not our expectation that the student is able to solve the problem in its entirety, but to contribute in a meaningful way to the development of the future solution.
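
The core accounting step can be illustrated very simply (all numbers below are placeholders): metered power draw is converted to energy and multiplied by a grid carbon-intensity series to give an operational emissions estimate. Embodied carbon, as noted above, would need a separate treatment.

    # Minimal sketch: hourly mean power (W) times grid carbon intensity (gCO2e/kWh).
    import pandas as pd

    power_w = pd.Series([320.0, 410.0, 390.0])             # mean node power for three hours
    carbon_intensity = pd.Series([180.0, 150.0, 210.0])    # gCO2e per kWh for the same hours

    energy_kwh = power_w / 1000.0                          # one hour at P watts is P/1000 kWh
    emissions_g = (energy_kwh * carbon_intensity).sum()
    print(f"estimated operational emissions: {emissions_g:.0f} gCO2e")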

Expected outputs from this project are a short technical report with a target audience of members of the Cambridge Open Zettascale Lab including stakeholders from Intel and Dell EMC, along with digital assets such as any visualisations, scripts and code that is created to allow future work, independent analysis and reproducibility. There is scope for extra-project activities such as touring data hall 1 at the West Cambridge Data Centre (WCDC) which houses Dawn and CSD3 as well as engaging with our internal seminar series and knowledge sharing opportunities.

Work Environment The project will be conducted individually with regular supervision and access to specialists upon request. The workplace is a flexible hybrid environment; office space is available 5 days a week at the Roger Needham Building on the West Cambridge site, but there is no expectation to attend every day (unless this is the student's preference). Supervisions will take place both in person and remotely. The student will be expected to work a 37.5-hour week, but not strictly 7.5 hours each weekday; precise working arrangements will be agreed at the start of the placement.
References [1] https://www.hpcwire.com/2021/12/09/with-a-carbon-footprint-like-hpcs-it-matters-when-and-where-you-step/
[2] https://www.top500.org/
[3] https://www.ukri.org/wp-content/uploads/2020/10/UKRI-050920-SustainabilityStrategy.pdf
[4] https://www.zettascale.hpc.cam.ac.uk/
Prerequisite Skills  
Other Skills Used in the Project Predictive Modelling, Data Visualization, App Building, Basic Unix including scripting
Acceptable Programming Languages Python

 

Discovery of novel biomarkers using unsupervised statistical learning

Project Title Discovery of novel biomarkers using unsupervised statistical learning
Keywords unsupervised statistical learning, (gaussian) mixture models
Project Listed 5 March 2024
Project Status Filled
Contact Name Solon Karapanagiotis
Contact Email sk921@cam.ac.uk
Company/Lab/Department MRC Biostatistics Unit
Address  
Project Duration 8 weeks
Project Open to Undergraduates, Master's (Part III) students
Background Information

Cell-free DNA (cfDNA) is a promising biomarker in oncology. Its main advantage lies in the ability to non-invasively extract clinically relevant information from a blood sample. Most DNA is inside a cell’s nucleus. As a tumour grows, cells die and are replaced by new ones. The dead cells get broken down and their contents, including DNA, are released into the bloodstream. New technologies now allow us to quantify this circulating DNA from the tumour. This can be helpful in many cases, such as detecting and diagnosing cancer, guiding treatment, monitoring treatment response and periods with no symptoms (remission of cancer).

At present, however, how best to analyse cfDNA data is still under debate. Most existing statistical methods suffer from high noise levels, making precise quantification of tumour burden difficult.

Project Description

We hypothesise that cfDNA fragmentation could serve as a biomarker for tumour burden quantification. This is motivated by the distinct fragmentation patterns of healthy and tumour-derived cfDNA. For instance, compared with healthy individuals, patients with cancer show distinct genome-wide differences, with increases and decreases in fragment sizes at different regions (https://www.nature.com/articles/s41586-019-1272-6#Sec1).

This project will explore different statistical approaches to detect abnormalities in cfDNA through genome-wide analysis of fragmentation patterns. In particular, the intern will fit a Gaussian mixture model to the dataset, study its performance, and explore different configurations (e.g., estimation methods) and methodological extensions.
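
As a minimal illustration of the modelling idea, the sketch below fits a two-component Gaussian mixture to simulated fragment lengths using scikit-learn (whose documentation is linked under References); the project itself would be carried out in R, and the fragment-length values and proportions here are invented purely for illustration.

# Minimal sketch: two-component Gaussian mixture on simulated cfDNA
# fragment lengths. All numbers below are synthetic, not real data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Pretend ~90% of fragments come from a 'healthy' mode around 167 bp
# and ~10% from a shorter, tumour-derived mode around 145 bp.
fragments = np.concatenate([
    rng.normal(167, 10, size=9000),
    rng.normal(145, 12, size=1000),
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(fragments)
print("component means:", gmm.means_.ravel())
print("mixing weights:", gmm.weights_)
# Posterior probability of each component for the first few fragments:
print(gmm.predict_proba(fragments[:5]))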

Work Environment The student will be part of a lab at the MRC Biostatistics Unit (Biomedical Campus). The student will have the opportunity to interact with other lab members (https://www.mrc-bsu.cam.ac.uk/staff/oscar-rueda/) and other members of the department (https://www.mrc-bsu.cam.ac.uk/), and to attend talks and seminars on various topics.
References

An introduction to the application setting:
https://pubmed.ncbi.nlm.nih.gov/30380390/

An introduction to Gaussian mixture models:
https://scikit-learn.org/stable/modules/mixture.html#:~:text=A%20Gaussian%20mixture%20model%20is,Gaussian%20distributions%20with%20unknown%20parameters.

Prerequisite Skills Statistics, Probability/Markov Chains
Other Skills Used in the Project Statistics, Probability/Markov Chains, Simulation
Acceptable Programming Languages R

 

Developing a Frameshift Peptide Database

Project Title Developing a Frameshift Peptide Database
Keywords Frameshift mutation, bioinformatic, database, immunogenic, cancer
Project Listed 5 March 2024
Project Status Filled
Contact Name Prof. Liz Soilleux
Contact Email ejs17@cam.ac.uk
Company/Lab/Department Department of Pathology, University of Cambridge
Address Division of Cellular and Molecular Pathology
University of Cambridge Department of Pathology
Addenbrooke's Hospital
Cambridge Biomedical Campus
CB2 2QQ
Project Duration Full time; 8 weeks between late June and September (flexible dates)
Project Open to Undergraduates, Master's (Part III) students
Background Information

Lynch syndrome (LS) is a genetic condition with a prevalence of at least 1 in 200 that predisposes to various early-onset (<45 years) cancers, most frequently of the large intestine and uterus. Identification of LS in a family leads to testing of family members, with subsequent screening of those at risk by regular colonoscopy. For women, prophylactic hysterectomy and oophorectomy are offered. Although these measures reduce the risk of premature death from large intestinal and uterine cancer, LS patients continue to die prematurely from cancers that present late and cannot be screened for, e.g., cancers of the small bowel, urinary and hepatobiliary tracts, and brain. Furthermore, these are extremely invasive approaches to surveillance/prevention, and a non-invasive screening blood test with high sensitivity and specificity would be highly desirable. We are investigating whether we can detect immune responses to the altered proteins of Lynch syndrome tumours.

Lynch syndrome is caused by inherited variants of so-called mismatch-repair genes. Lack of mismatch repair function causes losses/gains of 1 to 2 DNA bases. Because 3 bases encode each amino acid (protein building block), this changes the “reading frame” of the DNA and means that all of the protein sequence following the mutation is completely different from the sequence preceding it. Accumulation of these losses and gains of 1 to 2 DNA bases is known as ‘microsatellite instability’. The resulting mutant proteins are known as frameshift proteins. Because the body has never seen these frameshift proteins before, they are recognised as ‘foreign’ by the immune system and so trigger a strong immune response.
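
A toy illustration of this mechanism is sketched below, assuming Biopython is available; the sequence is invented for illustration and is not a real gene.

# Toy illustration of a frameshift: deleting one base shifts every codon
# downstream, so the translated protein diverges after the deletion point.
# The sequence is invented; requires Biopython (pip install biopython).
from Bio.Seq import Seq

normal = Seq("ATGGCTGAAAAACGTCTTGGA")   # ATG GCT GAA AAA CGT CTT GGA
# A 1-bp deletion (the 5th base), as seen with loss of mismatch repair.
mutant = normal[:4] + normal[5:]
mutant = mutant[: len(mutant) - len(mutant) % 3]   # drop trailing partial codon

print(normal.translate())   # MAEKRLG
print(mutant.translate())   # MVKNVL -- every codon after the deletion is shifted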

Project Description

Two PhD students in the Soilleux group have catalogued the frameshift proteins identified in Lynch syndrome/microsatellite-instability large intestinal and uterine cancers, and we now wish to make a searchable online database containing this DNA and protein sequence data (a minimal schema sketch follows the lists below), including:
1. Mutation description
a. nucleotide change
b. position in the DNA sequence (affected codon)
c. position in the protein sequence
2. Mutated, and corresponding unmutated, DNA sequence
3. Mutated, and corresponding unmutated, protein sequence
4. How mutation was identified
a. link to publication and/ or sequence data repository
b. brief description of the method (DNA or RNA sequencing) and how and where the mutation was identified

We wish to make the database searchable by:
1. Gene name
2. Gene database accession numbers
3. DNA mutation type (deletion/ insertion and numbers of nucleotides)
4. Length of frameshift peptide generated by mutation
5. Cancer type (e.g., colorectal, uterine etc)
6. Title of study
7. Year of study
8. Type of study: case/cohort
9. Frequency of mutation in cohort study (needs a range)
10. Cohort size (needs a range)
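
One possible minimal backing schema covering the fields above is sketched below using SQLite from Python's standard library; the table and column names are placeholders rather than an agreed design, and the eventual database could equally use a different backend.

# Minimal schema sketch for the frameshift peptide database.
# Table and column names are placeholders, not an agreed design.
import sqlite3

conn = sqlite3.connect("frameshift_peptides.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS mutation (
    id                 INTEGER PRIMARY KEY,
    gene_name          TEXT NOT NULL,
    gene_accession     TEXT,
    mutation_type      TEXT,      -- 'deletion' or 'insertion'
    n_nucleotides      INTEGER,   -- number of bases lost/gained
    affected_codon     INTEGER,   -- position in the DNA sequence
    protein_position   INTEGER,   -- position in the protein sequence
    wt_dna             TEXT,      -- unmutated DNA sequence
    mut_dna            TEXT,      -- mutated DNA sequence
    wt_protein         TEXT,      -- unmutated protein sequence
    mut_protein        TEXT,      -- mutated protein sequence
    frameshift_length  INTEGER,   -- length of frameshift peptide
    cancer_type        TEXT,      -- e.g. 'colorectal', 'uterine'
    study_title        TEXT,
    study_year         INTEGER,
    study_type         TEXT,      -- 'case' or 'cohort'
    cohort_size        INTEGER,
    mutation_frequency REAL,      -- frequency in the cohort
    detection_method   TEXT,      -- 'DNA sequencing' or 'RNA sequencing'
    publication_link   TEXT
);
""")

# Example search: frameshift peptides longer than 20 residues in colorectal cancer.
rows = conn.execute(
    "SELECT gene_name, frameshift_length FROM mutation "
    "WHERE cancer_type = ? AND frameshift_length > ?",
    ("colorectal", 20),
).fetchall()
conn.close()

A simple web front end (e.g. in R or Python) would then expose each of these columns as a searchable, sortable, filterable field, with a form for biologists to add new records.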

Many software tools already exist for creating the constituent parts of the database and the Soilleux group has an excellent bioinformatic/ mathematician collaborator (Dr Anna Fowler, University of Liverpool), who will assist by advising on the more technical aspects of the project.

We are looking for a mathematician with a good grasp of coding and considerable initiative to seek out relevant tools, many of which already exist, in order to develop such a database. We anticipate writing a publication describing the database, its contents and its functionality, and the CMP student would, of course, be a co-author on such a publication.

The person specification is as follows:
- Experience in coding (R and/ or Python) and webpage design (sufficient to make a reasonable user interface) 
- Design of an interface in which each field can be searched, sorted, and filtered
- Production of a database that biologists with less coding experience could add data to in the future
- Some understanding of DNA and protein sequences (but this can be learnt rapidly)

Work Environment Working in a team with PI and 3 PhD students, plus multiple collaborators in Cambridge and beyond.
References https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8238217/
Prerequisite Skills Database Queries, Data Visualization, Some understanding of DNA and protein sequences
Other Skills Used in the Project Data Visualization, Database building, with some user-friendly website design; bioinformatic analysis of DNA and protein sequences.
Acceptable Programming Languages Python, R, C++

 

Machine-learning enhanced cosmological parameter estimation

Project Title Machine-learning enhanced cosmological parameter estimation
Keywords Cosmology; Data Science; Machine Learning; Bayesian inference; Artificial Intelligence;
Project Listed 12 March 2024
Project Status Open
Contact Name Will Handley
Contact Email wh260@cam.ac.uk
Company/Lab/Department Physics (Astrophysics)
Address Kavli Institute for Cosmology/Battcock Center
Project Duration 10 weeks
Project Open to Undergraduates, Master's (Part III) students
Background Information  
Project Description

Cosmological parameter estimation is the inference task of determining the parameters of the universe from observations. This is a computationally and statistically challenging task, with the state of the art now involving a blend of Bayesian and machine-learning techniques.

This project will build on existing work in the group [1], exploring the extent to which neural ratio estimation [2] can be used as an alternative to density estimation. Students will require no more cosmology than Part II Relativity, and no more programming than Part II Python notebook level. Students will be expected to interact with the rest of the group in weekly group and individual meetings, and are welcome to join group social events.
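
As a toy illustration of the idea behind neural ratio estimation [2] (not the group's pipeline), the sketch below trains a classifier to distinguish jointly simulated (theta, x) pairs from shuffled pairs on a one-dimensional Gaussian model; the classifier's logit then approximates the likelihood-to-evidence ratio used for parameter inference. The simulator and all settings are illustrative assumptions.

# Toy ratio-estimation sketch: a classifier separating joint (theta, x)
# pairs from shuffled pairs learns log p(x|theta) - log p(x) as its logit.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
n = 20_000

# Simulator: theta ~ N(0, 1), x | theta ~ N(theta, 0.5^2)
theta = rng.normal(0.0, 1.0, size=n)
x = rng.normal(theta, 0.5)

# Class 1: joint pairs; class 0: pairs with theta shuffled (marginal product).
joint = np.column_stack([theta, x])
marginal = np.column_stack([rng.permutation(theta), x])
X = np.vstack([joint, marginal])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Quadratic features let a linear classifier capture the Gaussian log-ratio
# (a small neural network would play this role in a realistic setting).
clf = make_pipeline(PolynomialFeatures(2), LogisticRegression(max_iter=1000))
clf.fit(X, y)

# The decision function approximates log r(theta, x) = log p(x|theta) - log p(x).
test = np.array([[1.0, 1.0], [1.0, -1.0]])
print(clf.decision_function(test))   # higher for the consistent (theta, x) pair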

Work Environment Working with other PhD students, summer students and postdocs https://www.willhandley.co.uk/students/
References [1] https://arxiv.org/abs/2207.11457
[2] https://arxiv.org/abs/2111.08030
Prerequisite Skills Statistics, Probability/Markov Chains
Other Skills Used in the Project Mathematical physics, Numerical Analysis
Acceptable Programming Languages Python