
Summer Research Programmes

 

2025 Industrial CMP projects

Below you will find the list of industrial CMP projects hosted by external companies (jump to list).  Click here to see the list of academic projects hosted by other departments and labs within the university.  

New projects will be added throughout Lent Term so check back regularly!

 

How to Apply

Unless alternative instructions are given in the project listing, to apply for a project you should send your CV to the contact provided along with a covering email which explains why you are interested in the project and why you think you would be a good fit.  

Need help preparing a CV or advice on how to write a good covering email? 

The Careers Service are there to help!  Their CV and applications guides are packed full of top tips and example CVs.  

Looking for advice on applying for CMP projects specifically?  Check out this advice from CMP Co-Founder and Cambridge Maths Alumnus James Bridgwater.  

Remember: it’s better to put the work into making fewer but stronger applications tailored to a specific project than firing off a very generic application for all projects – you won’t stand out with the latter approach!  

Please note that to participate in the CMP programme you must be a student in Part IB, Part II, or Part III of the Mathematical Tripos at Cambridge.  

 

Want to know more about a project before you apply? 

Come along to the CMP Lunchtime Seminar Series in February 2025 to hear the hosts give a short presentation about their project.  There will be an opportunity afterwards for you to chat informally with hosts about their projects. 

Alternatively (or as well!), you can reach out to the contact given in the project listing to ask questions. 

 

Industrial CMP Project Proposals for Summer 2025

 

Enabling Large Models for Edge AI with Disentanglement and Compositionality

Project Title Enabling Large Models for Edge AI with Disentanglement and Compositionality
Keywords Disentanglement, Compositionality, Edge AI, Model Optimization, Resource Efficiency
Project Listed 8 January 2025
Project Status Open
Contact Name Orange Gao
Contact Email orangez@amazon.com
Company/Lab/Department Amazon Lab126
Address One Station Square, Cambridge, CB1 2GA
Project Duration 8 weeks; full-time
Project Open to Master's (Part III) students
Background Information

Deploying large AI models on edge devices is a significant challenge due to their limited computational, memory, and energy resources. These constraints often necessitate trade-offs between model performance and efficiency, making it difficult to use cutting-edge AI technologies in applications like IoT, wearables, and real-time systems.

This project explores the intersection of disentanglement and compositionality, two promising concepts in AI research, to address these challenges:

Disentanglement focuses on isolating meaningful, task-specific features from complex data representations. By enhancing interpretability and generalization, disentanglement makes it possible to optimize models while preserving essential functionality.

Compositionality allows models to break down tasks into smaller, reusable components that can be recombined to address a variety of tasks. This modular approach facilitates scalability and adaptability, especially in resource-constrained environments.

By leveraging these principles, the project aims to make large AI models lightweight and efficient while retaining strong performance. This approach offers the potential to unlock new applications for AI on edge devices, where real-time performance, adaptability, and energy efficiency are critical.

Project Description

This project involves exploring and developing methods to enable large AI models to operate efficiently on edge devices by leveraging disentanglement and compositionality. The work is open-ended, allowing flexibility to adapt the later stages based on findings from initial experiments. The student will undertake the following key activities:

Feature Disentanglement

  • Implement and analyze disentanglement techniques like β-VAE, InfoGAN, and diffusion-based methods to extract meaningful task-specific features from complex datasets.
  • Evaluate these methods using mathematical tools such as latent space analysis, information theory metrics (e.g., KL divergence), and mutual information estimation.
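
As a rough illustration of the KL divergence term mentioned above, the sketch below computes the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior, and combines it with a reconstruction error in the β-VAE style. The function names and the default `beta=4.0` are illustrative assumptions, not part of the project specification.

```python
import math


def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def beta_vae_loss(reconstruction_error, mu, log_var, beta=4.0):
    """Reconstruction error plus a beta-weighted KL term (beta is an assumed value)."""
    return reconstruction_error + beta * gaussian_kl_to_standard_normal(mu, log_var)


# A posterior equal to the prior carries no KL penalty.
print(gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# Shifting one latent mean by 1 costs 0.5 nats, weighted here by beta.
print(beta_vae_loss(2.0, [1.0], [0.0], beta=4.0))
```

Setting `beta` above 1 penalises the KL term more heavily than in a standard VAE, which is the mechanism β-VAE uses to encourage disentangled latents.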

Model Pruning and Quantization

  • Use mathematical optimization methods to identify and remove redundant parameters, channels, or layers from large models.
  • Apply quantization techniques, involving numerical precision analysis and statistical error evaluation, to compress model size and reduce computational overhead.
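
To make the two ideas above concrete, here is a minimal sketch using plain Python lists. It uses magnitude pruning as a simple stand-in for the optimization-based pruning the project has in mind, and symmetric 8-bit uniform quantization with a worst-case error check. All function names are hypothetical.

```python
def magnitude_prune(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * fraction)
    smallest = set(sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k])
    return [0.0 if i in smallest else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric uniform quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to floats."""
    return [v * scale for v in q]


def max_abs_error(a, b):
    """Largest elementwise absolute difference between two lists."""
    return max(abs(x - y) for x, y in zip(a, b))


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q, scale = quantize_int8(weights)
# The rounding error per weight is bounded by half a quantization step.
print(max_abs_error(weights, dequantize(q, scale)), "<=", scale / 2)
print(magnitude_prune([0.9, -0.1, 0.05, -0.7], 0.5))
```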

Knowledge Distillation

  • Implement teacher-student learning paradigms, leveraging intermediate representations like logits or feature maps.
  • Use statistical and machine learning techniques to evaluate performance transfer and fidelity between the teacher and student models.
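
One common way to transfer knowledge through logits is to match temperature-softened output distributions. The sketch below assumes that formulation (KL divergence from teacher to student, scaled by the squared temperature); the project may use a different loss, and the names and default temperature are illustrative only.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature**2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


teacher = [3.0, 1.0, 0.2]
# A student that reproduces the teacher's logits incurs zero loss.
print(distillation_loss(teacher, teacher))
# A student with flat logits is penalised.
print(distillation_loss(teacher, [1.0, 1.0, 1.0]))
```

The same KL quantity can double as a simple fidelity measure between teacher and student outputs on held-out data.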

Compositional Representation Learning

  • Design and train modular, compositional models capable of combining smaller primitives for broader task applicability.
  • Explore combinatorics and graph-based algorithms for representing and evaluating the modular structure of tasks.

Edge Deployment Optimization

  • Adapt the optimized model for edge device constraints by integrating hardware-aware design principles such as depthwise convolutions and lightweight layers.
  • Utilize performance profiling to ensure real-time efficiency.
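
As a small worked example of why depthwise convolutions suit edge devices, the sketch below compares weight counts for a standard convolution and a depthwise separable one (a depthwise k×k convolution followed by a 1×1 pointwise convolution). Biases are ignored, and the layer sizes are arbitrary illustrative values.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution: one k x k x c_in filter per output channel."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out


standard = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
print(standard, separable, round(standard / separable, 2))  # 73728 8768 8.41
```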

 

Successful Outcome

A successful outcome would include:

  • An optimized model capable of efficient and accurate inference on an edge device.
  • A clear understanding of how disentanglement and compositionality enhance model generalization and scalability.
  • Quantitative performance improvements (e.g., reduced model size, lower latency) validated with metrics like computational cost, memory usage, and task accuracy.
  • A well-documented workflow and reproducible codebase.

 

How It’s Interesting/Useful

This project combines cutting-edge AI techniques with real-world application in edge computing. The outcomes can be impactful for industries like IoT, wearables, and personalized AI, where resource efficiency is critical. The modular approach ensures the work is extensible, allowing integration into diverse AI tasks.

 

Use of Mathematical Skills

Students will actively use mathematical skills in areas such as:

  • Optimization: Minimizing loss functions and pruning redundant parameters.
  • Probability and Statistics: Understanding distributions in disentanglement and analyzing quantization impacts.
  • Linear Algebra: Matrix manipulations for model compression and feature extraction.
  • Information Theory: Metrics like entropy and mutual information for disentanglement and knowledge distillation evaluation.
  • Combinatorics: Designing and analyzing compositional representations and modular task recombinations.

By the end of the project, the student will gain experience applying theoretical mathematical concepts to practical problems in AI and edge computing, contributing to a rapidly evolving field.

Work Environment

The student will work independently on this project, with myself serving as the industrial supervisor. I will provide regular guidance and mentorship, helping the student define goals, troubleshoot challenges, and refine their approach throughout the project. Although the student will primarily work on their own, I will be readily available for discussions and feedback through scheduled meetings and as needed via email or video calls.

The student will have the flexibility to work remotely, allowing them to structure their schedule to maximize productivity. There are no fixed office or lab hours, but the student is encouraged to maintain consistent progress and attend periodic check-ins to review milestones and ensure alignment with the project goals.

Day-to-day, the student will engage in tasks such as implementing and testing machine learning models, analyzing results, and documenting findings. They will have access to tools, datasets, and resources necessary for the project, along with my guidance to navigate technical or conceptual challenges. This setup offers the student a hands-on, immersive experience while fostering independence and problem-solving skills.

References [1] beta-vae: Learning basic visual concepts with a constrained variational framework. ICLR 2017
[2] Infogan: Interpretable representation learning by information maximizing generative adversarial nets. NeurIPS 2016
[3] Wu, Cindy, et al. "What Mechanisms Does Knowledge Distillation Distill?." Proceedings of UniReps: the First Workshop on Unifying Representations in Neural Models. PMLR, 2024.
[4] Chen, H., Zhang, Y., Wang, X., Duan, X., Zhou, Y., & Zhu, W. (2023). Disenbooth: Disentangled parameter-efficient tuning for subject-driven text-to-image generation. ICLR 2024.
[5] Challenging common assumptions in the unsupervised learning of disentangled representations. ICML 2019
[6] Jin, Zeng et al. "Closed-Loop Unsupervised Representation Disentanglement with β-VAE Distillation and Diffusion Probabilistic Feedback." ECCV 2024
Prerequisite Skills Statistics, Probability/Markov Chains, Image processing, Mathematical Analysis
Other Skills Used in the Project Numerical Analysis, Mathematical Analysis, Simulation, Predictive Modelling
Acceptable Programming Languages Python

 

Accelerate CFD convergence with improved field initialisation and mixed precision solves

Project Title Accelerate CFD convergence with improved field initialisation and mixed precision solves
Keywords Simulation, Software Engineering, Numerical Analysis, Fluid Dynamics, Physics
Project Listed 8 January 2025
Project Status Open
Contact Name Laurence Cullen
Contact Email laurence@vanellus.tech
Company/Lab/Department Vanellus Technologies Ltd
Address Unit 6, The Courtyard, Sturton Street, Cambridge, CB1 2SN
Project Duration 8 weeks; full-time
Project Open to Master's (Part III) students
Background Information

In engineering, computational fluid dynamics (CFD) and thermodynamic simulations are an increasingly critical tool for designing performance-optimised systems. However, increasing design complexity and higher performance requirements are putting more pressure on current simulation tools. For many applications, current simulation tools are too slow, inaccurate or hard to use for effective design optimisation. Vanellus is developing a new GPU-based multiphysics simulation and optimisation engine in order to remove current bottlenecks on simulation usage.

At its core, CFD involves solving complex non-linear PDEs using numerical algorithms. Numerically solving non-linear PDEs almost always involves an iterative process, where an initial guess of the solution is gradually improved upon. A high-value research area is coming up with ways to improve your initial guess, so that it is closer to the true solution, therefore requiring fewer iterations to reach convergence. The challenge here is finding the balance between the quality of the initial guess and the amount of computing resources it takes to find it.

Project Description

As a small startup, we have a range of mathematical tasks to tackle with a flexible R&D roadmap, so it’s worth noting that depending on your research interests and our technical progress, we are open to adapting the project to suit our mutual needs.

What we would like you to do: 

  • Review background literature and required reading to get you up to speed on non-linear PDE solving.
  • In discussion with the team, decide on a promising and tractable research direction to focus on during the placement.
  • Identify required modifications to our code to test the approach.
  • Run experiments to evaluate the efficacy.

Some ideas for research directions we have found:

  • Using statistical/ML methods to directly deduce plausible initial guesses given simulation geometries and boundary conditions.
  • Running an initial cheap simulation in lower floating point precision, and then “upscaling” the result to high precision.
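
The second idea can be illustrated on a toy problem. The sketch below is not CFD and not Vanellus code: it solves a small linear system (a 1D Poisson-type problem with Gauss-Seidel sweeps), first running a fixed number of sweeps with every update rounded to float32, then using that result as the initial guess for a float64 solve. The float32 emulation via `struct`, the problem size and the number of cheap sweeps are all assumptions chosen for illustration.

```python
import struct


def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sweep(x, b, rounding=None):
    """One in-place Gauss-Seidel sweep for tridiag(-1, 2, -1) x = b."""
    n = len(x)
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        value = (b[i] + left + right) / 2.0
        x[i] = rounding(value) if rounding else value


def residual_norm(x, b):
    """Max-norm of the residual b - A x."""
    n = len(x)
    worst = 0.0
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        worst = max(worst, abs(b[i] - (2.0 * x[i] - left - right)))
    return worst


def solve(b, x0, tol=1e-10, max_sweeps=100_000):
    """Full-precision solve from initial guess x0; returns (solution, sweeps used)."""
    x = list(x0)
    sweeps = 0
    while residual_norm(x, b) > tol and sweeps < max_sweeps:
        sweep(x, b)
        sweeps += 1
    return x, sweeps


n = 20
b = [1.0] * n

# Baseline: full-precision solve starting from zero.
_, cold_sweeps = solve(b, [0.0] * n)

# Cheap stage: 300 sweeps with each update rounded to float32.
cheap = [0.0] * n
for _ in range(300):
    sweep(cheap, b, rounding=to_float32)

# "Upscale": continue in float64 from the low-precision result.
x, warm_sweeps = solve(b, cheap)
print("float64 sweeps from zero:", cold_sweeps)
print("float64 sweeps after low-precision start:", warm_sweeps)
```

Whether this pays off in practice depends on how much cheaper the low-precision sweeps really are on the target hardware, which is exactly the trade-off between guess quality and compute cost described in the background section.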

 

Successful outcome:

Improved non-linear PDE initialisation is a large open field, so we do not expect the problem to be fully solved during the time of the placement. Success for us will be if, at the project conclusion, we can have some preliminary results that either show the potential or disprove the usefulness of a particular numerical method.

 

How would it be interesting/useful?

This project will allow you to use your mathematical expertise in the context of an R&D-focused software engineering startup. In addition to improving your knowledge and skills in numerical analysis, we expect you to learn the basics of software engineering in a team, including using version control and software engineering principles such as unit testing. Our hope is you would leave this placement in a great position either to continue with academic research or to pursue a career demanding software skills.

For those who are interested, we predominantly program in Python, specifically using the JAX accelerated numerical computing library. We also sometimes make use of lower-level languages such as CUDA and Rust.

Work Environment

You will work in the office as part of our team including the company founders. We are based in Cambridge (10 min walk from the train station). We typically do 9-5:30 working hours.

We are a heavily collaborative team, so we’re sharing ideas and knowledge throughout the day. We start every day with a 15-minute meeting where we all share what we’ll be working on for that day and if we need any help. As well as your own project, we would love to get your insight on our day-to-day R&D mathematical problems at the whiteboard.

We have a strong emphasis on peer learning, and all of our code goes through review by the team, where we share ideas on how to improve code quality and structure.

References

Fluid Mechanics 101 YouTube Channel: https://www.youtube.com/playlist?list=PLnJ8lIgfDbkoZ33CHr-p6z2CBkp9OTcWj
This is an excellent channel for learning the fundamentals of CFD from a mathematical perspective, and this playlist is a good place to start.

Notes on CFD from the developers of OpenFOAM: https://doc.cfd.direct/notes/cfd-general-principles/
This is an online textbook written by the developers of OpenFOAM (one of the most popular open-source CFD codes), which gives an excellent overview on some of the key algorithms behind CFD.

Numerical Linear Algebra by Trefethen & Bau: https://www.stat.uchicago.edu/~lekheng/courses/309/books/Trefethen-Bau.pdf
This is a more in-depth textbook for learning numerical linear algebra, which should be helpful in learning the fundamentals of iterative schemes.

CFDNet: A deep learning-based accelerator for fluid simulations, Obiols-Sales et al. https://arxiv.org/pdf/2005.04485
This is an interesting paper where deep neural network methods were used to find improved initial guesses for Reynolds Averaged Navier Stokes (RANS) simulations.

On floating point precision in computational fluid dynamics using OpenFOAM, Brogi et al. https://www.sciencedirect.com/science/article/pii/S0167739X23003813
This paper experimented with using different floating point precision for reference PDE solving problems.

Prerequisite Skills Fluids, Numerical Analysis, PDEs, Simulation
Other Skills Used in the Project Statistics, Mathematical physics, Predictive Modelling, Data Visualization, App Building
Acceptable Programming Languages Python, MATLAB, C++, Rust, CUDA, C

 

Utilising existing cut flower performance and quality data to inform and accelerate decisions for future developments and planting decisions

Project Title Utilising existing cut flower performance and quality data to inform and accelerate decisions for future developments and planting decisions
Keywords Horticulture, Predictive modelling
Project Listed 8 January 2025
Project Status Filled
Contact Name Richard Boyle
Contact Email richard.boyle@apexhorticulture.com
Company/Lab/Department APEX Horticulture
Address Pierson Road, The Enterprise Campus, Alconbury Weald, PE28 4YA
Project Duration 8 weeks
Project Open to Undergraduates, Master's (Part III) students
Background Information APEX Horticulture Ltd. is a professional research and development business, offering bespoke services for cut flowers and plants. APEX is based in a purpose-built testing centre, situated in Alconbury, Cambridgeshire (UK). APEX is part of the wider MM Flowers group; MM is one of the UK’s leading cut flower import and processing companies, with a unique ownership model and innovative practices. MM Flowers is owned by the AM Fresh Group, a leading breeder, grower and distributor of citrus and grapes; Vegpro, East Africa’s largest flower and vegetable producer; and Elite, the leading flower grower and breeder in South America. APEX is optimally positioned in the chain, able to deliver high quality, independent research close to market, together with invaluable insight into the true performance of flowers and plants subjected to actual supply chain conditions. The infrastructure and specialised personnel of APEX aim to deliver robust, standardised and consistent research every week of the year, together with the ability to undertake large scale projects to match all client requirements, influencing all elements of the cut flower supply chain. APEX undertakes many different research projects covering the entire supply chain, from development of new flower types through to the manufacturing requirements for the final bouquets. Each of these projects generates a significant amount of data and insight, which is used to provide recommendations to the various stakeholders of each project.
Project Description

APEX tests up to 50k cut flower samples annually, with around 30-60 data points generated per sample. This data ranges from agronomic and freight data through to performance data associated with flower longevity and aesthetic appeal. Many of the samples tested are part of long-term programmes focussed on understanding the performance of different cultivars and farms across seasons and years. Alongside this, prospective new cultivars are tested to understand whether there are alternative and ‘better’ options available to the current selection. However, the process of developing and introducing new cultivars is inefficient, taking anywhere up to 10 years. It relies heavily on the intuition of breeders, and it can be a challenge to successfully introduce new cultivars that meet rapidly changing requirements. For example, whilst many of the cut flowers grown on the equator are transported to Europe by air freight, the entire industry is currently evaluating the possibility of transitioning much of this to sea freight. Whilst this presents many benefits, including for the environment and for availability, it substantially increases the freight time, which many existing flower types and cultivars are not able to withstand.

During the development process, breeders and growers are presented with a dilemma: there is a desire to be informed and led by data (such as from APEX), but this is a slow process because only limited numbers of samples are available initially. Accelerated data generation would require significantly more plants, and thus samples, which demands various resources (time, space and inputs) and carries greater risk if the cultivars prove to be unviable. However, an abundance of data is available from cases where new cultivars have been introduced, with varying levels of success. This has obvious implications for the breeder/grower, but also for those along the supply chain, including suppliers and retailers. Flower types and cultivars selected that do not meet the required standards can result in significant waste, consumer dissatisfaction and potentially brand damage. As such, there is a desire to improve the efficiency and speed of the flower development process whilst minimising, or at least understanding, the associated risks.

Given the above, there are different areas that APEX, MM Flowers and the wider group would like to explore, including –

  • Can historical data available be utilised to create models to predict the likelihood of success of cultivars currently being developed as part of breeding programmes?
  • Where data is available for new flower types and cultivars that have been transported by sea or subjected to sea freight simulations, what impact does this have on determining their viability for successful commercial application?
Work Environment Student led project, supported by wider team. Hybrid working.

 

Filtering of the result of Monte Carlo simulation

Project Title Filtering of the result of Monte Carlo simulation
Keywords Statistics, Mathematical physics, Numerical Analysis, Monte Carlo simulation, Filtering
Project Listed 15 January 2025
Project Status Open
Contact Name Artem Babayan
Contact Email artem.babayan@silvaco.com
Company/Lab/Department Silvaco TCAD
Address Silvaco Europe Ltd., 5 Compass Point, St Ives, PE27 5JL
Project Duration 8-12 weeks full time
Project Open to Undergraduates, Master's (Part III) students
Background Information Mathematical modelling of real-life physical problems
Project Description

Silvaco is a software engineering company that develops tools to assist in the manufacturing of semiconductor devices. In the UK office we mostly work on the 'process simulation' side: mathematical modelling of the processes used in manufacturing.

One such process is implantation: the bombardment of a piece of (typically) Si with ions (dopants) to change the electrical properties of the target in specific areas. To predict the final ion distribution we use Monte Carlo simulation, following the paths of a large number of ions as they fly through the structure. The final results show artefacts typical of Monte Carlo simulation, e.g. single stray particles or missed areas ('hot' and 'cold' spots respectively). We need to apply a filter to these 'raw' results to improve the overall quality.

Your task would be to review the literature, and to suggest and implement the required algorithms.
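
To give a flavour of what such a filter might look like, the sketch below applies a sliding median to a 1D profile of counts. A median filter is only one generic candidate and is not necessarily what Silvaco uses or wants; the function name, the window radius and the sample data are all made up for illustration.

```python
from statistics import median


def median_filter(values, radius=1):
    """Replace each entry by the median of its neighbourhood (window truncated at the edges)."""
    n = len(values)
    return [median(values[max(0, i - radius): min(n, i + radius + 1)]) for i in range(n)]


# Synthetic 1D concentration profile with one 'hot' spot (50) and one 'cold' spot (0).
raw = [5, 5, 5, 50, 5, 5, 0, 5, 5]
smoothed = median_filter(raw)
print(smoothed)

# The isolated outliers are removed, but the total is not preserved.
print(sum(raw), sum(smoothed))
```

The change in the total illustrates one design question for the project: a filter that removes outliers may also alter physically meaningful quantities such as the total implanted dose.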

Work Environment The project assumes a high degree of independence. The development part is expected to be done in the office (in St Ives, near Cambridge).
References  
Prerequisite Skills  
Other Skills Used in the Project Statistics, Mathematical physics, Numerical Analysis, Simulation
Acceptable Programming Languages Python, MATLAB, C++

 

Prisoners Dilemma, LLMs as agents

Project Title Prisoners Dilemma, LLMs as agents
Keywords Game theory, LLM agents, Knowledge graphs, Stochastic modelling
Project Listed 15 January 2025
Project Status Open
Contact Name Uday Kiran
Contact Email kirannu@amazon.com
Company/Lab/Department Amazon Lab126
Address One Station Square, Cambridge, CB1 2GA
Project Duration 8 weeks; full-time
Project Open to Undergraduates, Master's (Part III) students
Background Information The prisoner's dilemma (PD) is a game theory paradox that illustrates how two rational individuals acting in their own self-interest can lead to a suboptimal outcome for the group. It's a thought experiment where each individual can choose to cooperate with their partner for mutual benefit or betray them for personal gain. The dilemma arises because while it's rational for each individual to defect, cooperation would result in a higher payoff for both. This project tries to model rational individuals with pre-biased LLM agents, to make the problem more representative of the real world.
Project Description

The multiplayer prisoner's dilemma

The multiplayer prisoner's dilemma, also known as the n-person prisoner's dilemma (NPD), is a game theory scenario where multiple players must choose between cooperating or defecting:

  • Cooperation: Players work together for the common good
  • Defection: Players pursue their own short-term interests

The outcome for each player depends on their choice and the choices of all other players. If everyone chooses to defect, the outcome is worse for everyone than if they had cooperated.
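
The payoff structure described above can be sketched with a public goods game, one common form of the n-person dilemma. This specific form is an assumption for illustration: each cooperator pays `cost` into a pot, the pot is multiplied by `multiplier` and shared equally among all players. The dilemma arises when 1 < `multiplier` < number of players.

```python
def payoffs(choices, multiplier=2.0, cost=1.0):
    """Payoff per player; choices is a list of bools where True means cooperate."""
    n = len(choices)
    share = multiplier * cost * sum(choices) / n
    return [share - cost if cooperates else share for cooperates in choices]


print(payoffs([True, True, True, True]))      # [1.0, 1.0, 1.0, 1.0]
print(payoffs([False, False, False, False]))  # [0.0, 0.0, 0.0, 0.0]
print(payoffs([True, True, True, False]))     # [0.5, 0.5, 0.5, 1.5]
```

The lone defector earns more than it would by cooperating (1.5 versus 1.0), yet universal defection leaves everyone worse off than universal cooperation (0.0 versus 1.0).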

The NPD became popular in the 1970s among economists and social theorists. It can be used to model many real-world social, political, and economic problems. For example, the tragedy of the commons is a multiplayer generalization of the prisoner's dilemma. In this scenario, villagers must choose between personal gain or restraint. If they all choose to defect, the commons are destroyed.

PD problem for LLMs and knowledge graphs:

This section outlines an approach to redesigning the Prisoner's Dilemma (PD) problem using LLM agents with personalized characteristics. Here's a high-level outline of how you could implement this simulation:

1. LLM-based Agents: 

  • Utilize pre-trained large language models (e.g., GPT-3, BERT) as the foundation for each simulated individual.
  • Fine-tune or condition the LLMs with specific personality traits, using available metadata and knowledge graphs to capture individual characteristics.

2. Knowledge Graph Representation:

  • Construct a knowledge graph that encodes the metadata and relationships for each simulated individual, such as purchase history, search behavior, demographic information, and other relevant attributes.
  • Leverage the knowledge graph to inform the decision-making and behavior of the LLM agents during the Prisoner's Dilemma game.

3. Personality Trait Assignment:

  • Assign generic personality traits (e.g., greedy, mischievous, cooperative, humble, merciful) to the LLM agents based on the information in the knowledge graph.
  • Ensure that the personality traits are reflected in the agents' decision-making and interactions during the Prisoner's Dilemma game.

4. Prisoner's Dilemma Simulation:

  • Create a simulated group of 1,000 to 10,000 LLM agents and have them participate in the Prisoner's Dilemma game.
  • Implement the game mechanics, where each agent must choose between cooperating or defecting, based on their personalized characteristics and the game's payoff structure.

5. Multi-group Simulation:

  • Extend the simulation to include multiple groups of LLM agents, representing different demographics or nations.
  • Facilitate both intra-group and inter-group Prisoner's Dilemma interactions, allowing for the exploration of social behavior responses across different demographics or nations.

Key Considerations:

  • Ensure the LLM agents' decision-making and behavior are realistic and aligned with the assigned personality traits and knowledge graph information.
  • Explore techniques to fine-tune or condition the LLMs to exhibit the desired personality traits and decision-making processes.
  • Carefully design the knowledge graph structure and the mapping between metadata and personality traits to achieve accurate individual and group-level behavior.
  • Implement robust simulation mechanics and data collection to analyze the emergent behavior and outcomes of the Prisoner's Dilemma game across the different groups and scenarios.

Problems to solve for successful completion of the project:

Competitive pricing in the marketplace:

1. Modeling the Marketplace Dynamics:

  • Use multi-agent systems to simulate the interactions between different sellers, each represented by an AI agent.
  • Leverage knowledge graphs to capture the relationships between sellers, their products, pricing, branding, and marketing strategies.
  • Employ stochastic modeling techniques, such as Markov Decision Processes, to capture the uncertainty and dynamic nature of the marketplace.

2. Modeling Seller Behavior:

  • Develop AI agents that can learn and adapt their pricing, branding, and marketing strategies based on the actions of their competitors and the responses of buyers.
  • Utilize reinforcement learning or other machine learning techniques to allow the agents to learn and optimize their strategies over time.
  • Incorporate game-theoretic principles to model the strategic decision-making of the sellers, taking into account the potential responses of their competitors.
  • Incorporate in the knowledge graphs human personality traits (e.g. greedy, mischievous, cooperative, humble, merciful)

3. Modeling Buyer Behavior:

  • Integrate buyer preferences, price sensitivity, and decision-making processes into the model.
  • Leverage techniques like discrete choice modeling or agent-based modeling to capture the heterogeneity and complexity of buyer behavior.
  • Explore how buyer actions and preferences influence the pricing and marketing decisions of the sellers.
  • Incorporate in the knowledge graphs human personality traits (e.g. greedy, mischievous, cooperative, humble, merciful)

4. Incorporating the Tragedy of the Commons:

  • Explore how the shared nature of the marketplace, similar to the tragedy of the commons, can lead to suboptimal outcomes for the sellers & buyers.
  • Investigate strategies or mechanisms that can mitigate the tragedy of the commons, such as coordination, cooperation, or regulation.
  • Analyze how the presence of the tragedy of the commons affects the pricing, branding, and marketing decisions of the sellers.

5. Leveraging Multi-LLM Agents:

  • Utilize multiple large language models (LLMs) to represent different aspects of the marketplace, such as seller decision-making, buyer behavior, and market dynamics.
  • Explore techniques for integrating and coordinating the different LLM agents to create a cohesive and realistic simulation of the marketplace.
  • Investigate how the combination of knowledge graphs, stochastic modeling, and multi-LLM agents can provide a more comprehensive and accurate representation of the competitive pricing landscape.

By incorporating these advanced techniques, students can gain a deeper understanding of the complex dynamics and decision-making processes involved in competitive pricing within a marketplace. This exercise can help them develop skills in multi-agent modeling, stochastic optimization, and the application of knowledge graphs and LLMs to complex real-world problems.

Work Environment

The student will work independently on this project, with myself serving as the industrial supervisor. I will provide regular guidance and mentorship, helping the student define goals, troubleshoot challenges, and refine their approach throughout the project. Although the student will primarily work on their own, I will be readily available for discussions and feedback through scheduled meetings and as needed via email or video calls.

The student will have the flexibility to work remotely, allowing them to structure their schedule to maximize productivity. There are no fixed office or lab hours, but the student is encouraged to maintain consistent progress and attend periodic check-ins to review milestones and ensure alignment with the project goals.

Day-to-day, the student will engage in tasks such as implementing and testing models, analyzing results, and documenting findings. They will have access to tools, datasets, and resources necessary for the project, along with my guidance to navigate technical or conceptual challenges. This setup offers the student a hands-on, immersive experience while fostering independence and problem-solving skills.

References  
Prerequisite Skills Statistics, Probability/Markov Chains, Mathematical Analysis
Other Skills Used in the Project Predictive Modelling, Database Queries, Data Visualization
Acceptable Programming Languages Python, R

 

Hallmarks of cancer regression

Project Title Hallmarks of cancer regression
Keywords predictive biomarkers, multimodal data, hierarchical regression, Hallmarks of cancer
Project Listed 15 January 2025
Project Status Filled
Contact Name Fabio Rigat
Contact Email fabio.rigat@astrazeneca.com
Company/Lab/Department AstraZeneca PLC
Address 36 Hills Road, Cambridge, CB2 8PA
Project Duration 8 weeks full time, ideally between June 2nd and July 31st
Project Open to Undergraduates, Master's (Part III) students
Background Information In oncology, molecular features of prognostic or predictive value are key to matching patients with effective investigational treatment strategies. These features range from a small number of well understood markers, including expression of drug targets and molecular characterisations of the tumour microenvironment, up to high dimensional multi-modal data including genetic variants, gene expression and protein expression. When high dimensional molecular disease features are used, it is challenging to derive robust features providing accurate prediction of response to therapy in validation samples.
Project Description This project will assess a novel supervised learning methodology that estimates low dimensional predictive markers by combining high dimensional disease molecular characteristics with established gene annotation systems based on the Hallmarks of Cancer. The assessment will include running computer simulations to estimate true and false positive outcome probabilities under selected scenarios, applying the method to re-analysis of published datasets, and applying the method to exploratory analyses of internal unpublished data. The outcome of this project will be integrated with ongoing work to provide material towards a publication.
Work Environment The student will work within the AstraZeneca Biometrics environment, supported specifically by members of the Statistical Innovations organisation.
References 1. Douglas Hanahan and Robert A. Weinberg (2011) Hallmarks of Cancer: The Next Generation, Cell, DOI 10.1016/j.cell.2011.02.013
2. Ádám Nagy, Gyöngyi Munkácsy, Balázs Győrffy (2021) Pancancer survival analysis of cancer hallmark genes, https://doi.org/10.1038/s41598-021-84787-5
3. Otília Menyhart, William Jayasekara Kothalawala, Balázs Győrffy (2024) A gene set enrichment analysis for the cancer hallmarks, https://doi.org/10.1016/j.jpha.2024.101065
4. Francesco C Stingo, Yian A Chen, Mahlet G Tadesse, Marina Vannucci (2011) Incorporating biological information into linear models: a bayesian approach to the selection of pathways and genes, https://doi.org/10.1214/11-AOAS463
Prerequisite Skills Statistics, Simulation, Predictive Modelling, Data Visualization, Effective collaboration skills
Other Skills Used in the Project Interest in oncology
Acceptable Programming Languages Python, MATLAB, R

 

Investigation of dopant activation and diffusion in SiC

Project Title Investigation of dopant activation and diffusion in SiC
Keywords tcad, modeling, activation, diffusion, SiC
Project Listed 20 January 2025
Project Status Open
Contact Name Alexandros Kyrtsos
Contact Email alexandros.kyrtsos@silvaco.com
Company/Lab/Department Silvaco, Process Engineering Team
Address Silvaco Europe Ltd. 5, Compass Point, St Ives, PE27 5JL
Project Duration 8 weeks, full time
Project Open to Undergraduates, Master's (Part III) students
Background Information Silvaco is a global leader in electronic design automation (EDA) software and technology computer-aided design (TCAD) solutions. Our cutting-edge tools empower semiconductor companies to design, simulate, and optimize next-generation devices and processes.
Project Description

As a TCAD intern focusing on process simulation, you’ll work alongside experts to investigate dopant activation and diffusion in SiC-4H, developing and enhancing models for these critical semiconductor processes. This is an opportunity to gain hands-on experience, contribute to advanced research, and be part of the innovation that drives the future of semiconductor technology.

The project involves a literature search and research on the activation and diffusion of various dopants in SiC-4H. It also involves the development and validation of models to simulate these processes. The student will have the opportunity to develop skills such as data analysis and visualization, development of physical models, simulation techniques, and programming.

Work Environment Hybrid (mix of on-site and remote work). High degree of independent work is required.
References https://www.iue.tuwien.ac.at/phd/simonka/index.html, chapter 3
Prerequisite Skills Mathematical physics, Simulation, Data Visualization
Other Skills Used in the Project Simulation, Predictive Modelling, Data Visualization
Acceptable Programming Languages Python, MATLAB, C++

 

Discrete Representations of Continuous Probability Distributions

Project Title Discrete Representations of Continuous Probability Distributions
Keywords Distributions, Probability, Representations, Statistics
Project Listed 20 January 2025
Project Status Open
Contact Name Laurence Weir
Contact Email careers@signaloid.com
Company/Lab/Department Signaloid Ltd
Address 4 Station Square, Cambridge, CB1 2GE
Project Duration 8 weeks, full time
Project Open to Undergraduates, Master's (Part III) students
Background Information Probability distributions provide a mathematical framework for understanding and modelling uncertainty, allowing us to quantify the likelihood of different outcomes in random processes. By characterising how data is distributed, they enable informed decision-making and are foundational to fields like statistics, machine learning, and risk assessment. Many of these distributions, such as the famous normal distribution (bell curve), are defined continuously, but in reality we need to represent these distributions with a finite number of discrete points so that we may perform statistical tasks quickly and efficiently on a computer.
Project Description In this project you will work on new discrete representations of probability distributions to try to uncover better ways to capture the shape and form of many theoretical and real world distributions. First you will learn about distributions as rigorous mathematical objects and how you can perform arithmetic on them. You will also learn how we quantify the "closeness" of distributions using distance metrics and criteria. Then, after researching existing methods to represent distributions discretely, you will try to conceive of new and improved methods. Finally, you will test and verify these methods both analytically and numerically through simulations (in Python or a similar language).
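As a rough illustration of the kind of numerical check described above, the sketch below represents a standard normal distribution by n equally weighted points placed at quantile midpoints, and measures "closeness" with an approximate 1-Wasserstein distance. The choice of representation, the choice of metric, and the function names `quantile_points` and `wasserstein1` are assumptions made for illustration; they are not taken from the project or from Signaloid's methods.

```python
from statistics import NormalDist


def quantile_points(dist, n):
    """Return n equally weighted support points at the quantile midpoints (i + 0.5) / n."""
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]


def wasserstein1(dist, points, grid=20_000):
    """Approximate the 1-Wasserstein distance between `dist` and the equally
    weighted discrete distribution on `points`, using a midpoint rule on
    the integral over (0, 1) of |F^{-1}(u) - G^{-1}(u)|."""
    pts = sorted(points)
    n = len(pts)
    total = 0.0
    for k in range(grid):
        u = (k + 0.5) / grid
        idx = min(int(u * n), n - 1)  # index of the discrete quantile at level u
        total += abs(dist.inv_cdf(u) - pts[idx])
    return total / grid


normal = NormalDist(0.0, 1.0)
errors = {n: wasserstein1(normal, quantile_points(normal, n)) for n in (4, 16, 64)}
for n, err in errors.items():
    print(f"n = {n:3d}  approximate W1 = {err:.4f}")
```

Comparing the printed distances for increasing n gives a first feel for how quickly a given representation converges, which is the kind of evidence a new representation would need to improve on.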
Work Environment Join a remote team of industry mathematicians discussing probability theory and real world statistical problems. You will have the chance to talk with your supervisor multiple times per week and have them guide you through the project and oversee your progress.
References https://link.springer.com/article/10.1007/s00362-022-01356-2
Prerequisite Skills Statistics, Probability/Markov Chains, Simulation
Other Skills Used in the Project Mathematical Analysis, Data Visualization, Metric Spaces
Acceptable Programming Languages Python

 

Finite Difference Approximation of Multiphase Stokes Flow with Free Interfaces on Staggered Cartesian Grids

Project Title Finite Difference Approximation of Multiphase Stokes Flow with Free Interfaces on Staggered Cartesian Grids
Keywords Multiphase Stokes Flow, Finite-Difference Methods, PDE, Applied Linear Algebra
Project Listed 24 January 2025
Project Status Open
Contact Name Vasily Suvorov
Contact Email vasily.suvorov@silvaco.com
Company/Lab/Department Silvaco Europe, TCAD
Address Silvaco Technology Centre Compass Point St Ives, Cambridgeshire, United Kingdom PE27 5JL
Project Duration 8 weeks, 40 hours/week
Project Open to Undergraduates, Master's (Part III) students
Background Information Modern semiconductor technology involves processes in which materials with free interfaces undergo large, slow deformations. Such deformations can often be modelled as incompressible Stokes flow. The project will analyse the company’s working numerical approach to modelling such flow, with the aim of improving accuracy, stability and convergence.
Project Description

Silvaco uses finite-difference schemes on structured 2D and 3D Cartesian grids to simulate multiphase Stokes flow with free interfaces. A particular challenge in applying such schemes lies in the accurate approximation of boundary conditions at the interfaces between two different viscous liquids and in the approximation of momentum equations across these interfaces.

The student will assist in analyzing and improving the approximation and stability of the current numerical schemes, with the possibility of proposing better alternatives. Special attention will be given to developing numerical schemes that are well suited to iterative methods such as BiCGSTAB. The resulting matrices will be analysed using SVD, QR factorization, and other appropriate techniques from numerical linear algebra.
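To give a flavour of this workflow, the sketch below builds a small finite-difference matrix with a viscosity jump, solves a linear system with BiCGSTAB, and inspects the singular values of the matrix. It uses a one-dimensional scalar model problem, -(mu u')' = 1, rather than the Stokes equations, and the function `variable_viscosity_operator`, the grid size, and the viscosity values are assumptions made for illustration only. It does not represent Silvaco's schemes.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import bicgstab


def variable_viscosity_operator(n, mu_left, mu_right):
    """Finite-difference matrix for -(mu u')' on n interior nodes of (0, 1),
    with u = 0 at both ends and a viscosity jump at x = 0.5."""
    h = 1.0 / (n + 1)
    faces = (np.arange(n + 1) + 0.5) * h           # midpoints between neighbouring nodes
    mu = np.where(faces < 0.5, mu_left, mu_right)  # viscosity sampled at the faces
    main = (mu[:-1] + mu[1:]) / h**2
    off = -mu[1:-1] / h**2
    return diags([off, main, off], offsets=[-1, 0, 1], format="csr")


A = variable_viscosity_operator(31, mu_left=1.0, mu_right=10.0)
b = np.ones(A.shape[0])

x, info = bicgstab(A, b)  # info == 0 signals convergence
residual = np.linalg.norm(A @ x - b) / np.linalg.norm(b)

# Singular values of the (small, dense) matrix give its 2-norm condition number.
sing = np.linalg.svd(A.toarray(), compute_uv=False)
print("info:", info)
print("relative residual:", residual)
print("condition number:", sing[0] / sing[-1])
```

Repeating the experiment with a larger viscosity contrast shows how the condition number, and typically the iteration count, grows with the jump across the interface.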

Work Environment The student will work independently, with support and guidance from the supervisor.
References  
Prerequisite Skills Numerical Analysis, PDEs, Algebra/Number Theory
Other Skills Used in the Project  
Acceptable Programming Languages Python, MATLAB

 

Exploring the use of Generative Adversarial Networks for synthetic data generation

Project Title Exploring the use of Generative Adversarial Networks for synthetic data generation
Keywords Generative adversarial networks (GANs), Synthetic data, Toxicology, Neural networks, Applied scientific computing
Project Listed 27 January 2025
Project Status Open
Contact Name Patrik Engi
Contact Email patrik.engi@unilever.com
Company/Lab/Department Unilever SERS
Address Colworth Science Park, Sharnbrook, Bedford MK44 1LQ
Project Duration 8-12 weeks
Project Open to Undergraduates, Master's (Part III) students
Background Information

In fast-moving consumer goods, it is vital that safety risk assessments are conducted to ensure products are safe for humans and the environment. Historically, these risk assessments have relied on the use of in vivo animal testing to identify detrimental impacts of chemicals on organisms. However, from a scientific, ethical, and legislative perspective, more recently developed non-animal methods are the preferred approach.

For more than 20 years, Unilever’s Safety, Environmental and Regulatory Science (SERS) has been developing novel in silico and in vitro based methods, which leverage recent advances in biology, genetics, computing, mathematics and statistics, to conduct safety assessments without the use of animal testing [1, 2].

This evolution in the risk assessment paradigm presents new opportunities in terms of applying new deep learning and AI-based approaches. A key risk assessment step is to characterise the potential effects that a chemical may have on different cell types. This typically involves using high throughput transcriptomics (HTTr) to measure the genetic response of cells to different concentrations of a test chemical. Such data can be expensive to generate, particularly if it needs to be generated for multiple chemicals and cell types. Furthermore, it is common to encounter situations where not all the necessary data for a risk assessment is readily available.

Therefore, the use of approaches that maximise the utility of the available data is a high priority. Recent advances in deep learning and AI may provide a way to generate so-called synthetic data. These could be used either to fill data gaps within a risk assessment or to make predictions of the effects a chemical might cause at a gene transcriptional level. This project would focus on exploring the utility of Generative Adversarial Networks in this application.

Project Description

GANs generate synthetic data through two competing neural networks, a generator and discriminator, engaging in a zero-sum game. This project will therefore provide the opportunity to apply and expand existing knowledge in various areas, such as statistics, probabilistic machine learning, and game theory, while also building skills and experience in applied scientific computing.
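For orientation, the zero-sum game mentioned above is commonly written as the following minimax problem. The notation is an assumption introduced here and does not appear in the project listing: D(x) is the discriminator's estimated probability that a sample x is real, G(z) is the generator's output for a noise input z, p_data is the data distribution, and p_z is the noise distribution. Variants used in the toxicology literature may differ from this basic form.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator gains exactly what the generator loses, which is why the setup is described as zero-sum.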

We suggest that the student(s) approach the topic as an open-ended research project, focusing on recent developments using GANs in in vitro toxicology [3]. We would like the student to build on the existing science by:

  • Reviewing the current literature landscape, highlighting relevant papers, tools, and resources.
  • Developing their technical knowledge of deep learning, statistics and scientific computing.
  • Identifying, implementing and testing various tools that already exist in this space.

This phase of the project would involve the student familiarizing themselves with the current state-of-the-art regarding the application of GANs in toxicology, guided by the available literature as in Refs. [3-5]. Once achieved, we would like the student to advance this field of application by:

  • Generating synthetic data beyond the examples found in the literature.
  • Identifying suitable tools and making the necessary adaptations to generate data for in-house studies.

Throughout the project, the student will have the opportunity to meet with experts from various mathematical backgrounds, as well as collaborate with other disciplines such as toxicology, human biology, and risk assessment.

Work Environment The student will mostly work independently, but will have the full support of a wider team and supervisors for questions, guidance and advice. We expect the student to work remotely, but visits to the site (Sharnbrook, near Bedford) are encouraged if travel permits.
References [1] J. Reynolds, S. Malcomber and A. White, “A Bayesian approach for inferring global points of departure from transcriptomics data,” Computational Toxicology, vol. 16, p. 100138, November 2020.
[2] T. E. Moxon, H. Li, M.-Y. Lee, P. Piechota, B. Nicol, J. Pickles, R. Pendlington, I. Sorrell and M. T. Baltazar, “Application of physiologically based kinetic (PBK) modelling in the next generation risk assessment of dermally applied consumer products,” Toxicology in Vitro, vol. 63, p. 104746, March 2020.
[3] Chen X, Roberts R, Tong W, Liu Z. Tox-GAN: An Artificial Intelligence Approach Alternative to Animal Studies-A Case Study With Toxicogenomics. Toxicol Sci. 2022 Mar 28;186(2):242-259. doi: 10.1093/toxsci/kfab157. PMID: 34971401.
[4] Chen, X., Roberts, R., Liu, Z. et al. A generative adversarial network model alternative to animal studies for clinical pathology assessment. Nat Commun 14, 7141 (2023). https://doi.org/10.1038/s41467-023-42933-9
[5] Lee, M. Recent Advances in Generative Adversarial Networks for Gene Expression Data: A Comprehensive Review. Mathematics 2023, 11, 3055. https://doi.org/10.3390/math11143055
Prerequisite Skills Statistics
Other Skills Used in the Project Predictive Modelling, Data Visualization, Deep learning
Acceptable Programming Languages No Preference

 

Bucketed interest rate risk

Project Title Bucketed interest rate risk
Keywords Financial Mathematics, Interest Rates, Risk Management
Project Listed 30 January 2025
Project Status Closed
Contact Name Chris Hunter, Jennifer Shaeffer
Contact Email jshaeffer@pharo.com
Company/Lab/Department Pharo Management
Address 8 Lancelot Place, London, SW7 1DR
Project Duration 8 weeks, full time
Project Open to Undergraduates, Master's (Part III) students
Background Information

Pharo Management is a leading global macro hedge fund manager with a focus on emerging markets. Founded in 2000, the firm has offices in London, New York and Hong Kong and currently manages approximately $7 billion in assets across four funds. Pharo trades foreign exchange, sovereign and corporate credit, local market interest rates, commodities, and their derivatives.  We trade in over 70 countries across Asia, Central and Eastern Europe, the Middle East and Africa, Latin America as well as developed markets. Our investment approach combines macroeconomic fundamental research and quantitative analysis.

Pharo employs a diverse, dynamic team of over 125 professionals representing over 20 nationalities and 30 languages. We have a strong corporate culture anchored in core values such as collaborative spirit, creativity, and respect. We are passionate about what we do and are committed to attracting the best and brightest talent.

Project Description

Expected Outcomes

By the end of the internship, the intern will have developed a clear understanding of risk transformations in interest rate modeling, implemented practical computation methods for Jacobian-based transformations, and potentially explored advanced techniques using algorithmic differentiation. The project will contribute to more efficient and accurate risk management methodologies in fixed-income markets.

Project Overview

This internship project focuses on the transformation of bucketed interest rate risk using Jacobian matrices and, if time permits, the computation of bucketed risk using algorithmic differentiation (AD). The goal is to enhance methodologies for understanding and managing interest rate sensitivities in financial models.

Interest rate risk is commonly analyzed by measuring sensitivities to shifts in specific maturity buckets (e.g., 1Y, 5Y, 10Y). However, for risk aggregation, stress testing, and hedging, these bucketed sensitivities must often be transformed into different risk bases, such as principal component decompositions or forward-rate perturbations (for example, 1Y1Y, 2Y3Y and 5Y5Y). This transformation is mathematically represented as a Jacobian matrix operation, which maps one set of risk factors to another while preserving sensitivity structure.
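The transformation described above can be sketched with a small example that maps bucketed zero-rate risk (1Y, 2Y, 5Y, 10Y) into risk against the forward rates 0Y1Y, 1Y1Y, 2Y3Y and 5Y5Y. The sketch assumes continuously compounded rates, so that each zero rate is a time-weighted average of the forward rates up to its maturity. The risk numbers are made up, and the function name `zero_from_forward_jacobian` is hypothetical.

```python
import numpy as np

tenors = np.array([1.0, 2.0, 5.0, 10.0])   # bucket maturities in years
widths = np.diff(tenors, prepend=0.0)       # forward period lengths: 1, 1, 3, 5


def zero_from_forward_jacobian(tenors, widths):
    """J[i, j] = d z_i / d f_j for continuously compounded rates,
    where z_i * T_i = sum over j <= i of f_j * width_j."""
    lower = np.tril(np.ones((len(tenors), len(tenors))))
    return lower * widths[None, :] / tenors[:, None]


J = zero_from_forward_jacobian(tenors, widths)

zero_risk = np.array([120.0, -340.0, 510.0, 275.0])  # made-up dV/dz per bucket
forward_risk = J.T @ zero_risk                        # chain rule: dV/df = J^T dV/dz

# The map is invertible, so the original bucketed risk can be recovered.
recovered = np.linalg.solve(J.T, forward_risk)

print("forward-rate risk:", forward_risk)
print("total risk preserved:", np.isclose(forward_risk.sum(), zero_risk.sum()))
```

Each row of J sums to one, so a parallel shift in all forward rates is a parallel shift in all zero rates, and the total sensitivity is the same in both bases.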

Algorithmic Differentiation (AD) is a computational technique used to efficiently compute derivatives of functions expressed as computer programs. Unlike symbolic differentiation, which can lead to expression swell, or numerical differentiation, which suffers from truncation and rounding errors, AD systematically applies the chain rule of differentiation at the elementary operation level, allowing for highly accurate and efficient gradient computations. AD is particularly useful in financial applications such as risk management and derivatives pricing, where sensitivity analysis and risk calculations must be performed quickly and accurately.

Key Objectives

1. Jacobian Risk Transformations

  • Understand and implement transformations between different risk factor bases.
  • Construct and validate the Jacobian matrix that links different interest rate risk representations.
  • Analyze stability and efficiency of transformations in practical applications.

2. Algorithmic Differentiation for Risk Computation

  • If time permits, explore the use of algorithmic differentiation (AD) to compute bucketed interest rate risk.
  • Compare AD-based risk computation with traditional finite difference methods in terms of accuracy and performance.
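A minimal sketch of such a comparison is shown below. It implements forward-mode AD with dual numbers in plain Python and compares the result with central finite differences for a toy instrument whose value is a sum of discounted cashflows. The instrument, the rates, and the names `Dual`, `bond_value`, `ad_risk` and `fd_risk` are assumptions made for illustration. Production work would more likely use an AD tool and adjoint (reverse-mode) differentiation, which this sketch does not cover.

```python
import math


class Dual:
    """Number val + der * eps with eps**2 = 0; `der` carries the derivative."""

    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.der + self.der * other.val)

    __rmul__ = __mul__


def exp(x):
    """Exponential that propagates derivatives through Dual inputs."""
    if isinstance(x, Dual):
        e = math.exp(x.val)
        return Dual(e, e * x.der)
    return math.exp(x)


def bond_value(zero_rates, times, cashflows):
    """Present value of cashflows discounted at continuously compounded zero rates."""
    total = 0.0
    for z, t, c in zip(zero_rates, times, cashflows):
        total = total + c * exp(-t * z)
    return total


def ad_risk(zero_rates, times, cashflows):
    """Bucketed risk dV/dz_i via forward-mode AD, one seeded pass per bucket."""
    risks = []
    for i in range(len(zero_rates)):
        seeded = [Dual(z, 1.0 if j == i else 0.0) for j, z in enumerate(zero_rates)]
        risks.append(bond_value(seeded, times, cashflows).der)
    return risks


def fd_risk(zero_rates, times, cashflows, bump=1e-4):
    """Bucketed risk via central finite differences with the given bump size."""
    risks = []
    for i in range(len(zero_rates)):
        up, down = list(zero_rates), list(zero_rates)
        up[i] += bump
        down[i] -= bump
        diff = bond_value(up, times, cashflows) - bond_value(down, times, cashflows)
        risks.append(diff / (2.0 * bump))
    return risks


times = [1.0, 2.0, 5.0, 10.0]
zero_rates = [0.030, 0.032, 0.035, 0.040]
cashflows = [5.0, 5.0, 5.0, 105.0]

ad = ad_risk(zero_rates, times, cashflows)
fd = fd_risk(zero_rates, times, cashflows)
for t, a, f in zip(times, ad, fd):
    print(f"{t:>4.0f}Y  AD = {a:.8f}  FD = {f:.8f}  diff = {abs(a - f):.2e}")
```

Forward mode needs one pass per bucket, so its cost grows with the number of risk factors. That scaling is one reason adjoint methods are attractive when a single value depends on many inputs.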

Skills & Technologies

  • Mathematics & Finance: Linear algebra, calculus (Jacobian matrices), financial risk modeling.
  • Programming: Python (NumPy, SciPy), potential exposure to AD tools such as JAX or TAPENADE.
  • Computational Techniques: Matrix transformations, differentiation techniques, numerical stability considerations.
Work Environment Ideally the student will work at the Pharo office (central London), supported by myself and other members of the Quantitative Analytics team.  Remote working would be considered.
References

For an introduction to risk transformations using Jacobian matrices, refer to:
Darbyshire, J. (2017) – Pricing and Trading Interest Rate Derivatives.

For an introduction to algorithmic differentiation, refer to:
Burgess, N. – Algorithmic Adjoint Differentiation (AAD) for Swap Pricing and DV01 Risk.

Prerequisite Skills Python, Linear Algebra
Acceptable Programming Languages Python

 

Machine learning on multimodal and unstructured data for healthcare applications

Project Title Machine learning on multimodal and unstructured data for healthcare applications
Keywords AI, machine learning, healthcare, multimodal data
Project Listed 4 February 2025
Project Status Open
Contact Name Sam Genway
Contact Email sam.genway@lifearc.org
Company/Lab/Department LifeArc
Address Accelerator Building, Open Innovation Campus, Stevenage, SG1 2FX
Project Duration 9 weeks, full-time - starting 30 June 2025
Project Open to Undergraduates, Master's (Part III) students
Background Information

At LifeArc, our ambition is to make life science life changing. We do this by advancing scientific discoveries beyond the lab, faster, so that they can shape the next generation of diagnostics, treatments, and cures.

Working at the cutting edge of translational science and as the early-stage translation specialists, we progress scientific discoveries on their journey to becoming a medicine, diagnostic or intervention that can improve patients’ lives. Our work begins by seeking out innovative science, then helping to develop this to a point where there is a clinical and commercial pathway for others to invest the time and money to take it further forward.

Data Sciences is an integral part of LifeArc’s Science organisation. We work with our laboratory-based projects to analyse, visualise and interpret data in order to design future experiments; we build computational models to make predictions, often using the latest AI and machine learning methods; we develop computational workflows and write software; we work closely with LifeArc colleagues and with external collaborators in multiple project teams. Our methods are applied to tackle problems in chemistry, biology and medical/clinical science.

What we can offer you:
Because we understand everyone has different requirements, our flexible benefits allow you to choose those which are important to you. Our pension scheme offers employer contributions of up to 12%, private health insurance, and annual leave of 31 days PLUS bank holidays (prorated for duration of placement).

Join us, and you’ll have the scope to be creative and take measured risks. You’ll be rewarded for your curiosity, for working as one team, and for learning fast. And you’ll have everything you need to be your best every day.

We all have potential. At LifeArc, you’ll discover what you can really do with it.

Project Description

Job Title: Summer Placement Student (Data Sciences)
Location: Stevenage
Job Type: Temporary (9 weeks) full-time
Placement Start Date: 30 June 2025
Salary: £22,575 per annum (£21,500 base, plus £1,075 allowance, which can be taken as cash, or used for additional benefits) – prorated for duration of summer placement

At LifeArc, we want to hear from people who are as passionate as we are about making life science life changing and improving patients’ lives.

A bit about the role:

This is an opportunity to get involved in exciting work within the Data Sciences team at our state-of-the-art facilities in Stevenage. LifeArc is a self-funded not-for-profit organisation with a mission to address unmet patient needs. Artificial Intelligence (AI) brings new paths to patient impact through the development and translation of healthcare AI applications. However, several challenges exist in applying machine learning methods to real-world problems, such as predicting patients at risk of disease, or informing a diagnosis or prognosis. One particular challenge is leveraging multimodal data available in patient cohorts during training to create impactful models which remain useful when some modalities are not available at inference time. Another challenge is the formulation of machine learning methods for time-to-event predictions leveraging unstructured datasets. In each case, there are multiple approaches which could be valuable, and the aim of this project is to compare and identify those with real-world utility.

Project

The project will focus on one or two of the following challenges:

  • Multimodal machine learning with modalities available exclusively during training. This has broad relevance across applications where data modalities are available in clinical study data, with the potential to create better representations and models even when these modalities are not available at inference time (for example, in a healthcare setting). Examples include imaging data alongside patient histories, biomarkers and demographic data. A wide variety of methods exist to tackle this challenge, including imputation, joint representation learning, and model distillation. The aim of this project is to explore which approach(es) would be of utility in realistic scenarios.
  • Machine learning time-to-event predictions from unstructured data. This is relevant for identifying patients at risk of disease or clinical events of interest using unstructured data such as images or text. A number of approaches have been developed, ranging from formulating appropriate loss functions to neural ODEs. The goal here is to explore each within the context of real-world applications and identify which are of utility.
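As one example of a loss function for time-to-event prediction, the sketch below computes the negative log partial likelihood of the Cox proportional hazards model with NumPy. Using the Cox model here is an assumption made for illustration and is not stated in the project description. In practice the risk scores would be the outputs of a model applied to images or text, and the function name `cox_neg_log_partial_likelihood` is hypothetical. The sketch assumes there are no tied event times.

```python
import numpy as np


def cox_neg_log_partial_likelihood(risk_scores, times, events):
    """Negative log partial likelihood of the Cox model, averaged over events.

    risk_scores: predicted log-hazard per subject (higher means higher risk)
    times:       observed time per subject (event or censoring time)
    events:      1 if the event was observed, 0 if the subject was censored
    Assumes no tied event times.
    """
    order = np.argsort(-times)  # descending time, so each risk set is a prefix
    scores = risk_scores[order]
    observed = events[order].astype(bool)
    # log of the sum of exp(score) over everyone still at risk at each time
    log_risk_set = np.logaddexp.accumulate(scores)
    n_events = max(int(observed.sum()), 1)
    return -np.sum(scores[observed] - log_risk_set[observed]) / n_events


times = np.array([1.0, 2.0, 3.0])
events = np.array([1, 1, 1])

concordant = np.array([2.0, 1.0, 0.0])  # highest risk score fails first
discordant = np.array([0.0, 1.0, 2.0])  # highest risk score fails last

loss_good = cox_neg_log_partial_likelihood(concordant, times, events)
loss_bad = cox_neg_log_partial_likelihood(discordant, times, events)
print("concordant scores:", loss_good)
print("discordant scores:", loss_bad)
```

The loss is lower when subjects with higher scores experience the event earlier, and censored subjects contribute only through the risk sets of others.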

About you:

Education & experience required:

  • You will currently be in your 2nd, 3rd or 4th year and studying for your first degree and available to commence a summer placement from 16 June 2025.
  • You will be studying a Mathematics degree.

Skills and strengths we are looking for during the recruitment and selection process:

  • On track to achieve at least a 2:1 classification
  • Collaboration
  • Drive and determination
  • Analytical mindset
  • Accountability
  • Adaptability
  • Desire to learn
  • Effective communication skills

You are not expected to have a deep background in life sciences or healthcare. We want to hear from students who are passionate about the application of machine learning methods for real world impact in healthcare, with experience in predictive modelling and hands-on programming experience in Python. Candidates must have the right to work in the UK.
 

Application Process: Applications are open from 5 February 2025.

As part of the application process, please send the following via email to Sam Genway (Scientific Director, Machine Learning and AI) at Sam.Genway@lifearc.org:

  • CV
  • Cover Letter - as well as telling us why you are interested in the project and why you think you would be a good fit, please include a short response to the following questions (max 150 words per question) 
    • What interests you about a role within our industry (Translational Science)?
    • What part of LifeArc, or the work we are involved in, do you personally believe in the most, and why?

Application Closing: 28 February 2025.

If your application is successful, you will be invited to a final stage virtual interview. Full instructions and guidance on how to approach and prepare for the assessment will be provided.

We are also proud to be using Rare Recruitment's Contextual Recruitment System (CRS), which allows us to consider your achievements in the context in which they were gained. We understand that not all candidates’ achievements look the same on paper, and we want to recruit the best people, from every background. We would therefore encourage you to submit your contextualised data using the Rare Contextual Recruitment System as part of your application using this link: https://lifearc.app.contextualrecruitment.com/apply/cf4cc979-16f6-435b-922d-ca52259fb839

Work Environment The student will work at our Stevenage site on their own project, but with regular supervision and with other members of the data sciences group available to talk to about the project.
References  
Prerequisite Skills Statistics, Predictive Modelling
Other Skills Used in the Project Image processing
Acceptable Programming Languages Python