
Summer Research Programmes

 

2026 Industrial CMP Projects

Below you will find the list of industrial CMP projects hosted by external companies. A separate list covers the academic projects hosted by other departments and labs within the university.

New projects may be added so check back regularly!

 

How to Apply

Unless alternative instructions are given in the project listing, to apply for a project you should send your CV to the contact provided along with a covering email which explains why you are interested in the project and why you think you would be a good fit.  

Need help preparing a CV or advice on how to write a good covering email? 

The Careers Service are there to help!  Their CV and applications guides are packed full of top tips and example CVs.  

Looking for advice on applying for CMP projects specifically?  Check out this advice from CMP Co-Founder and Cambridge Maths Alumnus James Bridgwater.  

Remember: it’s better to put the work into making fewer but stronger applications tailored to a specific project than firing off a very generic application for all projects – you won’t stand out with the latter approach!  

Please note that to participate in the CMP programme you must be a student in Part IB, Part II, or Part III of the Mathematical Tripos at Cambridge.  

 

Want to know more about a project before you apply? 

Come along to the CMP Lunchtime Seminar Series in February 2026 to hear the hosts give a short presentation about their project.  There will be an opportunity afterwards for you to chat informally with hosts about their projects. 

Alternatively (or as well!), you can reach out to the contact given in the project listing to ask questions. 

 


Industrial CMP Project Proposals for Summer 2026

 

Exploring Gene Embeddings for Biological Analysis

Project Title Exploring Gene Embeddings for Biological Analysis
Keywords Gene embeddings, networks, large language models, perturbation assays, gene interactions
Project Listed 9 January 2026
Project Status Open
Application Deadline 27 February 2026
Project Supervisor Marie Lisandra Zepeda Mendoza
Contact Name Marie Lisandra Zepeda Mendoza
Contact Email vmnz@novonordisk.com
Company/Lab/Department Novo Nordisk Research Centre Oxford
Address Old Road Campus, Roosevelt Drive, Oxford, OX3 7FZ
Project Duration 8 weeks, full time
Project Open to Masters students (Part III)
Background Information

Genes are the basic units of heredity and encode the information for the synthesis of proteins and other molecules that perform various functions in living organisms. Understanding the relationships between genes and their functions is a fundamental challenge in biology and medicine. One way to approach this challenge is to represent genes as numerical vectors, also known as embeddings, that capture some aspects of their biological properties and interactions. Embeddings can be derived from various sources of data, such as gene sequences, gene expression, gene ontology, protein-protein interactions, and literature. Embeddings can then be used for various tasks, such as gene clustering, gene function prediction, gene-disease association, and gene pathway analysis.

The project is part of a broader and strategically critical goal for AI applied to computational biology in R&D at Novo Nordisk, which relates to the use, validation and control of gene embeddings for novel target and biomarker discovery and functional contextualization.

Project Description

Aim
The aim of this project is to explore how to define embeddings for genes and how they relate to each other, and to evaluate which embeddings are most useful and which data sources should be included. The specific objectives are:

  • To review the existing methods and tools for generating gene embeddings from different data sources
  • To compare and contrast different types of gene embeddings, such as sequence-based, expression-based, ontology-based, interaction-based, and literature-based. Of particular interest will be recent work on augmenting large language models with domain-specific tools, such as database utilities, for more precise access to specialized knowledge (e.g. GeneGPT).
  • To apply and test different gene embeddings on various biological analysis tasks, such as gene function prediction, gene-disease association, gene pathway analysis and drug target prediction.

Methodology
The methodology of this project consists of the following steps:

  • To query existing public and internal databases for the general characterization of a gene, including:
    • Which tissues/cell types is the gene expressed in?
    • Which pathways is it part of, and which other genes are in those pathways?
    • Which diseases is it known to be associated with?
    • What is the interaction network of this gene in a particular tissue?
    • Has the gene been previously explored?
    • Is there patent data and/or human clinical data for the gene?
    • What assay, cell type and conditions should be used for validation?
  • To select and implement appropriate methods and tools for generating gene embeddings from the different data sources (e.g. word2vec, doc2vec, autoencoders, graph neural networks), with particular emphasis on transformers / large language models to represent literature information.
  • To evaluate and compare the quality and performance of different gene embeddings on various biological analysis tasks, such as gene function prediction, gene-disease association, gene pathway analysis and drug target predictions, using appropriate metrics and benchmarks.
  • Of particular interest is the downstream comparison of the a priori, information-based embeddings to the gene embeddings derived from the cellular genetic perturbation in vitro imaging screening assays that Novo Nordisk has in house.
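As a toy illustration of one interaction/ontology-based approach (the gene and pathway labels here are invented; the project would use real databases and richer models such as word2vec, graph neural networks, or LLMs), gene embeddings can be derived from a gene-by-pathway co-membership matrix via truncated SVD and compared with cosine similarity:

```python
import numpy as np

# Hypothetical binary gene-by-pathway membership matrix
# (rows: genes INS, GCG, LEP, TP53; columns: four made-up pathways).
M = np.array([
    [1, 1, 0, 0],   # INS  : shares two pathways with GCG
    [1, 1, 0, 0],   # GCG
    [0, 1, 1, 0],   # LEP  : overlaps GCG in one pathway
    [0, 0, 0, 1],   # TP53 : disjoint from the others
], dtype=float)

# Truncated SVD gives low-dimensional embeddings; k = rank(M) here,
# so pairwise inner products of embeddings match those of the rows of M.
U, S, _ = np.linalg.svd(M, full_matrices=False)
k = 3
emb = U[:, :k] * S[:k]          # one k-dimensional vector per gene

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Genes sharing pathways should be closer than unrelated genes.
sim_ins_gcg = cosine(emb[0], emb[1])   # INS vs GCG: identical pathway profiles
sim_ins_tp53 = cosine(emb[0], emb[3])  # INS vs TP53: disjoint pathways
print(sim_ins_gcg > sim_ins_tp53)
```

The same cosine-similarity evaluation extends directly to embeddings from any of the other data sources listed above.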

Expected Outcomes 

  • A comprehensive review of the existing methods and tools for generating gene embeddings from different data sources.
  • A comparative analysis of the different types of gene embeddings and their applications on various biological analysis tasks.
  • A critical evaluation of the strengths and limitations of different gene embeddings and data sources, and suggestions for possible improvements and extensions.
  • A highly valuable comparison of the a priori embeddings to the in-house embeddings from Novo Nordisk’s perturbation assays.

The implications of this project are:

  • To provide a better understanding of the relationships between genes and their functions, and to facilitate the discovery of new biological insights and hypotheses.
  • To contribute to the advancement of the field of gene embeddings and their applications in biology and medicine.
  • To demonstrate the potential and challenges of applying natural language processing and machine learning techniques to biological data.
References Soman, Karthik, et al. "Biomedical knowledge graph-enhanced prompt generation for large language models." arXiv preprint arXiv:2311.17330 (2023).
Chen YT, Zou J. GenePT: A Simple But Hard-to-Beat Foundation Model for Genes and Cells Built From ChatGPT. bioRxiv [Preprint]. 2023. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10614824/
Integrating and formatting biomedical data as pre-calculated knowledge graph embeddings in the Bioteque. https://www.nature.com/articles/s41467-022-33026-0
In silico biological discovery with large perturbation models https://www.nature.com/articles/s43588-025-00870-1
Work Environment The student will work closely with the supervisor and will be able to interact with other colleagues in the bioAI Department, at both the Oxford and London sites. We are a fully computational team, but at the Oxford site there are also various in vitro expert teams to which the student can be exposed. Hybrid working is fine with the supervisor.
Prerequisite Skills Statistics, Image processing, Geometry / Topology, Mathematical analysis, Simulation, Predictive Modelling, Database queries, Data Visualisation, Probability / Markov Chains
Other skills used in the Project Statistics, Probability / Markov Chains, Image processing, Mathematical analysis, Geometry / Topology, Simulation, Predictive Modelling, Database queries, Data Visualisation
Acceptable Programming Languages Python, R
Additional Requirements Enthusiasm for biological applications of maths and a willingness to learn. Good communication and presentation skills are also desirable.
Application Instructions Send your CV to the contact provided above along with a covering email which explains why you are interested in the project and why you think you would be a good fit.

 

Virtual Cells with Large Language Models

Project Title Virtual Cells with Large Language Models
Keywords Virtual Cell, LLMs, Causality, In-context learning
Project Listed 9 January 2026
Project Status Open
Application Deadline 27 February 2026
Project Supervisor Marc Boubnovski Martell and Josefa Stoisser
Contact Name Josefa Stoisser
Contact Email ofsr@novonordisk.com
Company/Lab/Department Novo Nordisk, BioAI team
Address Novo Nordisk R&D Digital Hub, Pancreas Rd, London, N1C 4AG, UK
Project Duration 8-10 weeks, full-time
Project Open to Masters students (Part III)
Background Information

A virtual cell is an in-silico model (a kind of “digital twin”) that lets us predict how a living cell will respond to interventions (e.g. adding a drug). This sits in a fast-moving area of BioAI, with community benchmarks such as the Virtual Cell Challenge [1] pushing models toward realistic generalisation settings.

At the core of the virtual cell is biological perturbation prediction: given a baseline cell state, predict how the cell changes after an intervention (e.g., knocking out a gene). Conceptually, this is a causal effect problem, made hard by biological confounding, incomplete measurements, and distribution shift (new cell types, new perturbations, new experimental settings).

Our recent work (“LangPert”, ICLR 2025 workshop spotlight [2]) suggests a practical path forward: use LLMs to retrieve and synthesise mechanistic biological context (gene function, pathways, interactions, etc.) and condition predictive models on that context. The key benefit is zero-shot or low-data generalisation to perturbations the model has not seen during training, while also producing explanations that are at least partially aligned with known biology. In parallel, LLM-powered causal analysis motivates a causality-first approach to virtual cells [3, 4].

Project Description

The project will explore LLM-informed causal modelling for perturbation prediction. The high-level aim is to use LLM knowledge as contextual guidance, not as a replacement for data, so that models can better infer gene–gene relationships and predict outcomes of interventions, especially when faced with novel perturbations or shifted experimental conditions.

  1. Context building with LLMs: Retrieve and summarise biologically relevant information for a given perturbation. Explore different ways of representing the context.
  2. Causal / intervention-aware prediction: Combine LLM-derived context with state-of-the-art models for intervention prediction and causal effect estimation (e.g., Do-PFN [5]). Investigate how LLM context should enter the model (e.g., as an uncertainty-aware DAG prior), and whether identification-consistent in-context exemplars (baseline and singles) are sufficient for generalising to unseen combinations.
  3. Generalisation and robustness: Evaluate performance on held-out perturbations and under distribution shift (e.g., new cell types). Run ablations and compare outcomes to causal identification expectations to measure when context helps vs. hurts.
  4. Interpretability (if time): Assess whether model rationales and retrieved context align with known biology and identify failure modes (e.g. hallucinated context).

Because the AI landscape changes quickly, the specific LLM, retrieval approach, and causal estimator will be finalised at the project start to reflect the best available options.
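To make step 2 concrete, here is a minimal, hand-rolled sketch (all genes and edge weights invented; the real project would use learned estimators such as Do-PFN) of how an LLM-derived gene–gene DAG prior could drive intervention predictions in a linear-Gaussian structural equation model:

```python
import numpy as np

# Hypothetical weighted adjacency distilled from LLM-retrieved context:
# W[i, j] = strength of edge gene_i -> gene_j (a DAG prior, not ground truth).
genes = ["A", "B", "C"]
W = np.array([
    [0.0, 0.8, 0.0],   # A -> B
    [0.0, 0.0, 0.5],   # B -> C
    [0.0, 0.0, 0.0],
])

def simulate(do=None, n=1000):
    """Linear-Gaussian SEM in topological order; `do` maps a gene index to a
    clamped value (a knockout would clamp expression to 0)."""
    rng = np.random.default_rng(0)
    x = np.zeros((n, len(genes)))
    for j in range(len(genes)):          # genes are already topologically ordered
        if do is not None and j in do:
            x[:, j] = do[j]              # intervention: ignore parents
        else:
            x[:, j] = x @ W[:, j] + rng.normal(0, 0.1, n)
    return x

baseline = simulate().mean(axis=0)
# Clamping A high shows the effect propagating through the DAG prior:
overexpress_a = simulate(do={0: 2.0}).mean(axis=0)
print(overexpress_a)   # B ~ 0.8*2 = 1.6, C ~ 0.5*1.6 = 0.8
```

An uncertainty-aware version would replace the fixed weights in `W` with distributions reflecting the LLM's confidence in each retrieved edge.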

Successful outcome:

  • A reproducible experimental pipeline (datasets, baselines, evaluation splits, and ablations).
  • A short technical report summarising findings and recommendations.
  • If results are strong and pass internal review, the work may be included in an AI workshop paper submission.
References [1] Virtual Cell Challenge. https://virtualcellchallenge.org/.
[2] Märtens, K., Boubnovski Martell, M., Prada-Medina, C. A., & Donovan-Maiye, R. (2025). LangPert: LLM-Driven Contextual Synthesis for Unseen Perturbation Prediction. MLGenX Workshop at ICLR 2025 (Oral). https://openreview.net/forum?id=Tmx4o3Jg55.
[3] Wang, X., Zhou, K., Wu, W., Singh, H. S., Nan, F., Jin, S., Philip, A., Patnaik, S., Zhu, H., Singh, S., Prashant, P., Shen, Q., & Huang, B. (2025). Causal-Copilot: An Autonomous Causal Analysis Agent. arXiv:2504.13263 [cs.AI]. https://doi.org/10.48550/arXiv.2504.13263.
[4] Kıcıman, E., Ness, R. O., Sharma, A., & Tan, C. (2024). Causal Reasoning and Large Language Models: Opening a New Frontier for Causality. Transactions on Machine Learning Research (TMLR). https://doi.org/10.48550/arXiv.2305.00050
[5] Robertson, J., Reuter, A., Guo, S., Hollmann, N., Hutter, F., & Schölkopf, B. (2025). Do-PFN: In-Context Learning for Causal Effect Estimation. NeurIPS 2025. https://doi.org/10.48550/arXiv.2506.06039.
Work Environment The student will join the BioAI team at Novo Nordisk, supervised by Josefa Stoisser and Marc Boubnovski Martell, with co-supervision from Jialin Yu (University of Oxford). The BioAI team develops AI/LLM/agentic systems for drug discovery and has a publication track record in top-tier AI venues (NeurIPS, ACL, ICML, ICLR). The office is in King’s Cross, London. Remote work is possible, but 2–3 days per week on-site is preferred.
Prerequisite Skills Statistics, Mathematical analysis
Other skills used in the Project LLMs, Causal Inference
Acceptable Programming Languages Python
Additional Requirements -
Application Instructions Send your CV to the contact provided above along with a covering email which explains why you are interested in the project and why you think you would be a good fit.

 

Fragmented Order-book Content Assessment and Liquidity-weighting (FOCAL)

Project Title Fragmented Order-book Content Assessment and Liquidity-weighting (FOCAL)
Keywords FX Markets, limit order book, microprice
Project Listed 9 January 2026
Project Status Open
Application Deadline 27 February 2026
Project Supervisor Jan Novotny
Contact Name Jan Novotny
Contact Email jan.novotny@nomura.com
Company/Lab/Department Nomura International Plc
Address 1 Angel Ln, London EC4R 3AB
Project Duration 8-10 weeks full time
Project Open to Masters students (Part III)
Background Information Foreign exchange markets present a particularly compelling use case for this research, as they represent the world's largest and most liquid financial market with daily trading volumes exceeding $7 trillion. Unlike centralized exchange-traded assets, FX markets operate as a decentralized, over-the-counter network where liquidity is highly fragmented across numerous market makers, electronic communication networks (ECNs), and trading platforms. This fragmentation creates significant challenges for price discovery, as there is no single consolidated order book or official exchange rate at any given moment. The decentralized nature of FX trading means that different liquidity providers may quote varying prices simultaneously, making the aggregation and analysis of order book information both more complex and more valuable for understanding true market conditions. Given the enormous scale and fragmented structure of FX markets, developing robust methods to classify information content across multiple liquidity pools could yield substantial improvements in price discovery, execution quality, and market efficiency.
Project Description

Primary Objectives:

  • Develop a robust methodology for aggregating order books across multiple liquidity pools
  • Create an information content classification system that quantifies the predictive value of different order book levels
  • Design metrics to identify which price levels contain the most relevant information for mid-price determination
  • Build a real-time nowcasting model for current market prices based on aggregated order book data
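As a toy sketch of the aggregation step (invented quotes from three hypothetical pools, not real market data), per-pool quotes can be merged into a consolidated ladder and a size-weighted microprice computed at the top of book:

```python
from collections import defaultdict

# Hypothetical top-of-book quotes from three liquidity pools: (side, price, size).
quotes = [
    ("bid", 1.0850, 3_000_000), ("ask", 1.0852, 2_000_000),   # pool 1
    ("bid", 1.0851, 1_000_000), ("ask", 1.0853, 4_000_000),   # pool 2
    ("bid", 1.0850, 2_000_000), ("ask", 1.0852, 1_000_000),   # pool 3
]

# Consolidate: sum resting size at each price level per side.
book = {"bid": defaultdict(float), "ask": defaultdict(float)}
for side, price, size in quotes:
    book[side][price] += size

best_bid = max(book["bid"])
best_ask = min(book["ask"])
bid_size = book["bid"][best_bid]
ask_size = book["ask"][best_ask]

# Imbalance-weighted mid (microprice): the estimate is pulled toward the side
# with less resting size, anticipating the likely direction of the next move.
microprice = (best_bid * ask_size + best_ask * bid_size) / (bid_size + ask_size)
print(round(microprice, 6))
```

A full nowcasting model would extend this weighting scheme beyond the top of book, using the information content metrics below to score deeper levels.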

Secondary Objectives:

  • Analyze the relative importance of different liquidity pools in price discovery 
  • Investigate how information content varies across different market conditions (high/low volatility, different trading sessions)
  • Assess the temporal stability of information content metrics

Information Content Metrics:

  • Apply information theory measures (entropy, mutual information) to quantify price level importance
  • Implement machine learning techniques to identify patterns in order book informativeness
  • Develop weighted scoring systems based on historical price impact and predictive accuracy
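As an illustration of the information-theoretic step (synthetic data standing in for real order-book feeds), mutual information between a book feature and the next mid-price move can be estimated with a simple histogram; here a hypothetical level-1 imbalance carries signal while a deeper level is pure noise:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in data: level-1 imbalance partially predicts the next
# mid-price move; the level-5 feature here is noise by construction.
n = 50_000
imbalance_l1 = rng.uniform(-1, 1, n)
imbalance_l5 = rng.uniform(-1, 1, n)
next_move = np.sign(imbalance_l1 + rng.normal(0, 1.0, n))  # L1 carries signal

def mutual_information(x, y, bins=10):
    """Histogram estimate of I(X;Y) in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

mi_l1 = mutual_information(imbalance_l1, next_move)
mi_l5 = mutual_information(imbalance_l5, next_move)
print(mi_l1 > mi_l5)   # the informative level scores higher
```

The resulting MI scores per level are one natural input to the weighted scoring systems mentioned above.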

Validation Framework:

  • Backtest the classification system against historical price movements
  • Compare nowcasting accuracy against benchmark models
  • Conduct out-of-sample testing to ensure robustness

Applications

  • Algorithmic trading strategy optimization
  • Risk management and position sizing
  • Market making and liquidity provision strategies
References n/a
Work Environment The internship will be in person in the office (a hybrid model is possible); the candidate will work closely with the team.
Prerequisite Skills Statistics
Other skills used in the Project Predictive Modelling, Simulation
Acceptable Programming Languages Python, kdb+/q
Additional Requirements Enthusiasm to learn on a real case study
Application Instructions Send your CV to the contact provided above along with a covering email which explains why you are interested in the project and why you think you would be a good fit.

 

Option pricing with quantum information

Project Title Option pricing with quantum information
Keywords Option pricing, quantum information
Project Listed 9 January 2026
Project Status Open
Application Deadline 27 February 2026
Project Supervisor Paul McCloud
Contact Name Paul McCloud
Contact Email paul.mccloud@nomura.com
Company/Lab/Department Nomura
Address 1 Angel Lane, London EC4R 3AB
Project Duration 8 weeks
Project Open to Masters students (Part III), Third year undergraduates (Part II)
Background Information Nomura is a global financial services group with an integrated network spanning over 30 countries. The Quantitative Research team supports Global Markets businesses by developing mathematical models for the pricing and risk management of derivative trades, in close partnership with the trading desks. The role requires an exceptional level of technical quantitative skills, ideally backed up by mathematical research experience (not necessarily related to finance).
Project Description Option pricing is the most elementary challenge of derivative modelling and is the foundation for many of the solutions needed by a Global Markets structured products business. Traditional methods employ classical stochastic calculus, but this approach can struggle when applied with complex boundary conditions, which potentially limits the product offering of the business. This project explores numerical methods for option pricing established on noncommutative information, to see if the novel degrees of freedom this introduces can facilitate more efficient schemes or generate better convergence and fitting to options markets. Abstracted as a pure mathematical challenge, the project considers the application of results from noncommutative algebra to well-posed problems whose solutions can be mapped onto option pricing.
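For context, the classical stochastic-calculus baseline that any novel scheme would be benchmarked against can be sketched in a few lines: standard risk-neutral Monte Carlo for a European call under Black-Scholes dynamics, checked against the closed form (illustrative parameters, not a Nomura model):

```python
import numpy as np
from math import erf, log, sqrt, exp

# Illustrative European call parameters: spot, strike, rate, vol, maturity.
S0, K, r, sigma, T = 100.0, 100.0, 0.02, 0.2, 1.0
n_paths = 500_000

# Risk-neutral Monte Carlo: simulate terminal prices, discount the mean payoff.
rng = np.random.default_rng(0)
Z = rng.standard_normal(n_paths)
ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)
mc_price = np.exp(-r * T) * np.maximum(ST - K, 0.0).mean()

# Closed-form Black-Scholes for the same call, via the error function.
def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

d1 = (log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
d2 = d1 - sigma * sqrt(T)
bs_price = S0 * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

print(abs(mc_price - bs_price) < 0.1)   # MC agrees with the closed form
```

The project asks whether noncommutative degrees of freedom can yield schemes that converge faster, or fit options markets better, than baselines of this kind.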
References [1] McCloud, P. “Quantum bounds for option pricing” (2018) arxiv.org/abs/1712.01385
[2] McCloud, P. “Information and arbitrage: applications of quantum groups in mathematical finance” (2024) arxiv.org/abs/1711.07279
[3] McCloud, P. “The relative entropy of expectation and price” (2025) arxiv.org/abs/2502.08613
Work Environment You will research the project remotely, supported by a supervisor at Nomura and with occasional visits to the Nomura London office for progress updates.
Prerequisite Skills Mathematical Physics, Algebra / Number theory, Mathematical analysis
Other skills used in the Project Numerical Analysis, Partial Differential Equations, Probability / Markov Chains
Acceptable Programming Languages Python, MATLAB
Additional Requirements Curiosity and a willingness to apply ideas in novel contexts
Application Instructions Send your CV to the contact provided above along with a covering email which explains why you are interested in the project and why you think you would be a good fit.

 

Optimising the testing and selection process of cut flowers using historic performance and quality data

Project Title Optimising the testing and selection process of cut flowers using historic performance and quality data
Keywords Horticulture, varietal development, supply chain
Project Listed 9 January 2026
Project Status Open
Application Deadline 27 February 2026
Project Supervisors Lauren Hibbert and Richard Boyle
Contact Name Lauren Hibbert
Contact Email lauren.hibbert@apexhorticulture.com
Company/Lab/Department APEX Horticulture
Address Pierson Road, The Enterprise Campus, Alconbury Weald, PE284YA
Project Duration 8 weeks
Project Open to Masters students (Part III), Third year undergraduates (Part II)
Background Information

APEX Horticulture Ltd. is a professional research and development business, offering bespoke testing services for cut flowers and plants. APEX has three purpose-built testing centres in the UK and US. APEX is a division of the wider MM group, whose primary business, MM Flowers, is one of the UK’s leading cut flower importer/processing companies, with a vertically integrated ownership model and innovative practices. More recently, the MM group has diversified its activities, including supplying plants, bulbs and other gifting products to retailers in the UK and Europe. MM is owned by the AM Fresh Group, a leading breeder, grower and distributor of citrus and grapes; Vegpro, East Africa’s largest flower and vegetable producer; and Elite, based in South America and the leading flower grower globally.

APEX occupies an optimal position in the chain: it can deliver high-quality, independent research with close-to-market proximity, matched with invaluable insight into the true performance of flowers and plants subjected to actual supply chain conditions. The infrastructure and specialised personnel of APEX aim to deliver robust, standardised and consistent research every week of the year, together with the ability to undertake large-scale projects to match all client requirements, influencing all elements of the cut flower supply chain.

APEX undertakes many different research projects covering the entire supply chain, from development of new flower types through to the manufacturing requirements for the final bouquets. Each of these projects generates a significant amount of data and insight, which is used to provide recommendations to the various stakeholders of each project.

Project Description

APEX tests over 50k cut flower samples annually, with around 30-60 data points generated per sample. Whilst testing is often focussed on certain crop types, such as roses and lilies, many more types of flower are tested across many different projects. The data generated ranges from agronomic and freight data through to performance data associated with sample longevity (‘vase’/’shelf’ life) and aesthetic appeal. Several of the projects undertaken by APEX are long term with key strategic stakeholders, which allows for an assessment of flower performance and quality over many months and years. Each sample often has significant background information, including, for example, the type of flower, the growing location and agronomic practices, and the freight mode.

There are many influencing factors that can affect the above, such as weather conditions, freight delays and handling through the supply chain, which can often result in variability across a testing programme. Whilst APEX designs projects to try to account for this potential variation, there is a desire to use existing data to improve the efficiency and accuracy of the testing process. Selecting flower types and cultivars that do not meet the required standards can result in significant waste, consumer dissatisfaction and potentially brand damage, so having the best possible insight reduces this risk. This has implications across the supply chain, from the breeder/grower through to the suppliers and retailers. The central question is therefore: can existing datasets be used to build an appropriate model for assessing the viability of cut flowers (such as a new flower type, cultivar or treatment) more effectively and efficiently than the current process?
References
Work Environment The student will be part of a wider team but will lead the project. The working pattern can be hybrid (and largely remote).
Prerequisite Skills Statistics, Mathematical analysis
Other skills used in the Project Statistics
Acceptable Programming Languages No preference
Additional Information A desire to operate in a commercial business and to provide insights that can inform real-world decisions.
Application Instructions Send your CV to the contact provided above along with a covering email which explains why you are interested in the project and why you think you would be a good fit.

 

Agentic AI for Formalized Math

Project Title Agentic AI for Formalized Math
Keywords AI Lean 4 Agents LLM
Project Listed 9 January 2026
Project Status Open
Application Deadline 27 February 2026
Project Supervisor Nehal Patel and Charles Martinez
Contact Name Nehal Patel
Contact Email nehal.patel@gresearch.co.uk
Company/Lab/Department G-Research
Address 1 Soho Pl, London W1D 3BG
Project Duration Flexible, 8-12 weeks during the summer of 2026
Project Open to Masters students (Part III), Third year undergraduates (Part II), Second year undergraduates (Part IB)
Background Information AI agents and interactive theorem provers have the potential to change forever the way mathematics is done. This project provides students with a hands-on opportunity to learn and apply these tools in their area of research.
Project Description Students will formalize, in Lean 4, a topic of their choosing using agentic AI techniques. The initial toolset for AI theorem proving will be provided, and students will have the opportunity to help shape the improvement of these tools. Depending on the student's interest, work may focus primarily on formalization or may include working on the agentic theorem-proving framework. Caveats: not all branches of math are easy to model in Lean 4. Prior experience with Lean 4 is advisable. Prior experience with AI & LLMs is not required, but helpful. Prior experience with programming and a hacker ethos are also highly desirable.
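For a flavour of what formalization looks like, here is a tiny standalone Lean 4 example (core library only; the theorem name is invented for illustration):

```lean
-- Toy example: restating commutativity of addition on Nat.
-- An agentic prover would be asked to discover proofs like this automatically.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Real formalization projects target statements with no existing library proof, which is where the agentic tooling earns its keep.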
References

Introductions to Lean:
https://alexkontorovich.github.io/2025F311H/
https://adam.math.hhu.de/

Agentic Theorem Prover (One of Many):
Hilbert: Recursively Building Formal Proofs with Informal Reasoning
https://arxiv.org/pdf/2509.22819

Work Environment Work will be directed primarily by G-Research staff based in Boston. The student will work mostly independently and remotely, coordinating via GitHub and video meetings (with a meeting cadence that will adapt as the project progresses). Twice during the summer, Boston staff will be present in England and will arrange 1-3 day intensive sessions with the student for joint collaboration.
Prerequisite Skills Formal Math in Lean 4
Other skills used in the Project App Building
Acceptable Programming Languages Python, Lean 4
Additional Requirements Candidates should be prepared to propose some mathematics that they would like to formalize using AI tools in Lean. This could draw from their current research focus or their general interests. Topics from recreational math or applied topics are acceptable. Students are encouraged to investigate to what extent necessary background theories have already been formalized.
Application Instructions Send your CV to the contact provided above along with a covering email which explains why you are interested in the project and why you think you would be a good fit.

 

Correlation Between Forecast Accuracy, Stock Dwell Time and Retailer Waste on Customer Complaints: A Study of Yellow and White 40cm & 50cm Roses at MM Flowers

Project Title Correlation Between Forecast Accuracy, Stock Dwell Time and Retailer Waste on Customer Complaints: A Study of Yellow and White 40cm & 50cm Roses at MM Flowers
Keywords Forecast Accuracy; Stock Dwell Time; Retailer Waste; Customer Complaints; Statistical Analysis
Project Listed 16 January 2026
Project Status Open
Application Deadline 27 February 2026
Project Supervisor Ellanette van Zyl
Contact Name Ellanette van Zyl
Contact Email Ellanette.vanzyl@mm-flowers.com
Company/Lab/Department MM Flowers
Address Pierson Road, The Enterprise Campus, Alconbury Weald, Huntingdon, PE28 4YA
Project Duration 8 weeks, full-time 40 hours/week
Project Open to Masters students (Part III), Third year undergraduates (Part II)
Background Information

MM Flowers operates in a highly time-sensitive fresh flower supply chain, where product quality, availability and freshness are critical drivers of customer satisfaction. Even small inaccuracies in forecasting or delays in product movement can result in extended stock dwell time, increased waste at retailer level and ultimately higher customer complaint volumes. Roses, in particular, represent a high-volume and high-visibility product category where performance variability can have a significant commercial and reputational impact.

Forecast accuracy directly influences ordering decisions, inbound volumes and stock allocation. When actual arrivals deviate from forecasted volumes, this can lead to either stock shortages, impacting service levels, or excess stock levels, increasing dwell time and the risk of quality deterioration. Longer dwell times at MM Flowers or retail stores can accelerate deterioration, contribute to retailer waste and negatively affect the end consumer experience.

This project focuses specifically on yellow and white 40cm and 50cm roses, which are core SKUs within the MM Flowers portfolio and are particularly sensitive to demand variability and shelf-life constraints. By analysing the relationships between forecast accuracy, stock dwell time, retailer waste and customer complaints for these products, the project aims to identify whether measurable correlations exist across the supply chain.

The project is interesting and valuable because it connects operational planning decisions with downstream quality outcomes and customer feedback. Understanding these relationships will support more data-driven forecasting, stock management and waste-reduction strategies, while also providing insight into how supply chain performance ultimately affects customer satisfaction. The findings have the potential to inform targeted improvements for key rose lines and contribute to broader continuous improvement initiatives within MM Flowers.

Project Description

This project will investigate whether measurable correlations exist between forecast accuracy, stock dwell time, retailer waste and customer complaints for yellow and white 40cm and 50cm roses supplied by MM Flowers. The project is primarily data-driven and quantitative, making it well suited to a student with strengths in mathematics, statistics or data analysis.

The project will begin with a data familiarisation and definition phase, during which the student will work with historical supply chain data provided by MM Flowers. This will include forecast volumes, actual arrival quantities, stock dwell time metrics, retailer waste data and customer complaint records. The student will be responsible for defining and calculating appropriate performance measures, such as forecast accuracy metrics (e.g. absolute error or percentage error), dwell time distributions and waste rates.
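As a minimal illustration of the metric-definition phase (invented weekly figures, not MM Flowers data), absolute error and mean absolute percentage error can be computed directly:

```python
# Hypothetical weekly figures for one rose SKU (units: stems).
forecast = [12000, 15000, 11000, 14000]
actual   = [11500, 16200, 10800, 12900]

# Forecast accuracy per week: absolute error and absolute percentage error
# (percentage error is taken relative to actual arrivals).
abs_err = [abs(f - a) for f, a in zip(forecast, actual)]
ape     = [abs(f - a) / a for f, a in zip(forecast, actual)]
mape    = 100 * sum(ape) / len(ape)

print(abs_err)          # weekly absolute errors in stems
print(round(mape, 1))   # mean absolute percentage error, %
```

Dwell-time distributions and waste rates would be summarised in the same spirit before any correlation work begins.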

In the next phase, the student will apply statistical and mathematical techniques to explore relationships between variables. This may include:

  • Descriptive statistics to summarise trends and variability;
  • Correlation analysis to quantify the strength and direction of relationships between forecast accuracy, dwell time, waste and complaints;
  • Regression analysis to assess the relative impact of each variable on customer complaints;
  • Time-series or lag analysis to investigate delayed effects, such as whether longer dwell times or excess stock in one period lead to increased complaints in subsequent periods.
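A lagged correlation of the kind described in the last bullet could be sketched as follows (the weekly series are hypothetical; a real analysis would run on the actual MM Flowers data, likely with a library such as pandas):

```python
# Sketch of a lagged Pearson correlation between dwell time and
# complaints, using hand-written weekly series as stand-in data.
from statistics import mean, pstdev

def pearson(x, y):
    """Population Pearson correlation of two equal-length series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

def lagged_corr(x, y, lag):
    """Correlate x in week t with y in week t + lag."""
    return pearson(x[:len(x) - lag], y[lag:]) if lag else pearson(x, y)

dwell_days = [2, 3, 5, 4, 6, 3, 2, 5]   # average dwell time per week
complaints = [1, 1, 2, 4, 3, 5, 2, 1]   # complaints per week
for lag in range(3):
    print(lag, round(lagged_corr(dwell_days, complaints, lag), 2))
```

Scanning over several lags is one simple way to surface delayed effects, such as dwell time in one week driving complaints a week or two later.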

The project is open-ended in nature, allowing findings from the initial analysis to guide deeper investigation. For example, if strong correlations are identified for certain rose lengths or colours, the student may focus further analysis on those segments or explore threshold effects where performance begins to deteriorate significantly.

A successful outcome would be:

  • A clear, evidence-based assessment of whether and how forecast accuracy and dwell time influence retailer waste and customer complaints;
  • Identification of key drivers or risk indicators that are most strongly associated with complaints;
  • Practical, data-backed insights that MM Flowers could use to improve forecasting, reduce waste and enhance customer satisfaction for core rose lines.

The project is interesting and useful because it links mathematical analysis directly to real-world operational and commercial outcomes. Students will gain experience applying statistical methods to complex, imperfect industry data, while MM Flowers will benefit from improved understanding of how planning and stock decisions affect product quality and customer experience across the supply chain.

References https://mm-flowers.com
Work Environment

The student will work independently on the core analytical aspects of the project, with regular guidance and supervision from the project lead at MM Flowers. The project will be based within a business and operational environment rather than a laboratory, giving the student exposure to real-world supply chain data and decision-making contexts.

In addition to the primary supervisor, the student will have opportunities to engage with forecasting, supply chain planning and quality teams at MM Flowers, allowing them to discuss data definitions, operational processes and practical implications of their findings. While there is no formal academic research group on site, the student will be supported through regular check-ins and access to subject matter experts across the business.

Working hours will be flexible, aligned with standard office hours, and can be adjusted to accommodate academic commitments. The project can be conducted in a hybrid format, combining remote analytical work with occasional on-site days at MM Flowers when beneficial for data access, collaboration and project reviews.

Day-to-day work will primarily involve data analysis, modelling and interpretation, with time allocated for meetings, progress reviews and refinement of the analytical approach. The student will be encouraged to manage their own time, structure their analysis and propose next steps, mirroring the autonomy expected in both industry and academic research roles.

This working environment offers a balance of independent mathematical problem-solving and practical business engagement, providing a supportive setting for a mathematics student to apply theoretical skills to a real operational challenge.

Prerequisite Skills Statistics, Predictive Modelling, Data Visualisation, Database queries, Applied statistics, regression analysis, exploratory data analysis, and translating real-world problems into quantitative models.
Other skills used in the Project Predictive Modelling, Statistics, Data Visualisation, Database queries, Simulation, Applied statistical analysis, handling real-world operational data, basic programming skills (e.g. Python or R), critical interpretation of quantitative results, and ability to communicate findings clearly to non-technical stakeholders.
Acceptable Programming Languages Python, R, No preference, SQL
Additional Requirements We are looking for a student who is curious, analytical and motivated to apply mathematical skills to real-world problems. The ideal candidate will demonstrate enthusiasm for data-driven analysis and a willingness to engage with complex, imperfect datasets typical of an operational business environment. Strong problem-solving ability, attention to detail and critical thinking are important, along with a willingness to question assumptions and explore findings independently. The student should be comfortable working autonomously while also being open to feedback and discussion. Good communication skills are essential, as the project will require explaining quantitative findings clearly to non-technical stakeholders and translating mathematical results into practical business insights. An interest in supply chain, forecasting or data analytics in an applied setting would be advantageous, but not essential. Above all, we value a positive attitude, intellectual curiosity and a willingness to learn, as the project offers scope for the student to shape the direction of the analysis based on their findings.
Application Instructions Send your CV to the contact provided above along with a covering email which explains why you are interested in the project and why you think you would be a good fit.

 

Implied Volatility Surface Construction, Diagnostics, and Decomposition

Project Title Implied Volatility Surface Construction, Diagnostics, and Decomposition
Keywords Implied Volatility Fitting, Surface Dynamics, Options, Options Pricing, Simulation, PCA, Risk Scenario Generation
Project Listed 23 January 2026
Project Status Open
Application Deadline 27 February 2026
Project Supervisor Antonio Zarrillo and Silvia Stanescu
Contact Name Antonio Zarrillo
Contact Email antonio.zarrillo@emcore.ch
Company/Lab/Department Emcore Asset Management
Address Schochenmühlestrasse 6, 6340 Baar
Project Duration 10-12 weeks from June
Project Open to Third year undergraduates (Part II), Masters students (Part III), Second year undergraduates (Part IB)
Background Information Options are quoted on a discrete grid of strikes and maturities, but pricing and risk processes require a continuous implied volatility surface. For any strike and expiry, an observed option premium can be mapped (under an agreed pricing convention) to an implied volatility level; the collection of these points defines the market surface. In practice, quotes are noisy and incomplete, bid/ask spreads can be wide, and naive interpolation can produce unstable outputs or static arbitrage across strike or maturity. A robust surface workflow therefore enables consistent valuation inputs for backtesting and research, improves the reliability of derived sensitivities and signals, and reduces operational and model risk. Beyond fitting a surface on a given day, an important question is how the surface evolves through time. Empirically, a large share of surface variation is low-dimensional and can often be interpreted as level, skew, and curvature-type moves. PCA provides a transparent factor decomposition of these movements and can be used to build coherent stress scenarios and statistical scenario generation for risk measures.
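The premium-to-implied-volatility mapping described above can be sketched by numerically inverting the Black-Scholes price (a simplified convention with zero rates and dividends; the inputs are illustrative, not market quotes):

```python
# Minimal sketch: invert the Black-Scholes call price to an implied
# volatility by bisection, under the simplification r = q = 0.
from math import erf, exp, log, sqrt

def norm_cdf(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

def bs_call(S, K, T, sigma):
    """Black-Scholes call price with zero rate and dividend yield."""
    d1 = (log(S / K) + 0.5 * sigma**2 * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * norm_cdf(d2)

def implied_vol(price, S, K, T, lo=1e-6, hi=5.0, tol=1e-8):
    """Bisect on sigma; assumes price lies within no-arbitrage bounds."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, T, mid) < price:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

price = bs_call(100, 105, 0.5, 0.25)     # synthesise a quote at 25% vol
print(round(implied_vol(price, 100, 105, 0.5), 4))
```

Repeating this inversion over a chain of strikes and expiries produces the raw points from which the surface is fitted; bisection is used here for transparency, though a production workflow would typically use a faster root-finder.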
Project Description

The project will build an end-to-end workflow that ingests option-chain quotes, produces a stable implied volatility representation across strike and maturity, and generates diagnostics for quality and stability. The student will implement a robust construction method, and assess sensitivity to data quality, filtering, and weighting choices. A PCA decomposition will then be built from a historical time series of model outputs evaluated on a fixed strike–maturity grid.

Successful outcome:

  • Fitting implied volatility surfaces under different modelling assumptions;
  • Designing and evaluating quote-weighting schemes for surface fitting (e.g. bid/ask spread-, liquidity/volume/open-interest- and vega-based weights, plus robust loss functions to downweight outliers and stale quotes);
  • Running surface quality controls, including fit-to-bid/ask diagnostics and static arbitrage checks, with clear metrics and flagging;
  • Building a PCA decomposition from a historical time series of model outputs evaluated on a fixed strike–maturity grid, extracting interpretable factors such as level, skew and curvature-type moves, and producing factor time series;
  • Delivering a well-documented workflow and reproducible codebase.
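The PCA step in the fourth bullet might look like this in outline (the surface data here is synthetic, built from random level and skew moves; a real implementation would use historical fitted surfaces on the fixed grid):

```python
# Hedged sketch: PCA of daily implied-vol surfaces flattened onto a
# fixed strike-maturity grid, via eigendecomposition of the covariance.
import numpy as np

rng = np.random.default_rng(0)
n_days, n_nodes = 250, 40                 # grid nodes = strikes x maturities
level = rng.normal(0, 0.02, (n_days, 1))                  # parallel moves
skew = rng.normal(0, 0.01, (n_days, 1)) * np.linspace(-1, 1, n_nodes)
surfaces = 0.2 + level + skew             # synthetic surface time series

X = surfaces - surfaces.mean(axis=0)      # de-mean each grid node
cov = X.T @ X / (n_days - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = eigvals.argsort()[::-1]           # sort components by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()
factors = X @ eigvecs[:, :2]              # factor time series (level, skew)
print(np.round(explained[:2], 3))         # two factors dominate by construction
```

The leading eigenvectors are the candidate level/skew/curvature modes, and shocking the surface along them gives coherent stress scenarios.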
References [1] Carr, P., & Madan, D. (2005). A note on sufficient conditions for no arbitrage. Finance Research Letters.
[2] Cont, R., & da Fonseca, J. (2002). Dynamics of implied volatility surfaces. Quantitative Finance.
[3] Cont, R., & Vuletić, M. (2023). Simulation of arbitrage-free implied volatility surfaces. Applied Mathematical Finance, 30(2), 94-121.
[4] Gatheral, J. (2006). The Volatility Surface: A Practitioner's Guide. Wiley.
[5] Gatheral, J., & Jacquier, A. (2014). Arbitrage-free SVI volatility surfaces. Quantitative Finance.
[6] Skiadopoulos, G., Hodges, S., & Clewlow, L. (1999). The dynamics of the S&P 500 implied volatility surface. Review of Derivatives Research.
[7] Zeliade Systems (2009). Quasi-Explicit Calibration of Gatheral's SVI model. White Paper (ZWP-0005).
Work Environment The research project can be conducted in a hybrid format, with guidance from at least one team member.
Prerequisite Skills Statistics, Numerical Analysis, Partial Differential Equations, Mathematical analysis, Simulation, Database queries, Data Visualisation
Other skills used in the Project Probability / Markov Chains, Algebra / Number theory, Predictive Modelling
Acceptable Programming Languages Python
Additional Requirements Strong interest in learning from a practical case study
Application Instructions Send your CV to the contact provided above along with a covering email which explains why you are interested in the project and why you think you would be a good fit.

 

Discrete Representations of Multivariate Continuous Probability Distributions

Project Title Discrete Representations of Multivariate Continuous Probability Distributions
Keywords (Multivariate) Statistics, Probability/Markov Chains, Simulation, (Numerical) Linear Algebra
Project Listed 26 January 2026
Project Status Open
Application Deadline 27 February 2026
Project Supervisor TBC
Contact Name Dr Michael Selby
Contact Email careers@signaloid.com
Company/Lab/Department Signaloid
Address 4 Station Square, Cambridge, CB1 2GE
Project Duration 8 weeks, full time
Project Open to Masters students (Part III), Third year undergraduates (Part II)
Background Information Probability distributions provide a mathematical framework for understanding and modelling uncertainty, allowing us to quantify the likelihood of different outcomes in random processes. By characterising how data is distributed, they enable informed decision-making and are foundational to fields like statistics, machine learning, and risk assessment. Many of these distributions, such as the famous normal distribution (bell curve), are defined continuously, but in reality we need to represent these distributions with a finite number of discrete points so that we may perform statistical tasks quickly and efficiently on a computer.
Project Description In this project you will work on new discrete representations of probability distributions to try to uncover better ways to capture the shape and form of many theoretical and real-world distributions. First you will learn about distributions as rigorous mathematical objects and how to perform arithmetic on them. You will also learn how we quantify the "closeness" of distributions using distance metrics and criteria. Then, after researching and analysing existing methods for representing distributions discretely, you will conceive of new and improved methods, especially for high-dimensional distributions. Finally, you will test, verify and analyse the underlying numerical linear algebra of these methods, both analytically and numerically through simulations (in Python or a similar language).
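One classical discretisation — placing points at equal-probability quantiles of the target distribution — can be sketched as follows (the target distribution and point count are illustrative choices, not the methods under study at Signaloid):

```python
# Sketch of quantile-based discretisation of a continuous distribution,
# then a moment comparison against the continuous target.
from statistics import NormalDist

def quantile_points(dist, n):
    """n equally weighted points at the midpoints of n equal-probability bins."""
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]

target = NormalDist(mu=0.0, sigma=1.0)
points = quantile_points(target, 8)       # 8-point discrete support
mean = sum(points) / len(points)          # each point carries weight 1/n
var = sum(p * p for p in points) / len(points)
print(round(mean, 6), round(var, 4))      # mean matches; variance is understated
```

Even this simple scheme shows the central trade-off: the discrete representation reproduces the mean but systematically underestimates the variance, motivating the search for representations with better fidelity, especially in high dimensions.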
References https://signaloid.com/technology
Work Environment Join a remote team of industry mathematicians discussing probability theory and real world statistical problems. You will have the chance to talk with your supervisor multiple times per week and have them guide you through the project and oversee your progress.
Prerequisite Skills Statistics, Probability / Markov Chains, Simulation, (Numerical) Linear Algebra
Other skills used in the Project Mathematical analysis, Data Visualisation
Acceptable Programming Languages Python
Additional Requirements None
Application Instructions Send your CV to the contact provided above along with a covering email which explains why you are interested in the project and why you think you would be a good fit.

 

Exploring deep learning embeddings for chemical bioactivity prediction

Project Title Exploring deep learning embeddings for chemical bioactivity prediction
Keywords Deep learning, Embeddings, Representation learning, Autoencoders, Toxicology
Project Listed 26 January 2026
Project Status Open
Application Deadline 27 February 2026
Project Supervisor Patrik Engi and Hugh Barlow
Contact Name Patrik Engi
Contact Email patrik.engi@unilever.com
Company/Lab/Department Unilever SERS
Address Colworth Science Park, Sharnbrook, Bedford, MK44 1LQ
Project Duration 8-12 weeks
Project Open to Masters students (Part III), Third year undergraduates (Part II)
Background Information

In a fast-moving consumer goods environment, it is vital that safety assessments are conducted to ensure products are safe for humans and the environment. Historically, these assessments have required in vivo animal testing, so there is a pressing ethical and scientific need to develop non-animal methods to support product safety risk assessment. For more than 20 years, Unilever’s Safety, Environmental and Regulatory Science (SERS) Group has been developing novel in silico and in vitro methods, which leverage recent advances in biology, genetics, computing, mathematics and statistics, to conduct safety assessments without the use of animal testing [1, 2].

The current evolution of the risk assessment paradigm presents new opportunities to apply deep learning and AI-based approaches. A key part of risk assessment is to characterise the potential effects that a chemical may have on different cell types, which would typically involve using high-throughput transcriptomics (HTTr) to measure the transcriptional response of cells to different concentrations of a test chemical. Such data can be expensive to generate, particularly if it needs to be generated for multiple chemicals and cell types. Therefore, data-driven modelling which maximises the utility of all the available data is a high priority for the cost-effective implementation of non-animal approaches.

Many approaches in risk assessment of an unknown compound are based on determining the similarity to a known compound. Recent advances in deep learning have introduced powerful methods for generating embeddings, numerical representations that capture complex relationships between entities. By combining transcriptional and chemical information (such as structural representations like molecular fingerprints), these embeddings may provide valuable insights beyond traditional similarity metrics, even being able to predict responses for unseen chemicals [3, 4].

Project Description

Embeddings map complex (high-dimensional) data into a simplified (lower-dimensional) latent space while preserving semantic relationships, thereby enabling the discovery of relationships between datasets previously masked by high dimensionality. Internal studies of these methods for biochemical in vitro responses have yielded promising results, which we seek to apply further to relevant datasets.
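As a toy illustration of similarity in an embedding space (all vectors here are hand-written stand-ins; real embeddings would come from a trained model over transcriptional and chemical data, and the chemical names are hypothetical):

```python
# Toy embedding-based similarity ranking: find the nearest neighbour of a
# query chemical by cosine similarity in a shared latent space.
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# 4-dimensional embeddings for three hypothetical chemicals
embeddings = {
    "chem_A": [0.9, 0.1, 0.3, 0.0],
    "chem_B": [0.8, 0.2, 0.4, 0.1],   # similar profile to chem_A
    "chem_C": [0.0, 0.9, 0.1, 0.8],   # dissimilar profile
}

query = "chem_A"
ranked = sorted(
    (k for k in embeddings if k != query),
    key=lambda k: cosine(embeddings[query], embeddings[k]),
    reverse=True,
)
print(ranked)   # most similar chemical first
```

Read-across in risk assessment follows the same pattern at scale: an unknown compound is compared with characterised compounds, and its likely biological activity is inferred from its nearest neighbours in the learned space.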

For this project, the student(s) should start by familiarising themselves with the literature surrounding this topic, upskilling on the use of representation learning/embedding models, transcriptomics and toxicology, with support from SERS experts. With this foundation, we recommend that the student progress the work by choosing one or more of the following paths:

  1. Methods: We are interested in the further exploration of the literature to identify, implement and compare methods with those already covered, and in the application of these methods to internal datasets. The student(s) will not be limited to using pre-built models: any model capable of producing high-quality embeddings that can be used to measure biological similarity is considered a success.
  2. Evaluation & Interpretability: Assessing the predictive capability of different methods can be difficult due to the black-box nature of these deep learning-based approaches. Testing embedding-based approaches and ensuring the scientific justification of results is key to furthering the use of data-driven approaches in all scientific research. The student would therefore endeavour to assess the robustness of any developed methods within a framework that allows for scientific interpretability and understanding where possible.
  3. Optimisation: As deep learning models of this nature require large volumes of data, it is commonplace to use broad, public datasets that may differ from the specific target scenarios of interest. Therefore, further downstream modifications and optimisation should be applied to fine-tune the models. Methods such as transfer learning could be used to adapt the models to specific risk assessment problems, for example modelling differences in transcriptomics data methodologies (e.g. RNA-Seq vs Temp-O-Seq) or post-training optimisation based on internally labelled risk classifications.
References [1] J. Reynolds, S. Malcomber and A. White, “A Bayesian approach for inferring global points of departure from transcriptomics data,” Computational Toxicology, vol. 16, p. 100138, November 2020.
[2] T. E. Moxon, H. Li, M.-Y. Lee, P. Piechota, B. Nicol, J. Pickles, R. Pendlington, I. Sorrell and M. T. Baltazar, “Application of physiologically based kinetic (PBK) modelling in the next generation risk assessment of dermally applied consumer products,” Toxicology in Vitro, vol. 63, p. 104746, March 2020.
[3] B. Kang, R. Fan, M. Yi, C. Cui and Q. Cui, “A large-scale foundation model for bulk transcriptomes,” bioRxiv, 2025.
[4] Y. Donner, S. Kazmierczak and K. Fortney, “Drug Repurposing Using Deep Embeddings of Gene Expression Profiles,” Molecular Pharmaceutics, vol. 15, no. 10, pp. 4314–4325, 2018.
Work Environment The student will be following on from the work of two supervisors who will be available to support throughout the project. The student(s) should gain experience in deep learning, applied scientific computing and bioinformatics, while also being able to meet and collaborate with experts from a variety of both mathematical and other backgrounds. As the project is hosted at a site near Bedford, we expect the student will mostly be working remotely. However, we encourage attendance in-person where travel permits.
Prerequisite Skills Predictive Modelling, Data Visualisation, Deep learning
Other skills used in the Project Statistics, Probability / Markov Chains, Mathematical Physics, Mathematical analysis, Algebra / Number theory
Acceptable Programming Languages Python, R
Additional Requirements We are seeking a proactive student who brings curiosity, clear communication, and a genuine drive to grow. Experience with training and evaluating models in pytorch/tensorflow is preferable, though not essential.
Application Instructions Send your CV to the contact provided above along with a covering email which explains why you are interested in the project and why you think you would be a good fit.

 

Foundation models for cancer biology

Project Title Foundation models for cancer biology
Keywords neural networks, foundation models, life sciences, single cell transcriptomics
Project Listed 26 January 2026
Project Status Open
Application Deadline 6 March 2026
Project Supervisor Nicola Richmond and Sebastian Burgstaller-Muehlbacher
Contact Name Sebastian Burgstaller-Muehlbacher
Contact Email sebastian.burgstaller-muehlbacher@boehringer-ingelheim.com
Company/Lab/Department Boehringer Ingelheim Limited
Address 1 Pancras Sq, London N1C 4AG
Project Duration 8 weeks, full-time.
Project Open to Masters students (Part III), Third year undergraduates (Part II)
Background Information The Virtual Cell is an emerging concept which uses (mostly) foundation models (typically transformers) to model cell types and chemical or gene perturbations. This is enabled by the large-scale single cell transcriptomics (which genes are expressed in a cell) datasets now available to train such models. After training, a foundation model can be fine-tuned to execute certain tasks, e.g. cell type identification. However, cells do not live alone; they exist in a tissue and organ context. Thus, it is of high scientific interest to understand what the neighboring cell types are (the cell niche) and how a given cell communicates with its neighbors (cell-cell communication). This is particularly important when trying to understand the tumor microenvironment in cancer.
Project Description

Aims:
For this specific project, we are first looking into internalizing transcriptomics foundation models of our choice. Then, we are aiming to fine-tune them, using our internal, high-performance AI training infrastructure. Fine-tuning tasks shall be prediction of cell niches and cell-cell communication, both in a broader tissue context. To predict cell niches, we are interested in cell densities, cell type composition, source tissue and more. For cell-cell communication, we are interested in key signaling molecules influencing the different cell types in a niche.

Key Tasks: 

  • Set up the foundation models on our infrastructure.
  • Prepare training, validation and test datasets, e.g. explore different tokenization strategies, summary statistic...
  • Explore potential fine-tuning architecture (e.g. Multitask MLP, dedicated loss functions for our data of interest), and training paradigms for the different downstream tasks.
  • Fine-tune foundation model, to achieve state of the art performance.
  • Evaluate model performance against relevant benchmarks and baselines.
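The "multitask MLP" idea in the third bullet could be sketched as follows (the shapes, task names and random weights are purely illustrative assumptions, not Boehringer Ingelheim's architecture; a real head would be trained in PyTorch or JAX on top of the frozen foundation model):

```python
# Hedged sketch: one shared hidden layer over a foundation-model embedding,
# with separate heads for niche-type classification and a scalar
# cell-cell-communication score (hypothetical tasks and dimensions).
import numpy as np

rng = np.random.default_rng(0)
d_embed, d_hidden, n_niches = 64, 32, 5

W_shared = rng.normal(0, 0.1, (d_embed, d_hidden))
W_niche = rng.normal(0, 0.1, (d_hidden, n_niches))   # classification head
W_comm = rng.normal(0, 0.1, (d_hidden, 1))           # regression head

def forward(embedding):
    h = np.maximum(0, embedding @ W_shared)          # shared ReLU layer
    logits = h @ W_niche                             # niche-type logits
    niche_probs = np.exp(logits) / np.exp(logits).sum()   # softmax
    comm_score = (h @ W_comm).item()                 # scalar communication score
    return niche_probs, comm_score

cell_embedding = rng.normal(size=d_embed)            # stand-in for model output
probs, score = forward(cell_embedding)
print(probs.round(3), round(score, 3))
```

In training, each head would get its own loss (e.g. cross-entropy for niches, squared error for the communication score), combined into a weighted multitask objective.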

What summer students will learn:

  • Exposure to biomedical foundation models.
  • Exposure to large scale foundation model training/fine-tuning infrastructure.
  • Exposure to biomedical data landscape and cancer biology.
  • Get to know a cutting-edge, AI-first research setting in a large pharmaceutical enterprise.
  • New foundation model architecture design and model evaluations.
  • Coding skills, mostly Python (PyTorch, JAX).
  • Exposure to using coding assistants in a research environment.
  • Potential opportunity to publish in a scientific journal.
References

Single cell foundation model references:
Tahoe-x1: https://doi.org/10.1101/2025.10.23.683759
STATE: https://doi.org/10.1101/2025.06.26.661135
STACK: https://www.biorxiv.org/content/10.64898/2026.01.09.698608
VariantFormer: https://www.biorxiv.org/content/10.1101/2025.10.31.685862v1

Niche detection:
CellLetter: https://academic.oup.com/bib/article/26/6/bbaf693/8405044
NicheFormer: https://www.nature.com/articles/s41592-025-02814-z

Work Environment The students will be supervised by technical and domain subject-matter experts who have PhD-level education and post-doctoral research experience. The roles will be full-time and in the office for 3-5 days a week, depending on student preferences. We are located in a vibrant part of London with easy access to public transportation to and from Cambridge.
Prerequisite Skills Statistics, Image processing, Geometry / Topology, Data Visualisation, Database queries
Other skills used in the Project Statistics, Image processing, Geometry / Topology, Data Visualisation, Database queries, App Building
Acceptable Programming Languages Python, Pytorch, JAX
Additional Requirements

Required/useful skills:

  • Basic Python skills (PyTorch, JAX would be a plus).
  • Basic understanding of neural network architectures.
  • Interest in applying AI to problems in biology.
  • Interest in machine learning and statistics.
  • Interest in linear algebra and optimization.
  • Fits students who would like to follow their own ideas and also those who like a more engineering focus.
  • Enthusiasm for foundation models and an interest in biomedical research questions.
Application Instructions Send your CV to the contact provided above along with a covering email which explains why you are interested in the project and why you think you would be a good fit.