
Summer Research Programmes

 

This is a list of CMP industrial project proposals from summer 2019.

Understanding cut rose performance through long term quality monitoring and analysis

Contact Name Richard Boyle
Contact Email richard.boyle@mmflowers.co.uk
Company Name MM Flowers Ltd
Address Pierson Road The Enterprise Campus, Alconbury Weald, PE28 4YA
Period of the Project 8 weeks
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest  
Brief Description of the Project MM Flowers, established 11 years ago, is the UK's leading, integrated cut flower supplier, with a unique ownership model and innovative practices. MM Flowers is owned by the Munoz Group, a leading breeder, grower and distributor of citrus and grapes; Vegpro, East Africa's largest flower and vegetable producer; and Elite, the leading flower grower and breeder in South America. MM Flowers supplies many of the major high street retail brands, whether in store estate or directly to consumers. The UK cut flower industry can be challenging, where customers expect high quality flowers at competitive prices. The vast majority of species utilized are highly perishable, short life products, which are transported from many different regions around the world. Pre- and post-harvest management, logistics and environmental control are all factors that can positively or negatively impact upon flower quality. MM Flowers receives circa 400 million stems of cut flowers annually across 60 different species and 2000 individual product groups. There is a dramatic increase during periods such as Valentine's and Mother's Day. To ensure the quality of product is delivered successfully and is of the required standards for bouquet production, a dedicated quality control team undertake daily inspections of the flowers received, whilst MM has further developed its own dedicated R&D business, APEX Horticulture, to provide solutions to maintain or enhance flower quality. Through APEX, MM has established large and detailed data sets on the quality and performance of many key flowers, including roses and lilies. The rose data set alone is comprised of over 500,000 data points, from grower information through to quality and performance attributes. This data is typically used by the business in regular feedback to farms, to inform decisions on varietal selections and to provide baseline data for specific projects, for example. 
These data sets present an opportunity to undertake more detailed analysis of long term trends, and how various factors influence the end consumer quality and performance. In addition, there is the possibility to develop a process to allow for future data to be incorporated and analysed more efficiently, allowing for quicker and more accurate decision making. Further to this, the student can expect to gain valuable experience working within a fast-paced business in the fresh produce sector. This includes liaising with different departments, project management, communication skills, and working towards the needs of the business.
Skills Required Strong computer skills; experience with statistics and modelling; clear communicator; self-motivated; demonstrates initiative; project management
Skills Desired  

 

Mathematical Finance in the Energy Sector

Contact Name Lee Momtahan
Contact Email lee.momtahan@centrica.com
Company Name Centrica Energy Marketing and Trading
Address 2nd Floor, Park House, 116 Park Street, London W1K 6AF
Period of the Project 8 weeks between late June and 30 September
Project Open to Part III students, PhD students
Deadline to Register Interest February 22
Brief Description of the Project The project is open-ended. The following are examples of projects that we are currently working on, although these may have changed by June next year. 1. Optimising a portfolio of Liquefied Natural Gas contracts that contain optionality, together with ship scheduling. 2. Predicting hourly power (electricity) prices from the base-load and peak-load power prices, as well as other factors such as the power-coal and power-gas spreads.
Skills Required Optimisation, Statistics
Skills Desired Mathematical Finance, Stochastic Calculus, Numerical Analysis, Python Programming

 

Too Many Traders Spoil the Return: How to Identify Crowded Strategies and Trades

Contact Name Charlotte Grant
Contact Email charlotte.grant@oxam.com
Company Name Oxford Asset Management
Address OxAM House 6 George Street
Period of the Project 8 weeks
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest 22 February
Brief Description of the Project When too many traders have exposure to similar positions, those positions become crowded. Such positions may have a lower expected return or higher expected risk, particularly when a few large players dominate a stock's liquidity or wish to exit a position at a similar time. The project will involve using market-level transaction data to identify the temporal dynamics of crowded positions by studying intra-stock correlations on various time scales. If time allows, we will augment this analysis with metrics derived from other financial datasets.
Skills Required Python programming. Experience and interest in statistics and probability.
Skills Desired  

 

Numerical Modelling of Oxidation Process in Semiconductor Devices

Contact Name Vasily Suvorov
Contact Email vasily.suvorov@silvaco.com
Company Name Silvaco Europe, UK
Address Compass Point, St Ives, Cambridgeshire, PE27 5JL Phone: +44 (0) 1480 484400
Period of the Project 1 July - 1 September, flexible
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest  
Brief Description of the Project The fabrication of integrated circuit microelectronic structures and devices depends vitally on the thermal oxidation of silicon and other materials for the formation of gate dielectrics, device isolation regions, spacer regions, and ion implantation mask regions. In particular, precise control of the silicon dioxide thickness is important as device geometries continue to scale. The project is well defined: its aims are to analyze and improve the existing mathematical models of the process and the numerical algorithms implemented in our company. In particular, the aim is to explore various rheological models (e.g. viscous, viscoelastic) and the non-linear effects that these models present. The prospective student will assess the convergence of the existing numerical scheme and work on improving the convergence rate. This project combines both theory (algebraic derivation) and computation (coding the algorithm). The student is also expected to review the literature and give feedback on the latest scientific developments in this area. References: Journal of Applied Physics, Vol. 36, p. 3770 (1965); SIAM J. Appl. Math., 77(6), pp. 2012-2039.
Skills Required Linear Algebra, Real Analysis, some experience and interest in programming
Skills Desired Numerical Analysis

 

Signal analysis and automated fault detection for electric motors

Contact Name William Boulton
Contact Email will.boulton@faradaypredictive.com
Company Name Faraday Predictive
Address St Johns Innovation Centre, Cowley Road, Cambridge, CB4 0WS
Period of the Project 8 weeks between 1 July and 27 Sept
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest 22 February
Brief Description of the Project Rotating equipment, such as fans, pumps and compressors, is vital for many parts of today's industrial processes and infrastructure. We all rely on the smooth running of pumps to provide clean water or remove sewage and on compressors, e.g. for air conditioning, and in industry almost every process is driven or controlled by a rotating machine. Faraday Predictive is a small company which has a novel method of sensing faults in this rotating equipment, using the voltage and current drawn by the motor driving the equipment to derive a "residual current": the deviations of the current from an idealised linear voltage/current model of the motor. We then apply a fast Fourier transform to this residual current and look at peaks in the resulting spectrum corresponding to known fault frequencies. You will be working with other students from the engineering and computer science departments. In this project, we would like up to two students to investigate new methods of fault detection on our test equipment. We have a rig of 8 cheap (and destructible) electric motors, as well as an archive of measurements from these motors, and you would be working together to come up with a series of experiments to collect data, invent your own methods of fault detection (covering common faults such as misalignment, imbalance, and bearing deterioration), and benchmark their performance against our existing system, which in essence is no more sophisticated than applying a look-up table of fault frequencies to the Fourier spectrum of the motor current, a method which we know has a number of shortcomings. We think that the mathematical processes in this project can be broken into two areas. The first covers the signal processing side, including investigation of methods such as spectrograms, the Hilbert transform, and wavelet transformations, to display time/frequency data about motor current in the most economical way possible.
The second would use these inputs to extract and track features in this data corresponding to faults, probably using machine learning and statistical techniques. However, it would be up to you to decide whether to approach the problem holistically or to divide tasks between yourselves. Whilst we have a large archive of data, we are limited to experimenting on only a handful of different motors, so we think that good experimental design and some skilful analysis of the data will be required, rather than simply chucking as much data as possible into a neural network (though this may turn out to be a perfectly viable option). The output of this project would be well-documented code that can be integrated into our next generation product.
Skills Required

Python programming experience

Good communication skills with non-mathematicians

Some ingenuity!

Skills Desired Experience in signal processing, e.g. wavelet transform / Hilbert transform; experience using machine learning libraries in Python; finalist preferable
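
As a rough illustration of the baseline approach described in the project above (a linear voltage/current model, a residual current, a Fourier spectrum, and a look-up table of fault frequencies), the following Python sketch uses only the standard library. The synthetic signals, the single-gain least-squares motor model, the 50 Hz supply, the threshold and the fault frequency table are all assumptions made for illustration; they are not Faraday Predictive's actual model or data.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def residual_current(voltage, current):
    """Fit current ~ gain * voltage by least squares and return the deviations.

    A single scalar gain is a deliberately crude stand-in for an idealised
    linear motor model (assumption).
    """
    gain = sum(v * i for v, i in zip(voltage, current)) / sum(v * v for v in voltage)
    return [i - gain * v for v, i in zip(voltage, current)]


def detect_faults(residual, sample_rate, fault_table, threshold):
    """Return the names of faults whose table frequency shows a spectral peak."""
    n = len(residual)
    amplitude = [abs(c) * 2 / n for c in fft(residual)]
    found = []
    for name, freq_hz in fault_table.items():
        k = round(freq_hz * n / sample_rate)  # nearest frequency bin
        if amplitude[k] > threshold:
            found.append(name)
    return found


# Synthetic one-second record: 50 Hz supply plus a 120 Hz disturbance.
SAMPLE_RATE = 1024
N = 1024
t = [k / SAMPLE_RATE for k in range(N)]
voltage = [230 * math.sin(2 * math.pi * 50 * s) for s in t]
current = [0.02 * v + 0.5 * math.sin(2 * math.pi * 120 * s) for v, s in zip(voltage, t)]

# Hypothetical look-up table of fault frequencies in Hz.
FAULT_TABLE = {"imbalance": 25.0, "misalignment": 120.0, "bearing": 310.0}

faults = detect_faults(residual_current(voltage, current), SAMPLE_RATE, FAULT_TABLE, threshold=0.1)
```

Because the look-up only inspects fixed frequency bins, a fault whose frequency drifts with motor speed or load would be missed, which is one way to see why time/frequency methods are of interest.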

 

Uncertainty quantification in dynamical systems

Contact Name Philip Jonathan
Contact Email philip.jonathan@shell.com
Company Name Shell
Address York Road, London
Period of the Project 8 weeks in period June-Sept 2019
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest 22 February 2019
Brief Description of the Project 1. Consider physical systems described by differential equations with increasing complexity, including: (a) Simple Newtonian projectile trajectory; (b) Fluid loading on a column of water; (c) Advection-diffusion (e.g. atmospheric transport, groundwater transport); (d) Complex structural response to fluid loading (e.g. fixed and floating structures); (e) Possibly chaotic systems (El Niño attractor model). 2. Objectives of the project: (a) Review the literature on methods to propagate uncertainty in dynamical systems (emphasis on Bayesian inference); (b) Develop algorithms in MATLAB for propagation / simulation of solution uncertainty based on specified initial / boundary condition and parameter uncertainty; (c) Sample / approximate the corresponding posterior distributions given observations made at known points in space/time. The student would be based part-time at Shell in London and/or Amsterdam, working with statistical modellers with experience of UQ for physical systems. We would be delighted to discuss the project with interested students by Skype. Please contact us by email.
Skills Required Interest in computational statistics and Bayesian inference, programming in MATLAB, modelling the physical environment
Skills Desired  
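
To make the idea of propagating parameter uncertainty concrete, here is a minimal Monte Carlo sketch for the simplest system listed above, a drag-free Newtonian projectile. It is written in Python rather than the MATLAB named in the proposal, and the Gaussian input distributions, their means and standard deviations, and the function names are all assumptions chosen for illustration.

```python
import math
import random
import statistics

G = 9.81  # gravitational acceleration in m/s^2


def projectile_range(speed, angle_rad):
    """Horizontal range of a drag-free projectile launched from ground level."""
    return speed ** 2 * math.sin(2.0 * angle_rad) / G


def propagate(n_samples, speed_mean, speed_sd, angle_mean_deg, angle_sd_deg, seed=0):
    """Sample uncertain launch conditions and push each sample through the model."""
    rng = random.Random(seed)
    ranges = []
    for _ in range(n_samples):
        speed = rng.gauss(speed_mean, speed_sd)
        angle = math.radians(rng.gauss(angle_mean_deg, angle_sd_deg))
        ranges.append(projectile_range(speed, angle))
    return ranges


# Assumed uncertainty: speed ~ N(20, 1) m/s, launch angle ~ N(45, 2) degrees.
samples = propagate(20000, 20.0, 1.0, 45.0, 2.0)
mean_range = statistics.fmean(samples)
sd_range = statistics.stdev(samples)
```

The same sample-and-simulate loop applies when the model has no closed form and must be integrated numerically; the cost per sample is then the main constraint, which is what motivates the approximation methods mentioned in the objectives.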

 

Applied Mathematics for Trading in the Energy Markets

Contact Name Adrien GRANGE-CABANE
Contact Email adrien.grange-cabane@uk.bp.com
Company Name BP
Address 20 Canada Square, E14 5NJ London
Period of the Project 8 to 10 weeks
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest February 22, 2019
Brief Description of the Project Energy derivatives pricing and risk management rely heavily on numerical methods, price modelling and supply-demand forecasting. Recent research articles have shown that solutions to these problems can be improved by using deep learning techniques. For example, model calibration can be performed significantly faster using Artificial Neural Networks. Deep learning can also be used to construct control variates and reduce the variance of Monte Carlo estimators. Another topic of interest is wind energy modelling. Uncertainties in the generated volume and the corresponding power price must be jointly modelled in order to find the optimal execution strategy for hedging a wind position. The objective of this internship project is to implement the models or methods from recent publications and then to compare the results with more standard approaches.
Skills Required Probabilities, calculus and stochastic calculus, programming, problem solving skills, optimisation
Skills Desired  
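
The control variate idea mentioned in the project above can be sketched with a textbook example. The code below prices a European call under an assumed lognormal (Black-Scholes) model and uses the discounted terminal price, whose expectation is known exactly, as the control variate. This is a classical analytic control variate, not the learned one the proposal refers to, and the model, parameter values and function names are assumptions for illustration only.

```python
import math
import random


def black_scholes_call(s0, strike, rate, vol, maturity):
    """Closed-form Black-Scholes call price, used here only as a reference value."""
    d1 = (math.log(s0 / strike) + (rate + 0.5 * vol ** 2) * maturity) / (vol * math.sqrt(maturity))
    d2 = d1 - vol * math.sqrt(maturity)

    def cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return s0 * cdf(d1) - strike * math.exp(-rate * maturity) * cdf(d2)


def sample_variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


def mc_call_with_control_variate(s0, strike, rate, vol, maturity, n_paths, seed=0):
    """Monte Carlo call price with and without a control variate."""
    rng = random.Random(seed)
    disc = math.exp(-rate * maturity)
    drift = (rate - 0.5 * vol ** 2) * maturity
    scale = vol * math.sqrt(maturity)
    payoffs, controls = [], []
    for _ in range(n_paths):
        s_t = s0 * math.exp(drift + scale * rng.gauss(0.0, 1.0))
        payoffs.append(disc * max(s_t - strike, 0.0))
        controls.append(disc * s_t)  # risk-neutral expectation is exactly s0
    mean_p = sum(payoffs) / n_paths
    mean_c = sum(controls) / n_paths
    cov = sum((p - mean_p) * (c - mean_c) for p, c in zip(payoffs, controls)) / (n_paths - 1)
    beta = cov / sample_variance(controls)  # variance-minimising coefficient
    adjusted = [p - beta * (c - s0) for p, c in zip(payoffs, controls)]
    return {
        "plain_estimate": mean_p,
        "cv_estimate": sum(adjusted) / n_paths,
        "plain_variance": sample_variance(payoffs),
        "cv_variance": sample_variance(adjusted),
    }


result = mc_call_with_control_variate(100.0, 100.0, 0.02, 0.3, 1.0, 20000)
reference = black_scholes_call(100.0, 100.0, 0.02, 0.3, 1.0)
```

A learned control variate would replace `controls` with the output of a fitted function whose expectation is known or cheap to compute; the adjustment step stays the same.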

 

Algorithm development for security applications.

Contact Name Samuel Pollock
Contact Email careers@iconal.com
Company Name Iconal Technology
Address St Johns Innovation Centre Cowley Road, Cambridge, CB4 0WS
Period of the Project At least 8 weeks
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest Accepting CVs up to end of March
Brief Description of the Project We are looking at the application of novel classification techniques to analyse sensor data for security applications, in order to automate the detection of forbidden or dangerous items. This may include exploration of deep learning techniques.
Skills Required Good familiarity with Python or similar. Comfortable with automated analysis of large volumes of data.
Skills Desired Image analysis techniques. Machine learning approaches and solutions. Interest in application of science & technology to solve problems in industry.

 

Measuring the impact of prior distributions on pharmacological model parameters

Contact Name Dr Fabio Rigat
Contact Email fabio.x.rigat@gsk.com
Company Name GlaxoSmithKline R & D
Address Gunnels Wood Rd, Stevenage SG1 2NY
Period of the Project flexible
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest February 22nd is perfect for optimal planning
Brief Description of the Project Background: Clinical pharmacology studies are a cornerstone for safety and efficacy evaluations in medicines development. To inform these assessments, physiological models describe the temporal evolution in the concentration of investigational compounds and of the molecules they interact with in relevant body compartments (e.g. plasma, organ systems). These concentrations are modelled using systems of ordinary differential equations (ODEs), which may not admit closed form solutions. Estimation of ODE parameters from experimental data and physiological knowledge typically relies on maximisation of the likelihood function using appropriate analytical and numerical methods. However, these estimators are not robust when sample sizes are small, which is often the case in early clinical development studies. A possible solution here is to use soft constraints to decrease the probability that estimates lie outside of known ranges (penalised maximum likelihood). When these constraints are integrable functions of the ODE parameters, they can represent prior probability distributions summarising knowledge about the value of physiological parameters that pre-dates the experimental data being analysed. In this case, formal Bayesian estimation is appropriate, potentially leading to more robust estimates of the ODE parameters for moderate sample sizes. Project goals: This project focuses on assessing the impact of prior distributions defined on physiological model parameters in early development clinical studies. Prior impact will be defined using study design characteristics (true and false positive probabilities, distance metrics between true and estimated parameter values) compared to the corresponding ML estimates.
The focus and goal of this project is to measure the prior impact specifically on the model parameters defining the posterior predictive probability distributions of engagement of therapeutic monoclonal antibodies to their targets measured in plasma. Operating characteristics will be estimated by data simulation and Bayesian inference based on existing computer code in R and WinBUGS. Requirements for success: Enthusiasm about the project goals, basic knowledge of statistical inference, ODEs and programming skills, excellent time management skills, focus on delivery of a final report and internal presentation of the concepts and tools developed, availability to interact on a regular basis with GSK supervisors based in Stevenage (face to face and/or on-line). References [1] Bayesian Analysis of Population PK/PD Models: General Concepts and Software https://link.springer.com/content/pdf/10.1023/A:1020206907668.pdf [2] A survey of population analysis methods and software for complex pharmacokinetic and pharmacodynamic models with examples https://link.springer.com/content/pdf/10.1208/aapsj0901007.pdf [3] An application of Bayesian population pharmacokinetic/pharmacodynamic models to dose recommendation https://onlinelibrary.wiley.com/doi/pdf/10.1002/sim.4780140917 [4] A population pharmacokinetic model for docetaxel (Taxotere®): Model building and validation https://link.springer.com/content/pdf/10.1007/BF02353487.pdf
Skills Required basic knowledge of statistical inference, ODEs and programming skills
Skills Desired Enthusiasm about the project goals, basic knowledge of statistical inference, ODEs and programming skills, excellent time management skills, focus on delivery of a final report and internal presentation of the concepts and tools developed, availability to interact on a regular basis with GSK supervisors based in Stevenage (face to face and/or on-line).
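
The contrast between maximum likelihood and penalised (prior-informed) estimation described in the project above can be sketched on a toy problem. The example assumes a one-compartment elimination model dC/dt = -kC, which has the closed form C(t) = C0 exp(-kt), Gaussian measurement noise, and a log-normal prior on k; it uses a grid search in Python instead of the R and WinBUGS code the proposal mentions. The model, the prior, and every numerical value are hypothetical.

```python
import math
import random


def concentration(t, k, c0=10.0):
    """Closed-form solution of dC/dt = -k * C with C(0) = c0."""
    return c0 * math.exp(-k * t)


def neg_log_lik(k, times, obs, sigma):
    """Gaussian negative log-likelihood, up to an additive constant."""
    return sum((y - concentration(t, k)) ** 2 for t, y in zip(times, obs)) / (2.0 * sigma ** 2)


def neg_log_prior(k, mean_log, sd_log):
    """Negative log density of a log-normal prior on k, up to a constant."""
    z = (math.log(k) - mean_log) / sd_log
    return 0.5 * z * z + math.log(k)


def estimate(times, obs, sigma, prior=None):
    """Grid-search estimate of k: ML if prior is None, otherwise penalised ML (MAP)."""
    grid = [0.001 * j for j in range(1, 3001)]

    def objective(k):
        value = neg_log_lik(k, times, obs, sigma)
        if prior is not None:
            value += neg_log_prior(k, *prior)
        return value

    return min(grid, key=objective)


# Small simulated study: three noisy samples, true k = 0.3 (all values assumed).
rng = random.Random(1)
TIMES = [1.0, 2.0, 4.0]
SIGMA = 1.0
observed = [concentration(t, 0.3) + rng.gauss(0.0, SIGMA) for t in TIMES]

k_ml = estimate(TIMES, observed, SIGMA)
k_map = estimate(TIMES, observed, SIGMA, prior=(math.log(0.3), 0.2))
```

Repeating the simulation many times and comparing the spread of `k_ml` and `k_map` around the true value gives one simple operating characteristic of the kind the project sets out to measure.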

 

Graph-Based Machine Learning for Biological Network Data

Contact Name Finnian Firth
Contact Email finnian.x.firth@gsk.com
Company Name GlaxoSmithKline
Address GSK, Gunnels Wood Road, Stevenage, SG1 2NY
Period of the Project 8-10 weeks, start date negotiable
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest March 1
Brief Description of the Project Machine learning on networks is a rapidly-growing focus of study, and applications in biology are widespread. For example, there have been many attempts to adapt deep learning strategies from other areas (e.g. imaging) to the network space, with some success. This project looks to utilize mathematical properties of biological networks to create powerful & informative analysis pipelines for metabolomics and genomics data. The student will explore the emerging landscape of tools, and help to design graph-based methodologies tailored for large biological data sets. This project is relatively open-ended, with a balance of theory and application likely to be involved.
Skills Required Familiarity with graph theory; familiarity with probability/statistics; some experience with a scientific programming language; ability to work in a cross-functional team
Skills Desired Proficiency in Python or R; familiarity with machine learning principles; basic experience working in a Linux environment

 

Characterisation of beating heart cell oscillations

Contact Name Denise Vlachou / Will Stebbeds
Contact Email denise.f.vlachou@gsk.com
Company Name GSK
Address Gunnels Wood Road Stevenage Herts SG1 2NY
Period of the Project 8-12 weeks
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest February 22
Brief Description of the Project Potential toxic side effects of drugs on human heart function could be found by interpreting the function of beating heart cells in a dish. These ECG-like oscillatory profiles have complex shapes, which create the need for complex descriptors. We have functional data for an array of commercial drugs with known cardiotoxic effects, and we would like a student to research and design the best way to describe these oscillations so that cardiotoxicity might be predicted from the oscillatory signals. Initial methods for signal processing could include Fourier analysis/sine + cosine series, non-parametric algorithms, or modelling with differential equations. Predictions based on the resulting parameters would then involve designing a probabilistic model or using machine learning methods. If the student is successful in designing the predictive model, they would then be supported in embedding their code into a user interface.
Skills Required Familiarity with coding; Mathematical modelling; Excellent communication; Knowledge/interest in Biology and medicine
Skills Desired Proficient coder in Python or R; Basic machine learning; Experience with signal processing

 

Large-scale population assessment of physical activity patterns in Rheumatoid Arthritis using actigraphy data from the UK Biobank

Contact Name Valentin Hamy
Contact Email valentin.x.hamy@gsk.com
Company Name GSK, R&D Biostatistics
Address Gunnels Wood Road Stevenage SG1 2NY
Period of the Project 8-10 weeks, starting 01-Jul-2019
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest February 22
Brief Description of the Project The current rapid evolution of wearable sensors and devices for the collection of health-related data is laying the foundation for the next revolution in clinical drug development. Wearable health monitors offer capabilities to collect semi-continuous, accurate physiological and activity data in near-real time. This emerging digital research platform has the potential not only to increase data accuracy and timeliness but, most importantly, to enable the collection of 'real-world' data, providing insights into the effect of therapies on patients' daily lives, ultimately allowing pharmaceutical companies to explain the value of their medications beyond traditional efficacy measurements. At GSK we are investigating the use of wearables in our clinical studies, with specific focus on actigraphy (remote monitoring of physical activity through inertial sensors). Diseases such as Rheumatoid Arthritis (RA) have a negative effect on physical activity, affecting the amount, type and way that patients perform certain activities and manoeuvres. Using wearable sensors in clinical trials enables us to monitor patients' physical activity and rest cycles regularly between clinical visits. However, physical activity has not been objectively characterised in large patient populations to reliably describe specific activity features that might discriminate healthy people from those suffering from a specific medical condition. This makes it challenging to identify measures that are sufficiently sensitive to accurately quantify the impact of our medicines in targeted patient populations. In this project we aim to identify differences in physical activity patterns, at population level, between healthy individuals and patients suffering from RA.
To enable this investigation, we will use actigraphy data collected in the UK Biobank Study (data from approximately 100,000 subjects is available [1]) and will utilise a data-driven approach to characterise activity patterns. By means of carefully trained Machine Learning or Deep Learning algorithms such as Long Short-Term Memory Recurrent Neural Networks (LSTM RNN), specific physical activity features that are likely to be affected by RA may be extracted. Another possible avenue for data exploration would consist of applying a Hidden Markov Model (HMM) approach such as that presented in [2] to derive circadian rhythm parameters from actigraphy data and further formalise the effect of RA on patients' daily activity patterns through comparison with other clinical parameters (e.g. from physical assessment or disease severity measures). This is an open-ended project, in which the student would be required to work closely with colleagues from GSK R&D Biostatistics with various backgrounds and skill sets, including wearable sensor technology experts, computer/data scientists, and clinical statisticians currently working on data analysis from physical activity monitoring in clinical trials. The following steps are suggested as a provisional work plan; nevertheless, these are likely to be redefined during the conduct phase of the project and the associated timeline is flexible: - Preliminary exploration to derive an overall understanding of the data for the RA population in UKB (prevalence in the whole databank is 1.13% by self-report and 0.57% by medication [3]): 1.5-2 weeks - Select the most relevant mathematical approach and design an analysis plan: 1.5 weeks - Generate and validate mathematical model(s) for data analysis: 4-5 weeks - Report findings (written report, presentation to team): 1-1.5 weeks [1] Doherty, Aiden, et al. "Large scale population assessment of physical activity using wrist worn accelerometers: The UK Biobank Study." PloS one 12.2 (2017): e0169649.
[2] Huang, Qi, et al. "Hidden Markov models for monitoring circadian rhythmicity in telemetric activity data." Journal of The Royal Society Interface 15.139 (2018): 20170885 [3] Siebert, S., et al. Characteristics of rheumatoid arthritis and its association with major comorbid conditions: cross-sectional study of 502 649 UK Biobank participants. RMD open, 2(1) (2016): e000267
Skills Required Strong knowledge of programming for data science in Python (preferred), R or equivalent; excellent organisation skills, in particular with respect to data analysis; interest in solving real-world problems; excellent interpersonal and communication skills
Skills Desired Specialised knowledge of advanced data analytics tools such as signal processing, machine learning and artificial intelligence; data mining experience in the context of large time series and big datasets (data curation and analysis); preferably, demonstrated experience in analysing healthcare-related data sets

 

Automation of data production programming

Contact Name Paul Clarke
Contact Email paul.clarke@phe.gov.uk
Company Name Health Data Insight
Address CPC4 Capital Park Fulbourn Cambridge CB21 5XE
Period of the Project 2-3 months between June and September
Project Open to Undergraduates, Master's (Part III) students, PhD students (please note that it is unusual for PhD students to apply for CMP projects)
Deadline to Register Interest February 22
Brief Description of the Project The Office for Data Release (ODR), as part of Public Health England, is responsible for providing a common governance framework for responding to requests to access data held by PHE for secondary purposes, including service improvement, surveillance and ethically approved research. The ODR is responsible for ensuring data governance and protection principles are applied to each release. More information on the role of the ODR can be found here https://www.gov.uk/government/publications/accessing-public-health-engla... A large proportion of the data releases overseen by the ODR are to access cancer data. ODR staff work closely with analysts from PHE's National Cancer Registration and Analysis Service (NCRAS) to respond to these requests. The data for servicing these requests is held in a large collection of linked datasets in Oracle databases. ODR and NCRAS have identified that the tasks undertaken in processing requests are similar across all requests. These include: cohort definition, data extraction, data linkage, identifiability checks, pseudonymisation, aggregation, quality assurance and metadata production. However, a lot of this work is duplicated for each new request and is subject to differing interpretation. ODR have initiated a programme of work to develop automation tools to support and standardize this work, with various deliverables identified, including: automated SQL code production for extracting and pseudonymising data according to customer cohort definitions; automated reporting on potential disclosure risk; testing of disclosure against published standards; reporting on options for data minimisation; production of scripts to apply statistical disclosure control;
production of documentation to accompany products (user guides, standard operating procedures, etc.). Although the initial focus will be on cancer data, a further aim is that the model developed through this project would provide an exemplar to support the release of data from other data assets held by PHE. We are offering an intern placement to help deliver this useful work. The placement would suit an individual with a good analytical background who enjoys problem solving and attention to detail. It offers an opportunity to support a programme of work with clearly defined expectations and to deliver operational software solutions. The outputs from this placement will improve the efficiency of both the ODR and PHE analytical teams, and provide an excellent opportunity for the intern to develop skills and knowledge whilst also demonstrating competency through successful project delivery. Responsibilities: Understand and document the analytical needs and process of data release in terms of standard operating procedure. Plan and prepare scripts and interfaces supporting each part of this work. Improve the user experience for both analysts and external data requesters. Support other analytical and ODR work and identify other potential process improvements. Develop a good understanding of anonymisation and data minimisation techniques.
Skills Required Data analysis and programming experience, preferably using R, Excel, SQL; ability to work well in a team; enthusiasm and willingness to learn; experience working with relational databases
Skills Desired Knowledge and interest in cancer incidence and treatment. Programming experience in a scripting language. Project management and delivery
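
One of the deliverables listed in the project above, automated SQL production for extracting and pseudonymising data from a cohort definition, might look roughly like the sketch below. It uses an in-memory SQLite database as a stand-in for the Oracle databases named in the proposal, and the table name `tumours`, the column names, the keyed-hash pseudonymisation scheme and the demo key are all hypothetical choices made for illustration.

```python
import hashlib
import hmac
import sqlite3

# Hypothetical whitelist of columns that may appear in a release.
ALLOWED_COLUMNS = {"patient_id", "tumour_site", "diagnosis_year", "age_at_diagnosis"}


def build_query(columns, filters):
    """Return (sql, params) for a cohort extract.

    Column names are checked against the whitelist; filter values are passed
    as bound parameters rather than pasted into the SQL text.
    """
    for name in list(columns) + list(filters):
        if name not in ALLOWED_COLUMNS:
            raise ValueError(f"column not permitted: {name}")
    sql = f"SELECT {', '.join(columns)} FROM tumours"
    if filters:
        sql += " WHERE " + " AND ".join(f"{name} = ?" for name in filters)
    return sql, list(filters.values())


def pseudonymise(value, key):
    """Replace an identifier with a truncated keyed hash (HMAC-SHA256)."""
    return hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def extract(conn, columns, filters, key):
    """Run the generated query and pseudonymise patient identifiers."""
    sql, params = build_query(columns, filters)
    records = []
    for row in conn.execute(sql, params).fetchall():
        record = dict(zip(columns, row))
        if "patient_id" in record:
            record["patient_id"] = pseudonymise(record["patient_id"], key)
        records.append(record)
    return records


# Demo data (entirely made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tumours (patient_id TEXT, tumour_site TEXT, "
    "diagnosis_year INTEGER, age_at_diagnosis INTEGER)"
)
conn.executemany(
    "INSERT INTO tumours VALUES (?, ?, ?, ?)",
    [("P001", "brain", 2016, 54), ("P002", "brain", 2016, 61),
     ("P003", "ovarian", 2016, 47), ("P004", "brain", 2015, 39)],
)
records = extract(
    conn,
    ["patient_id", "age_at_diagnosis"],
    {"tumour_site": "brain", "diagnosis_year": 2016},
    key=b"demo-secret-key",
)
```

Generating the query from a structured cohort definition, rather than writing it by hand for each request, is what allows the same definition to drive the later steps (disclosure checks, documentation) consistently.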

 

Development of an epidemiology toolkit for rare cancer data using the National Cancer Registry

Contact Name Paul Clarke
Contact Email paul.clarke@phe.gov.uk
Company Name Health Data Insight
Address 5 St Philip's Place Birmingham
Period of the Project 2-3 months between June and September
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest February 22
Brief Description of the Project We are seeking a candidate to develop a new epidemiology toolkit that will enable NCRAS analysts to perform new analyses and methods with ease, which will increase the efficiency of our work. The toolkit will be targeted at rare cancers because of the issues with small numbers; if the methodology works with small numbers, the method(s) can be extended to larger cohorts. Part A The candidate will assess the current survival methodologies and whether they are feasible for rare cancers, which have very small numbers. This is because the standard non-parametric methods usually only work well with groups that have a large sample size. A new adaptation (Brenner's alternative) has recently been coded in Stata to allow non-parametric methods to cope with producing net survival for very small groups, but this still has its disadvantages: namely, the final output must be age-standardised and the method requires there to be at least 1 person in each defined age group. The aim of the first part of this internship programme is to (1) assess the viability of the current non-parametric methods in the production of survival and mortality statistics in rare cancers and (2) extend or develop new methods to accurately estimate survival and mortality for such small groups. Part B The candidate will fit simple regression models to assess the pattern of incidence over time and either develop a new model or adapt APC models to accurately project incidence into the future. This will be developed for the rare cancer setting first, as the method will work for larger cohorts if it is robust enough for small cohorts. The aim of the second part of this internship programme is to (1) assess the viability of trends of incidence over time for rare cancer sites and (2) extend or develop new models to accurately project incidence into the future.
Summary A combination of the following, depending on the intern's interests, or an alternative project if proposed by the intern and mutually agreed: Structured querying of national cancer database. Analyse cancer cohorts to produce survival estimates using current non-parametric methods. Analyse cancer cohorts to assess trends in data and to produce new models of incidence projection. Liaise with cancer site-specific leads within NCRAS to discuss expectation of the results. Extend or synthesis a new method for estimating survival for rare cancers. Compare new developed method to current methods and results. Conduct a sensitivity analysis where appropriate. Produce a ˜toolkit' program in Stata (similar to MATA) that will be circulated to the NCRAS analysts to use.
Skills Required The project involves analysing datasets containing anonymised personal information, so information governance training will be provided. Creativity and an interest in cancer research are expected. Some knowledge of mathematics, statistics and probability are required.
Skills Desired An interest in statistical theory and in particular, experience with survival or mortality analyses. Experience using statistical software (such as Stata) would be highly beneficial for this internship.
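
As an illustration of why small cohorts are hard for non-parametric survival methods, the following is a minimal sketch of the Kaplan-Meier estimator. It estimates overall (all-cause) survival, not the net survival or Brenner's alternative mentioned in the description, and it is not the NCRAS implementation. The function name and the six-patient cohort are hypothetical.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with an observed death.

    times  : follow-up time for each patient
    events : 1 if a death was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    exits = Counter(times)  # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical cohort of six patients (follow-up in months).
times = [2, 3, 3, 5, 8, 8]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

With only six patients, each death moves the estimate by a large step, which is the small-numbers problem the description refers to.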

 

Data visualisation for improved engagement with public cancer data

Contact Name Paul Clarke
Contact Email paul.clarke@phe.gov.uk
Company Name Health Data Insight
Address CPC4 Capital Park Fulbourn Cambridge CB21 5XE
Period of the Project 2-3 months between June and September
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest February 22
Brief Description of the Project The Get Data Out programme at Public Health England is seeking a data visualisation intern to derive insights from and enable public engagement with openly-available data on cancer. The Get Data Out programme routinely publishes cancer statistics produced by PHE in a consistent Standard Output Table - a table that collects patients into groups with common characteristics, and then publishes information such as incidence, survival, treatment rates and routes to diagnosis for these standard groups. Currently Get Data Out covers brain, ovarian, pancreatic and testicular tumours, and we hope to expand this output in the near future. All data and metadata can be found on our website: https://cancerdata.nhs.uk/standardoutput. Get Data Out has been welcomed by the cancer community, including analysts and charities focussed on rare and less common cancers, but we believe that more can be done to make these data accessible to as wide an audience as possible - and we hope to use data visualisation to improve engagement with this information. You will have scope to influence the output of the project based on your own preliminary findings, existing skills and learning preferences. Most of our existing projects use R and RShiny. Responsibilities: - Use visualisation and data analysis tools to explore Get Data Out data. - Design and produce visualisations based on Get Data Out data. These could include: o Exploratory visualisations to enable analysts and researchers to seek out something new from this dataset, o Explanatory visualisations to engage the public with particular findings from the data, o Dashboards to give service commissioners, charities and clinicians an overview of key information, o Infographics and other public engagement tools based on the data. - Establish standard procedures and documentation to enable future sustainability of visualisations.
Skills Required Education in a relevant discipline, including data analysis and visualisation. Some knowledge of mathematics and statistics. Some programming experience, preferably using R or Python to work with data. Creativity and an interest in cancer research. Enthusiasm and willingness to learn. Verbal, written and data communication skills.
Skills Desired Data visualisation tools, especially with a web focus (RShiny or d3.js a particular bonus).

 

Learning from failure: overcoming survivorship bias

Contact Name Maciej Hermanowicz / Shahla Salehi
Contact Email maciej.x.hermanowicz@gsk.com
Company Name GlaxoSmithKline
Address GSK R&D Hub, Gunnels Wood Road, Stevenage
Period of the Project 8-12 weeks
Project Open to Part III students, PhD students
Deadline to Register Interest February 22
Brief Description of the Project The Misconception: You should focus on the successful if you wish to become successful. The Truth: When failure becomes invisible, the difference between failure and success may also become invisible (https://youarenotsosmart.com/2013/05/23/survivorship-bias/). Biopharmaceuticals (or Biopharms) are any medicines or other medical products manufactured (expressed) in biological systems rather than synthesised chemically. Across the pharmaceutical industry, significant efforts have been undertaken to iteratively improve Biopharm expression platforms, largely as a consequence of optimising growth assays and cell engineering. However, the vast majority of our antibody-expressing cell lines generated internally are ultimately deemed 'unsuitable' due to failure to achieve desired expression, growth, stability or product quality criteria. Given the vast numbers of cell lines that fall into this category, a wealth of unmined data exists that could, and should, provide crucial information as to the root causes underpinning such failure. At present this data is confined to history as efforts are focused on our best-performing cell lines and understanding what makes them stellar. With thousands of cell lines and tens to hundreds of thousands of associated data points, mining of this 'forgotten' data set using machine learning capabilities presents an opportunity to identify patterns, trends and relationships that are occurring in our 'failed' cell lines. The successful applicant will work with a multidisciplinary team to investigate both historical and newly acquired data in pursuit of the failure state signatures. 
The student will be setting up multi-omics data integration and analysis workflows, combining results from diverse experiments including metabolomics, transcriptomics, time-series data monitoring bioreactor production and high-throughput imaging from a parallelized nanofluidics Beacon system (https://www.youtube.com/watch?v=6gTGJhja0oI). Outputs of this project will present novel opportunities to unravel the plethora of root causes underlying cell line failure and leverage them to design an expression platform truly pre-programmed for success.
Skills Required familiarity with coding; mathematical modelling; Excellent communication; Knowledge/interest in Biology and medicine
Skills Desired proficiency in Python, sparse data analysis, Bayesian modelling

 

Analysis and quality of synthetic radiotherapy data

Contact Name Paul Clarke
Contact Email paul.clarke@phe.gov.uk
Company Name Health Data Insight
Address CPC4 Capital Park Fulbourn Cambridge CB21 5XE
Period of the Project 2-3 months between June and September
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest February 22
Brief Description of the Project As part of the Simulacrum project, Health Data Insight CIC is developing a synthetic version of the National Radiotherapy Dataset. The National Radiotherapy Dataset has been collected and organised by Public Health England since April 2016. The purpose of the Radiotherapy Dataset is to collect consistent and comparable data across all providers of radiotherapy services in England in order to provide intelligence for service planning, commissioning, clinical practice and research and the operational provision of radiotherapy services across England. The synthetic radiotherapy dataset is in the early stages of development. As part of the Simulacrum team, you will run quality checks on the synthetic data. Some of these checks will involve computing metrics of comparison between distributions in the real and synthetic data. In addition you will use clinical insight on radiotherapy treatment to sense check the synthetic data and then advise on potential improvements. You will analyse and critique the standard techniques for generating synthetic data, and devise new and innovative approaches for fixing specific issues. During this internship you will work very closely with analysts and developers operating at the forefront of synthetic data development. Your solutions for improving the synthetic data will be tested and the development team will work to implement your solutions in the generation process. There will also be opportunities to learn about the details of synthetic data generation and some of the mathematical techniques used to protect patient confidentiality. 
Responsibilities Explore and understand the synthetic radiotherapy dataset and how it relates in structure and content to the National Radiotherapy Dataset, using clinical insight and knowledge of cancer data. Run quality checks on the synthetic data, computing metrics of comparison between real and synthetic data. Advise the Simulacrum team on potential improvements to the generation process for synthetic radiotherapy data. Work closely with the Simulacrum development team to test your solutions and implement your fixes in the synthetic data generation process.
Skills Required Data analysis experience - preferably using SQL but also R, Python or Excel. Programming experience in at least one scripting language - Python, Ruby, Matlab, R etc. Ability to work well in a team. Enthusiasm and willingness to learn.
Skills Desired Knowledge and interest in clinical practice, particularly with regard to radiotherapy treatment for cancer. Programming experience in a querying language e.g. SQL, Postgres. Experience working with relational databases.
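
The quality checks described above involve metrics that compare distributions in the real and synthetic data. As a minimal sketch of what such a metric could look like for one categorical field, the code below computes the total variation distance and the Jensen-Shannon divergence. These are two common choices, not necessarily the ones the Simulacrum team uses. The category labels and counts are made up for illustration.

```python
import math
from collections import Counter


def proportions(values):
    """Empirical probability of each category in a list of values."""
    counts = Counter(values)
    total = len(values)
    return {k: c / total for k, c in counts.items()}


def total_variation(real, synthetic):
    """Half the summed absolute difference in category proportions (0 to 1)."""
    p, q = proportions(real), proportions(synthetic)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def jensen_shannon(real, synthetic):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    p, q = proportions(real), proportions(synthetic)
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mix(a):
        # Only categories with non-zero probability in `a` contribute.
        return sum(a[k] * math.log2(a[k] / mix[k]) for k in a)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)


# Hypothetical values of one categorical field in each dataset.
real = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
synthetic = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
tv = total_variation(real, synthetic)
js = jensen_shannon(real, synthetic)
print(round(tv, 3), round(js, 4))
```

Both metrics only look at one field at a time; checking that relationships between fields are preserved would need joint distributions or other tests.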

 

Developing Machine Learning Models for Cancer Prediction and Patient Phenotyping

Contact Name Paul Clarke
Contact Email paul.clarke@phe.gov.uk
Company Name Health Data Insight
Address CPC4 Capital Park Fulbourn Cambridge CB21 5XE
Period of the Project 2-3 months between June and September
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest February 22
Brief Description of the Project Health Data Insight has worked with Public Health England and the NHS Business Services Authority to develop the methodology to create a database of England's primary care prescriptions data. This has been linked to the Cancer Analysis System (CAS), a national database of all cancer diagnoses and treatment in England. The aim of the Index of Suspicion project is to use machine learning to identify patterns in medication prescribed prior to the diagnosis of cancer and other patient data to derive an “index of suspicion” that will predict when a patient is at increased likelihood of developing subsequent cancer. As an intern working on this project, you will work with a team of analysts and developers to further develop and improve upon the current machine learning methods and algorithms to better understand prediagnostic prescribing and to strengthen prediagnostic indicators of cancer into a strong predictive signal. Key difficulties of the project are the size and complexity of the prescriptions dataset, with over a billion rows, and the general issues that arise from working with real patient data. As part of this internship, you will have the opportunity to learn a range of skills in machine learning in healthcare and data analysis, as well as core transferrable skills such as working as part of a team, managing and delivering projects, and developing technical solutions to meet the needs of the HDI-NCRAS team and the patients, clinicians, and other individuals and organisations who will use the findings of the work. The internship also offers valuable experience working in the competitive data science industry. 
Possible specific directions for the project would be given by combinations of the following, or alternative directions suggested, to be mutually agreed: Develop methodologies and computational algorithms to better understand and refine the prediagnostic prescriptions signals for cancer identified by the machine learning models. With these, identify, extract, and strengthen patterns in prescriptions data indicative of future cancer diagnosis. Use high-power computing resources to develop methodologies to scale current and future machine learning and computational methods to increasingly large datasets. Develop visualisations of the patient data, specifically for aims such as better identification of prediagnostic prescription markers of cancer and to explore prescribing to the machine learning cohorts. Work with the analytical team to incorporate other cancer registry datasets into the machine learning models. Work with clinicians to develop clinically-led computational algorithms and machine learning methods, and/or better refine the models for the specific nature of prescriptions data and of cancer as a disease. Investigate alternative machine learning approaches that meet the needs of the problem. Responsibilities Data extraction and analysis using Oracle SQL, R, Python, or other appropriate languages. The refinement and development of computational algorithms and machine learning techniques. Implementation of algorithms and machine learning methods using R, TensorFlow and/or Keras, Python, Theano, or other suitable languages or packages. Planning and managing the delivery of the project. Collaborative working with the HDI-NCRAS team as well as clinicians and other individuals and organisations. Sharing of technical knowledge with the wider analytical team through documentation, peer-to-peer learning, and seminars. Supporting other analytical work within the team.
Skills Required Interest in data science and machine learning. Creativity and an interest in cancer research. Enthusiasm and willingness to learn. Interest in working as part of a team. Demonstrable ability to complete the project to a high standard.
Skills Desired Experience with a relevant programming language or library, e.g. TensorFlow, Keras, Theano, R, Python. A mathematical or computer science background. Experience with SQL or other query language. Experience working with large datasets.

 

Visualisation: Unlocking the Potential of Cancer Data

Contact Name Paul Clarke
Contact Email paul.clarke@phe.gov.uk
Company Name Health Data Insight
Address CPC4 Capital Park Fulbourn Cambridge CB21 5XE
Period of the Project 2-3 months between June and September
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest February 22
Brief Description of the Project From informed patient choices and symptom awareness campaigns to communicating complex discoveries in cancer research to the clinicians who will decide an individual's treatment options, high quality patient care relies on effective communication. A good visualisation has the power to communicate a message with far greater impact than text or raw data: It can highlight key points, provide easy summaries and comparisons, and reveal patterns or trends over time. Health Data Insight works with Public Health England's National Cancer Registration and Analysis Service (NCRAS) to generate new insights into healthcare data to improve outcomes in healthcare. NCRAS aims to collect data on all cases of cancer diagnosed in England for the purposes of improving cancer services and outcomes, improving patient care, and to complete in-depth research into understanding all aspects of cancer causes, symptoms, progression, and treatment effects. This data is stored in large, linked datasets, the aggregate of which contains information on all areas of the cancer pathway. The aim of this internship will be to work with a team of analysts to come up with and develop a high-impact visualisation(s) to convey a key message(s) identified from the cancer datasets. This internship will combine strong technical skills with a high level of creativity and innovation, and will provide the opportunity to gain experience working in the competitive data science industry and to develop skills in a wide range of aspects of a collaborative working environment working with a large public sector organisation, including managing and delivering projects, working as part of a team, and developing technical solutions to meet the needs of both the HDI-NCRAS team and the patients, clinicians, and other individuals and organisations who will use the visualisation to access and understand patient data. 
Principal questions to consider would be: What are the key messages that patients, their families, and clinicians need to know? How can we effectively convey these messages in a clear, engaging, and user-friendly way using the tools available (R, D3.js, JavaScript, or p5.js)? How can we implement the ideas precisely and to a very high standard using appropriate technologies to create high-quality, professional visualisations? How can we use text and design to enhance the image? The role will include: Liaising with members of the HDI-NCRAS team to establish areas where visualisation would be beneficial and identify key messages and to identify and work to a chosen visualisation's requirements. Planning and managing the delivery of the project. Designing high-impact visualisations to effectively convey one or more identified messages. Data extraction and analysis using Oracle SQL, R, Python, or other appropriate languages. Technical implementation of the visualisation in a relevant language, e.g. R, D3.js, JavaScript, p5.js. Sharing of technical knowledge with the wider analytical team through documentation, peer-to-peer learning, and seminars. Supporting other analytical work within the team.
Skills Required Interest in data science and visualisation. Creativity and an interest in cancer research. Enthusiasm and willingness to learn. Interest in working as part of a team. Demonstrable ability to complete the project to a high standard.
Skills Desired A mathematics or computer science background. Experience with SQL or other query language. Experience with a relevant programming language or library, e.g. R, D3.js, JavaScript, p5.js. Experience with data visualisation or web design.

 

Tracking the beat in heart cell videos

Contact Name Maciej Hermanowicz / Will Stebbeds
Contact Email maciej.x.hermanowicz@gsk.com
Company Name GlaxoSmithKline
Address GSK Medicines Research Centre Gunnels Wood Road Stevenage Herts SG1 2NY
Period of the Project 8-12 weeks
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest February 22
Brief Description of the Project Potential toxic side effects of drugs on human heart function could be found by interpretation of the function of beating heart cells in a dish. Microscope-based imaging systems can be used to capture images of these beating heart cells at up to 10 frames per second and can therefore be used to assess the beating in real time (see https://www.youtube.com/watch?v=13Hdy-xfhoc for an example). The major limitation to using such a system is the lack of appropriate commercial solutions for handling these large, time-resolved, multiparametric data sets. The student will work as part of our multidisciplinary team and the work will consist of: - Analysing image stacks corresponding to beating heart cells. - Assessing various methods of tracking movement and contraction of cells, for example techniques derived from information theory such as patch-based measurements [1], or treating the image stacks as 3D vector fields, as in optical flow methods [2]. - Upon achievement of this, the student would then be supported in designing a user interface that embeds the code in a platform interface. 1. Huebsch, N. et al. Automated Video-Based Analysis of Contractility and Calcium Flux in Human-Induced Pluripotent Stem Cell-Derived Cardiomyocytes Cultured over Different Spatial Scales. Tissue Eng Part C Methods 21, 467–479 (2015). 2. Czirok, A. et al. Optical-flow based non-invasive analysis of cardiomyocyte contractility, Nature Scientific Reports, 10404 (2017)
Skills Required familiarity with coding; mathematical modelling; Excellent communication; Knowledge/interest in Biology and medicine
Skills Desired Python proficiency, interest / experience in computer vision

 

Exploring Deep Learning models for segmentation of MRI scans of lungs at different disease stages

Contact Name Laura Acqualagna
Contact Email laura.x.acqualagna@gsk.com
Company Name GlaxoSmithKline - AI/ML group
Address GSK Medicines Research Centre, Gunnels Wood Road, Stevenage, UK
Period of the Project 8-10 weeks
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest 3rd March 2019
Brief Description of the Project Deep learning models such as Convolutional Neural Networks have been widely shown to be successful in the radiological imaging space (MRI, CT) [1] for segmentation of organs [2] and structures of interest, e.g. lesions [3], tumours [4,5], nodules. Automated organ segmentation is an important prerequisite for many computer-aided diagnosis systems and provides a significant reduction in time and costs compared to manual segmentation by an expert radiologist. In an early study run in-house, we trained a deep learning model to automatically segment images in pre-clinical drug discovery, specifically lungs from rodent MRI scans. Our results indicate that deep learning significantly outperforms conventional methods based on computer vision algorithms, i.e. intensity thresholding and morphological operations. Consistent and robust segmentation of anatomy of interest is a critical step in image quantification, and there is historically an appreciable inter-operator variability in organ segmentation. Given the complexity of anatomy, combined with the constant motion during image acquisition, lung segmentation rules are very challenging to apply consistently across multiple users. A consistent segmentation process will potentially add greatly to robustness, thus reducing animal numbers and/or increasing sensitivity of in vivo models. The successful student will work as part of our multidisciplinary AI/ML team to: Extend the previous work by investigating deep learning techniques for the segmentation of lungs in different ranges of disease progression. This is a non-trivial problem due to many differing complexities, such as different manifestations of disease state depending on which disease model is being assessed; movement of the subject due to non-uniform breathing; scan alignment that cannot always be kept consistent; and complexity of anatomy. Carry out data curation and preparation, and design an appropriate data analysis pipeline. 
The capability to automatically segment lungs developed in this project will have a profound impact on Bioimaging operations, increasing throughput, decreasing variability and adding to the robustness of animal models. Moreover, this methodology could be potentially applied to different imaging modalities such as microCT for wider applicability and impact across the business. References 1. Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., van der Laak, J.A., Van Ginneken, B. and Sánchez, C.I., 2017. A survey on deep learning in medical image analysis. Medical image analysis, 42, pp.60-88. 2. Avendi, M.R., Kheradvar, A. and Jafarkhani, H., 2016. A combined deep-learning and deformable-model approach to fully automatic segmentation of the left ventricle in cardiac MRI. Medical image analysis, 30, pp.108-119. 3. Christ, P.F., Elshaer, M.E.A., Ettlinger, F., Tatavarty, S., Bickel, M., Bilic, P., Rempfler, M., Armbruster, M., Hofmann, F., D'Anastasi, M. and Sommer, W.H., 2016, October. Automatic liver and lesion segmentation in CT using cascaded fully convolutional neural networks and 3D conditional random fields. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 415-423). Springer, Cham. 4. Pereira, S., Pinto, A., Alves, V. and Silva, C.A., 2016. Brain tumor segmentation using convolutional neural networks in MRI images. IEEE transactions on medical imaging, 35(5), pp.1240-1251. 5. Işın, A., Direkoğlu, C. and Şah, M., 2016. Review of MRI-based brain tumor image segmentation using deep learning methods. Procedia Computer Science, 102, pp.317-324.
Skills Required 1) Knowledge of Python coding language 2) Proficiency in linear algebra, statistics and probability theory 3) knowledge of machine learning fundamentals
Skills Desired 1) knowledge of Convolutional Neural Networks 2) experience with one of the main Deep Learning frameworks e.g. Keras, Tensorflow, Caffe, PyTorch, 3) experience or interest in image processing
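
To make the comparison between segmentation methods concrete, here is a minimal sketch of the conventional intensity-thresholding baseline mentioned in the description, scored against a reference mask with the Dice coefficient. The Dice coefficient is a widely used overlap measure, but the proposal does not say which metric the group uses, so treat it as an assumption. The toy image, the threshold level and the "manual" mask are hypothetical.

```python
def threshold_mask(image, level):
    """Binary mask: 1 where the pixel intensity is at or above `level`."""
    return [[1 if px >= level else 0 for px in row] for row in image]


def dice(mask_a, mask_b):
    """Dice coefficient 2|A and B| / (|A| + |B|) for two binary masks."""
    overlap = sum(a & b
                  for row_a, row_b in zip(mask_a, mask_b)
                  for a, b in zip(row_a, row_b))
    size_a = sum(map(sum, mask_a))
    size_b = sum(map(sum, mask_b))
    if size_a + size_b == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * overlap / (size_a + size_b)


# Toy 3x4 "scan" in which the structure of interest is bright.
image = [
    [0, 10, 12, 0],
    [0, 11, 3, 0],   # one low-intensity pixel inside the structure
    [0, 9, 13, 0],
]
manual = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
auto = threshold_mask(image, level=8)
score = dice(auto, manual)
print(round(score, 3))
```

The same `dice` function could be applied to masks drawn by two different operators to quantify the inter-operator variability mentioned above.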

 

Online imaging under extreme conditions

Contact Name Priscilla Canizares
Contact Email pcanizares@slb.com
Company Name Schlumberger Cambridge Research
Address Schlumberger, High Cross, Madingley Road, Cambridge, CB3 0EL, United Kingdom
Period of the Project 8-10 weeks
Project Open to Part III students, PhD students
Deadline to Register Interest January 28
Brief Description of the Project Real time imaging in boreholes under construction is fundamental to assess their robustness and avoid environmental disasters and human and economic losses. However, this is a very challenging problem that requires designing specialized data compression algorithms able to perform with small computational resources and low encoding rate. The student will have the opportunity to collaborate closely with the research group at Schlumberger in improving and enhancing current borehole imaging algorithms with novel image compression techniques. This is a well defined project, but there is scope for the student to contribute with their own approaches.
Skills Required Knowledge of sampling theory, Compressed Sensing and Mathematical Signal Processing
Skills Desired Knowledge of Matlab and/or Python

 

Study of the effect of noise and sparsity on residual CNNs

Contact Name Harry Clifford
Contact Email  
Company Name Cambridge Cancer Genomics
Address 72 Hills Road, Cambridge, CB2 1LA
Period of the Project 8 weeks
Project Open to Part III students, PhD students
Deadline to Register Interest February 22
Brief Description of the Project DNA sequencing data can be represented as what are known as pileup images. These are pictures of sorted chunks of DNA for a given region in the genome. When it comes to determining if a position contains a DNA mutation, CCG uses convolutional neural networks applied to these images to solve a classification problem. When we run a pipeline that outputs these pileup images, we can decide either to compress and minimize the dimensionality, or to use the raw information with higher dimensionality. We can also decide to generate more or less sparsity and noise in the pileups fed to the CNNs. We would like to: 1 - define the most relevant mathematical tool to quantify noise and sparsity; 2 - define the relationship between those metrics and the performance of a CNN.
Skills Required Experience applying neural networks to data; autonomy; highly motivated
Skills Desired  
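
As a starting point for the first question, the sketch below computes three candidate measures on a small matrix standing in for a pileup image: the fraction of empty cells, the Hoyer sparsity measure (based on the ratio of the L1 and L2 norms), and the fraction of covered cells that disagree with the most common value in their column as a crude noise proxy. The encoding (0 for empty, positive integers for bases) and the choice of measures are assumptions made for illustration. They are not CCG's representation or metrics.

```python
import math
from collections import Counter


def zero_fraction(matrix):
    """Share of cells that are empty (encoded as 0)."""
    cells = [v for row in matrix for v in row]
    return sum(1 for v in cells if v == 0) / len(cells)


def hoyer_sparsity(matrix):
    """Hoyer measure: 0 for a uniform matrix, 1 for a single non-zero cell.

    Requires more than one cell in total.
    """
    cells = [abs(v) for row in matrix for v in row]
    n = len(cells)
    l1 = sum(cells)
    l2 = math.sqrt(sum(v * v for v in cells))
    if l2 == 0:
        return 1.0  # convention chosen here for an all-zero matrix
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)


def column_disagreement(matrix):
    """Share of non-empty cells that differ from their column's modal value."""
    mismatches = covered = 0
    for column in zip(*matrix):
        values = [v for v in column if v != 0]
        if not values:
            continue
        modal_count = Counter(values).most_common(1)[0][1]
        covered += len(values)
        mismatches += len(values) - modal_count
    return mismatches / covered if covered else 0.0


# Hypothetical pileup: rows are reads, columns are positions, 0 = no coverage.
pileup = [
    [1, 2, 3, 4],
    [1, 2, 0, 4],
    [1, 3, 0, 0],
]
print(zero_fraction(pileup), round(hoyer_sparsity(pileup), 3),
      round(column_disagreement(pileup), 3))
```

The second question could then be approached by generating pileups at controlled values of these measures and recording classifier accuracy at each setting.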

 

Force to Radiated Noise Transfer Functions

Contact Name Jake Rigby
Contact Email jake.rigby@bmtglobal.com
Company Name BMT
Address 210 Lower Bristol Rd, Bath, BA2 3DQ
Period of the Project 8 weeks between late June and 30 September
Project Open to Undergraduates, Part III students
Deadline to Register Interest February 22
Brief Description of the Project When ships move through the water they create noise that can disturb the local marine environment. This project will seek to mathematically derive the force to radiated noise transfer function, which defines how vibration on board a ship is turned into radiated noise in the water. This function is easily defined for a flat plate in the water; however, with some additional work it could also be defined for a structurally stiffened curved plate.
Skills Required  
Skills Desired  

 

Support solution optimisation using Monte Carlo Simulation

Contact Name Jake Rigby
Contact Email jake.rigby@bmtglobal.com
Company Name BMT
Address Spectrum Building, Solent Business Park, 1600 Parkway, Whiteley, Fareham PO15 7AH
Period of the Project 8 weeks between late June and 30 September
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest February 22
Brief Description of the Project This project would aim to develop a simulation learning algorithm to both model and optimise an equipment engineering support solution. Considerations may include the availability and location of resources (spares, support equipment, maintainers etc.), as well as repair locations and maintenance/repair policy. The project would involve: Definition and development of a mathematical model of a support solution (discrete event or other suitable methodology). Identification of key input parameters which drive performance*. Generation of a learning algorithm to optimise the model by self-analysing results and varying the key input parameters identified in order to achieve an optimal solution. Note that current methods within industry often focus on deterministic techniques considering one parameter at a time (e.g. number of spares). Therefore the impact of variances and dependencies within the supply chain (e.g. variable repair times) may go unknown. An additional objective of this project may be to quantify the benefits of an integrated model. * It may be possible to automatically derive key input parameters based on a separate modelling algorithm.
Skills Required  
Skills Desired  
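
The kind of model described above can be sketched in a few lines. The following is a deliberately small Monte Carlo simulation of one piece of equipment supported by a pool of spares, with random failure and repair times, used to estimate availability as a function of the number of spares. Everything specific here is an assumption made for illustration: exponential failure and repair times, instant swaps, unlimited repair capacity, and the parameter values. It is not BMT's model.

```python
import heapq
import random


def simulate_availability(spares, horizon, mtbf, mean_repair, rng):
    """One run: fraction of the horizon during which the equipment is up."""
    clock = 0.0
    downtime = 0.0
    in_repair = []  # min-heap of times at which repaired units come back
    while True:
        clock += rng.expovariate(1.0 / mtbf)  # operate until the next failure
        if clock >= horizon:
            break
        # The failed unit goes to repair for a random duration.
        heapq.heappush(in_repair, clock + rng.expovariate(1.0 / mean_repair))
        # Units whose repair has already finished rejoin the spares pool.
        while in_repair and in_repair[0] <= clock:
            heapq.heappop(in_repair)
            spares += 1
        if spares > 0:
            spares -= 1  # swap in a spare immediately, no downtime
        else:
            # No spare on the shelf: wait for the first unit out of repair.
            ready = heapq.heappop(in_repair)
            downtime += min(ready, horizon) - clock
            clock = ready
            if clock >= horizon:
                break
    return 1.0 - downtime / horizon


def estimate(spares, runs=200, seed=1):
    """Average availability over many independent runs."""
    rng = random.Random(seed)
    results = [simulate_availability(spares, 2000.0, 100.0, 20.0, rng)
               for _ in range(runs)]
    return sum(results) / runs


for n in range(4):
    print(n, "spares -> availability", round(estimate(n), 3))
```

An optimiser in the sense of the proposal would wrap `estimate` in a search over several inputs at once (spares, repair policy, locations) rather than varying one parameter at a time as this loop does.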

 

Multiparametric Imaging for Tumour Characterisation

Contact Name Marius de Groot
Contact Email marius.x.de-groot@gsk.com
Company Name GSK R&D, Clinical Imaging
Address GSK, Addenbrooke's Hospital, Hills Road, Cambridge CB2 0GG / GSK, Gunnels Wood Road, Stevenage SG1 2NY
Period of the Project 8 weeks between July 8 and end of September
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest February 22
Brief Description of the Project To identify the optimal therapy for the patient, tumours are often characterised on the basis of histology and stratified into sub-categories that have a different clinical pathway associated with them. Advanced magnetic resonance imaging (MRI) techniques like dynamic contrast-enhanced (DCE) MRI and diffusion-weighted (DW) MRI capture some of the microstructural properties of the tissue that are relevant in these characterisations. Perhaps the use of multiparametric MRI might one day obviate the need for biopsies? What's more, while relying on invasive biopsies it is difficult to characterise heterogeneity in the tumour pre-surgically. Imaging techniques, on the other hand, allow tissue characterisation across the entire tumour. These techniques may therefore have potential value to aid in clinical decision making and in research, but many questions remain. In this project we will focus on developing an end-to-end characterisation solution for grading tumours on the basis of multiparametric imaging. In principle we will focus on the PROSTATEx challenge (https://prostatex.grand-challenge.org/), which is an image analysis challenge to predict Gleason scores for pre-surgical prostate imaging on a publicly available dataset. However, other Grand-Challenges may be considered if appropriate and of particular interest for the student. The PROSTATEx challenge comprises a set of 3D MR images per patient, including structural imaging, DCE and DW MRI modalities. For each tumour in the training set (204 patients), the clinical significance of the lesion is available. This should allow training of a model needed to predict the scores for the 140 test patients. Evaluation is done through the public grand-challenge framework. While the aim of the project is clear, there are multiple parts of the solution that need to be defined and multiple approaches that could be considered. 
The many millions of pixels per patient present a very high-dimensional feature space, for which a deep-learning architecture such as a convolutional neural network can be trained and optimised, but a traditional machine-learning approach with crafted features is similarly possible. From our perspective, this project is open to deep-learning approaches as well as to more traditional ones. Publication of results is encouraged.
Skills Required Basic programming and machine learning experience, preferably in Matlab or Python.
Skills Desired Some experience with medical image analysis. Experience handling large multidimensional datasets.

 

Lightning Location

Contact Name Malcolm Kitchen and Ed Stone
Contact Email malcolm.kitchen@metoffice.gov.uk
Company Name The Met Office
Address Met Office, Fitzroy Road, Exeter EX1 3PB
Period of the Project 8 weeks
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest February 22
Brief Description of the Project

An existing method for locating lightning strikes around the world is being upgraded. The system employs a distributed network of receiving stations to record the arrival times of radio signals associated with lightning. The lightning location is calculated by triangulation from the times of arrival at the different receivers, using a method which minimises the uncertainty in the resulting lightning location. Before the triangulation step can proceed, it is necessary to correctly assign a set of signals arriving at the different receiving stations to a given lightning event. This assignment is not straightforward, because the chronological order of arrival of signals is not necessarily the same as the chronological order of the lightning strikes. 'Sledgehammer' iterative schemes are available to perform the assignment, but as the overall sensitivity of the system increases, these become computationally expensive. The project is to devise a more efficient method that exploits the available constraints/reasonable inferences to maximise the number of correct assignments and the number of lightning strikes that are accurately located.
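To make the location step concrete, here is a minimal sketch of finding a strike from arrival times once the signals have already been assigned to a single event. It is not the Met Office method: it assumes a flat two-dimensional geometry in kilometres, signals travelling at the vacuum speed of light, noise-free timings, and a simple coarse-to-fine grid search. The names `misfit` and `locate` and all parameter values are hypothetical.

```python
import math

C = 299_792.458  # assumed propagation speed in km/s (speed of light in vacuum)


def misfit(pos, stations, arrivals):
    """Sum of squared timing residuals for a candidate strike position."""
    travel = [math.dist(pos, s) / C for s in stations]
    # For a fixed position, the best-fitting emission time is the mean offset.
    t0 = sum(t - d for t, d in zip(arrivals, travel)) / len(arrivals)
    return sum((t - t0 - d) ** 2 for t, d in zip(arrivals, travel))


def locate(stations, arrivals, centre=(0.0, 0.0), half_width=2000.0, levels=12, n=20):
    """Coarse-to-fine grid search over a square region (flat toy geometry)."""
    best = centre
    for _ in range(levels):
        step = 2 * half_width / n
        candidates = [
            (centre[0] - half_width + i * step, centre[1] - half_width + j * step)
            for i in range(n + 1)
            for j in range(n + 1)
        ]
        best = min(candidates, key=lambda p: misfit(p, stations, arrivals))
        # Re-centre on the best grid point and halve the search window.
        centre, half_width = best, half_width / 2
    return best


# Synthetic example: five stations and one strike at a known position.
stations = [(0.0, 0.0), (1000.0, 0.0), (0.0, 1000.0), (1000.0, 1000.0), (500.0, 1500.0)]
true_pos, emitted_at = (312.5, 640.0), 0.25
arrivals = [emitted_at + math.dist(true_pos, s) / C for s in stations]
estimate = locate(stations, arrivals, centre=(500.0, 500.0), half_width=1000.0)
```

The sketch deliberately leaves out the assignment problem that the project targets: if signals from different strikes are mixed into `arrivals`, the misfit no longer has a meaningful minimum.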

Please note that this project is eligible for partial bursary support; interested students should apply just as for academic projects.

Skills Required Some experience of programming is essential. Knowledge of Python and Linux would be helpful.
Skills Desired The ideal candidate would have an interest in and some previous experience in data modelling or remote sensing.

 

Making sure that your next Amazon order doesn't cost the Earth!!

Contact Name Martin Robinson
Contact Email Martin.Robinson@transfaction.com
Company Name Transfaction
Address 3 Southern Belle Close, BSE, IP32 7PA
Period of the Project 3 months (June - September 2019)
Project Open to Undergraduates, Part III students, PhD students
Deadline to Register Interest February 28th
Brief Description of the Project

Background: The Freight Transport Association Annual Report (2017) stated that it costs £1 per minute to run a 44-tonne heavy goods vehicle (HGV) and trailer. And yet trucks run empty 27% of the time! We can prove the true level of waste is higher! The logistics & transport sector is plagued by uncertainty (unreliability) and this leads to (excuses) excessive waste.

Opportunity: Transfaction is dedicated to reducing uncertainty (unreliability) and to cutting waste in logistics & transport operations. This is an opportunity to work with a start-up and, as funding becomes available, possibly to join the co-founder team. Contribute to making a real difference to the way logistics & freight transport is transacted!

Project Description: Several tasks are to be completed by several resources, but the number (and types) of eligible resources might not be known and may need to be determined from the data. These tasks and resources are affected by several influencers, which first need to be identified. What is directly known from the data is the time when each task arrived and the time when it completed. However, the duration between arrival and completion needs splitting into indirect time, when the task was waiting to be worked on, and direct time, when the task was being worked on. The split is not known. Also, the tasks could have been split into several phases, and these splits also need to be determined. Once the number (and types) of resources, the influencers and the splits are known and modelled, advice on the least wasteful match of task to resource can be given. The final goal, which this project will help to define, is a system which will predict and react in real time to new tasks arriving or influencers changing, and update the schedule and the model if necessary.

Problem example: The first example is trucks arriving at a yard that need to be processed. Influencers can be the type of truck, type of cargo, time of day and week, weather conditions, etc. The second example is trucks on the road, either delivering or being re-sited. Influencers are the type of road (urban, rural, motorway, etc.), road conditions (congestion, roadworks, etc.), time of day and week, weather conditions, regulations, etc.
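As a toy illustration of the indirect/direct split for the yard example, the sketch below makes a strong simplifying assumption that the project itself does not make: a single resource serves trucks one at a time, strictly in arrival order, and starts each truck as soon as possible. Under that assumption the split follows directly from the arrival and completion times. The function name `split_durations` and the sample data are hypothetical.

```python
def split_durations(tasks):
    """Split each task's total time into (indirect, direct) parts.

    `tasks` is a list of (arrival, completion) times. Assumes one resource
    working first-come-first-served with no idle time while tasks wait.
    """
    result = []
    free_at = float("-inf")  # time at which the resource becomes available
    for arrival, completion in sorted(tasks):
        # Work starts once both the task and the resource are ready.
        start = max(arrival, free_at)
        result.append((start - arrival, completion - start))
        free_at = completion
    return result


# Three trucks, times in minutes: the second one waits for the first to finish.
yard_log = [(0, 5), (2, 9), (12, 15)]
splits = split_durations(yard_log)
```

With several resources of unknown number and type, this bookkeeping no longer determines the split, which is why the project treats it as something to be inferred from the data.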

Skills Required An interest in big data, machine learning and algorithmic decision-making assistance (gaming). An interest in, or understanding of, these techniques: finding intelligent ways to merge data sets; the identification of change-points; data clustering, possibly with mixture models or Dirichlet processes; probabilistic predictive / risk modelling using Bayes' rule; real-time monitoring systems (prediction correction).
Skills Desired The ability to be self-directed. The willingness to work as part of a team of Transfaction experts where ideas/approaches are pooled and worked on.

Compressed sensing at massive scale

Contact Name Andrew Thompson
Contact Email andrew.thompson@npl.co.uk
Company Name National Physical Laboratory
Address Maxwell Centre, JJ Thomson Avenue, Cambridge, CB3 0HE
Period of the Project 8 July - 30 August (8 weeks)
Project Open to Undergraduates, Master's (Part III) students, PhD students (please note that it is unusual for PhD students to apply for CMP projects)
Deadline to Register Interest February 22
Brief Description of the Project The title of the project is 'Compressed Sensing at Massive Scale'. Compressed sensing (CS) is about finding sparse solutions to systems of linear equations. NPL are interested in developing capability to solve compressed sensing problems at massive scale. Almost all CS algorithms that have been proposed have complexity which scales at least linearly with dimension, which typically limits the problems that can be practically solved to dimension 10^10. But can we go larger? Suppose we want to solve compressed sensing problems of size 10^30? This task requires CS algorithms with sublinear complexity. An algorithm called CHIRRUP (see https://people.maths.ox.ac.uk/thompson/chirrup.pdf) was recently proposed which is able to do this, but CHIRRUP only finds sparse solutions where all the nonzero coefficients are equal to 1. This project will focus on extending the CHIRRUP algorithm so that it can solve more general compressed sensing problems at massive scale. It is possible that the project may result in a journal or conference publication. The project would appeal to a student with an interest in computational mathematics and numerical algorithm development. The placement would be based in Cambridge.
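For readers new to compressed sensing, the following is a minimal sketch of finding a sparse solution x to y = Ax using orthogonal matching pursuit, a standard textbook greedy method. It is not CHIRRUP, and its cost grows at least linearly with the dimension, so it illustrates the problem rather than the sublinear regime the project is about. The helper names `solve` and `omp` and the small example matrix are hypothetical; the columns of A are assumed to be nonzero.

```python
import math


def solve(G, b):
    """Solve the small square system G c = b by Gaussian elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(G)]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    c = [0.0] * n
    for i in reversed(range(n)):
        c[i] = (M[i][n] - sum(M[i][k] * c[k] for k in range(i + 1, n))) / M[i][i]
    return c


def omp(A, y, sparsity):
    """Orthogonal matching pursuit: greedily build a support of size `sparsity`."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in cols]
    support, coeffs, residual = [], [], list(y)
    for _ in range(sparsity):
        # Pick the unused column most correlated with the current residual.
        j = max(
            (j for j in range(n) if j not in support),
            key=lambda j: abs(sum(c * r for c, r in zip(cols[j], residual))) / norms[j],
        )
        support.append(j)
        # Least-squares fit of y on the chosen columns via the normal equations.
        G = [[sum(a * b for a, b in zip(cols[p], cols[q])) for q in support] for p in support]
        rhs = [sum(a * b for a, b in zip(cols[p], y)) for p in support]
        coeffs = solve(G, rhs)
        residual = [y[i] - sum(c * cols[p][i] for c, p in zip(coeffs, support)) for i in range(m)]
    x = [0.0] * n
    for p, c in zip(support, coeffs):
        x[p] = c
    return x


# Three measurements, five unknowns, and a solution with two nonzero entries.
s = 1 / math.sqrt(2)
A = [[1, 0, 0, s, 0], [0, 1, 0, s, s], [0, 0, 1, 0, s]]
y = [3, 0, -2]
x = omp(A, y, sparsity=2)
```

Every iteration scans all n columns, which is exactly the kind of cost that becomes impossible at dimension 10^30.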
Skills Required An enthusiasm for coding, and some exposure to Matlab.
Skills Desired  

 

Developing optimisation approaches to be used with simulation of bioprocess chromatography

Contact Name Nehal Patel
Contact Email nehal.2.patel@gsk.com
Company Name GSK, Biopharmaceutical Process Development
Address GSK R&D, Medicines Research Centre, Gunnels Wood Road, Stevenage
Period of the Project 8 weeks
Project Open to Undergraduates, Part III students
Deadline to Register Interest February 22nd 2019
Brief Description of the Project

What is the best method to optimise a process which can be described using mathematical simulations? Is it to run the simulations as if you were conducting real experiments, e.g. using a design of experiment approach? Or is it to use deterministic algorithms such as gradient descent and simplex? How about non-deterministic heuristic-based algorithms such as simulated annealing and genetic algorithms? These are the types of questions that need to be answered before modelling tools can be deployed for widespread use by scientists and researchers. The Biopharm Process Research group within GSK is tasked with developing robust, reliable manufacturing processes for early stage biotherapeutic drugs. As part of the manufacturing process, new drug candidates need to be purified to ensure that they meet strict quality criteria. Typically, most purification is conducted using large scale chromatography columns; it is therefore critical to understand how these processes work and how they can be improved. Mathematical modelling of chromatography separation processes can help answer these questions. The models typically consist of a series of partial differential equations which are solved numerically to yield a simulation which can then be used to assist the scientist in making important decisions. This project will focus on better understanding the mathematical models, specifically looking to answer the following questions: 

Currently calibration of models is computationally expensive. Can model calibration be improved using better optimisation algorithms and machine-learning technologies? 

Given a representative model, what is the best way of operating the process to meet different criteria such as product quality and cost of goods? 

How confident can we be in using these models to make predictions of the real world given the uncertainty in real world experimental data used to calibrate models?  Where do the models work well? Where do they not?
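To illustrate the contrast between deterministic and non-deterministic optimisers raised in the description, here is a minimal sketch that applies gradient descent and simulated annealing to the same objective. The function `simulate` is a hypothetical one-dimensional stand-in for a chromatography simulation, chosen only because it has several local minima; the step sizes, cooling schedule and seed are likewise assumptions.

```python
import math
import random


def simulate(x):
    """Toy stand-in for an expensive simulation: a bowl with ripples (hypothetical)."""
    return (x - 2.0) ** 2 + 0.5 * math.sin(8.0 * x)


def gradient_descent(f, x0, rate=0.01, steps=2000, h=1e-6):
    """Deterministic descent using a central finite-difference gradient."""
    x = x0
    for _ in range(steps):
        grad = (f(x + h) - f(x - h)) / (2 * h)
        x -= rate * grad
    return x


def simulated_annealing(f, x0, steps=2000, temp0=1.0, scale=0.5, seed=0):
    """Random search that sometimes accepts uphill moves, less often as it cools."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    for k in range(steps):
        temp = temp0 * (1 - k / steps) + 1e-9  # linear cooling schedule
        cand = x + rng.gauss(0.0, scale)
        fc = f(cand)
        # Always accept improvements; accept worse points with Boltzmann probability.
        if fc < fx or rng.random() < math.exp((fx - fc) / temp):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = x, fx
    return best, fbest


x_gd = gradient_descent(simulate, x0=0.0)
x_sa, f_sa = simulated_annealing(simulate, x0=0.0)
```

Comparing `simulate(x_gd)` with `f_sa` for different starting points gives a feel for the trade-off: gradient descent is cheap and repeatable but can settle in a nearby local minimum, while annealing explores more widely at the cost of many more simulation calls.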

Skills Required  Good knowledge of MATLAB, Python or equivalent high-level programming languages. Ability to work both independently and as part of a wider team.
Skills Desired

 A keen interest in using maths to represent real world systems.

 Knowledge of optimisation algorithms and machine learning algorithms.

 Knowledge of solving systems of differential equations.

 

Quantum computing for molecular simulation

Contact Name Ophelia Crawford
Contact Email team@riverlane.io
Company Name Riverlane
Address 3 Charles Babbage Road, Cambridge, CB3 0GT
Period of the Project 10-12 weeks
Project Open to Part II students, Part III students, PhD students
Deadline to Register Interest 22 February
Brief Description of the Project Riverlane is a University of Cambridge spin-out who develop software and algorithms for quantum computers. In particular, we are building simulation tools for microscopic systems that accurately account for quantum effects. Such tools have applications within, for example, materials and drug discovery. The internship program will involve working on projects either individually or in a small team. There will be regular feedback meetings to discuss and monitor progress and plenty of opportunities to work closely with research staff. Typically, a project will involve:

- Reading and discussing academic papers with your teammates and converting the ideas into code.

- Designing, implementing, testing, reviewing, debugging and optimising research code in consultation with collaborators.

- Analysing data and communicating results in the form of discussions and reports.

We will work with you to develop specific project ideas depending on your background and interests.
Skills Required

- At least third-year undergraduate.

- Strong skills in mathematics and computer programming.

- Good critical and logical thinking.

- Comfortable working in a team: able to communicate ideas and write code that can be understood by others and builds on the wider project.

- Ability to work independently and manage your time well.

- A background in quantum computing is not necessary, as the relevant training will be given.
Skills Desired Background knowledge in computational chemistry, optimisation or machine learning is a plus.