Features: Faculty Insights


An international team including researchers from DAMTP has launched a new collaboration leveraging the same technology underpinning ChatGPT to build an AI-powered tool for scientific discovery.

While ChatGPT has been trained on text from many different languages, this new research collaboration's AI will learn from numerical data from many different scientific fields. The aim of the research is to help develop improved tools to aid scientists in modelling everything from supergiant stars to the Earth's climate.

Scientists are using AI tools already - it's one of the fastest-paced and most rapidly growing areas in research - but currently they have to build and train each of these tools from scratch. This presents a huge up-front cost, requiring large datasets and larger training times. The team's hope is that they can help to make AI tools for scientific research more accessible and more powerful by developing a general foundation model for scientific data, which other researchers could then fine tune for their specific needs.

The researchers launched the initiative, called Polymathic AI, in October 2023, alongside the publication of a series of related scientific papers on the open-access repository arXiv.

Foundation models - learning the language of the Universe

The idea behind Polymathic AI "is similar to how it's easier to learn a new language when you already know five languages," said Polymathic AI principal investigator Shirley Ho, a group leader at the Flatiron Institute's Center for Computational Astrophysics in New York City.

The form of AI that lies behind applications like ChatGPT is called machine learning. Here an algorithm constructs a model (a mathematical description) of some phenomenon by learning to spot patterns within a set of training data. The model (usually a neural network) can then use what it has learnt to answer questions about data it hasn't seen before. For a simple example: by looking through many labelled pictures of cats and dogs, the model can learn the patterns associated with each animal, and then identify cats and dogs even in new pictures that come without labels.
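The cats-and-dogs idea above can be sketched in a few lines of code. This is purely illustrative: the labelled examples here are hand-made two-dimensional feature vectors (hypothetical "ear length" and "snout length" measurements), not real images, and the "model" is a simple nearest-centroid classifier rather than a neural network.

```python
def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def train(examples):
    """Compute one centroid per label from labelled training data."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(pts) for label, pts in by_label.items()}

def predict(model, features):
    """Assign the label whose centroid is closest to the new point."""
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return min(model, key=lambda label: dist2(model[label], features))

# Labelled training data: (features, label)
training_data = [
    ((3.0, 2.0), "cat"), ((3.2, 2.1), "cat"), ((2.8, 1.9), "cat"),
    ((5.0, 6.0), "dog"), ((5.5, 6.2), "dog"), ((4.8, 5.8), "dog"),
]

model = train(training_data)
print(predict(model, (3.1, 2.0)))  # an unseen "cat"-like point -> cat
print(predict(model, (5.2, 6.1)))  # an unseen "dog"-like point -> dog
```

The model never sees the two test points during training; it classifies them correctly only because it has abstracted a pattern (a centroid per label) from the labelled examples.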

"Right now, in science, machine learning can be thought of as teaching a toddler to do a highly-specialised task. Our initiative aims to change this." – Miles Cranmer

However, if you start with a large, pre-trained model, known as a foundation model, it can be trained in fewer steps and to a greater accuracy than building a smaller model and training it from scratch. That can be true even if the training data isn't obviously relevant to the problem at hand. The last few years have seen significant advances in machine learning for vision and natural language processing (NLP), achieved by training general models on vast and diverse datasets, unlocking a jump in performance and capabilities.

This approach has led to the emergence of foundation models with the capacity to leverage information gathered from a variety of sources when attempting to solve unseen tasks. Foundation models are large neural networks that have typically been pre-trained on massive datasets without the use of explicit labels. The remarkable thing about this approach is that access to these larger unlabelled datasets allows the models to learn broadly useful, generalisable features that are representative of shared patterns across much more general domains. When researchers need to solve a new problem, they can then fine-tune these models quickly and with less data, because many intrinsic properties of the data distribution are already understood by the model. This improves both the accuracy and the accessibility of large-scale deep learning.
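The claim that a pre-trained starting point converges in fewer steps can be demonstrated with a deliberately tiny numerical sketch. Here we fit y = w·x by gradient descent on a "new task" (true w = 2.1), once starting from scratch (w = 0) and once from a weight "pre-trained" on a related task (w = 2.0). This is an illustration of the general principle, not Polymathic AI's actual training setup.

```python
def steps_to_converge(w, w_true, lr=0.01, tol=1e-3, max_steps=10_000):
    """Gradient descent on mean squared error over x in 1..10.

    Returns the number of steps until the weight is within `tol`
    of the true weight.
    """
    xs = [float(x) for x in range(1, 11)]
    for step in range(max_steps):
        if abs(w - w_true) < tol:
            return step
        # Gradient of 0.5 * mean((w*x - w_true*x)^2) with respect to w
        grad = sum((w - w_true) * x * x for x in xs) / len(xs)
        w -= lr * grad
    return max_steps

from_scratch = steps_to_converge(w=0.0, w_true=2.1)
pre_trained = steps_to_converge(w=2.0, w_true=2.1)
print(from_scratch, pre_trained)  # the pre-trained start needs fewer steps
```

Because the pre-trained weight already encodes most of what the new task requires, the remaining error is small and the optimiser closes the gap in a fraction of the steps; the same intuition, at vastly larger scale, is what makes fine-tuning a foundation model cheaper than training from scratch.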

The Polymathic AI collaboration hopes to tackle the challenge of accomplishing the same for applications of machine learning on scientific datasets. "Right now, in science, all machine learning can be thought of as teaching a toddler to do a highly-specialised task," says co-investigator Miles Cranmer, Assistant Professor in Data Intensive Science at the Department of Applied Mathematics and Theoretical Physics and Institute of Astronomy. "Our initiative aims to change this. We want to switch the paradigm of machine learning to teaching a general scientist – who already understands the world very well – and can pick up new tasks quicker."

"You can think about this like language translation. ChatGPT, despite not being designed for language translation, can actually perform the task better than Google Translate, for example," adds Cranmer. "It's because ChatGPT has been trained on such a vast amount of data that it already understands grammatical structure, before it starts seeing examples of language translation in its dataset."

Connections and challenges

The Polymathic AI collaboration includes researchers from the Simons Foundation and its Flatiron Institute, the University of Cambridge, New York University, Princeton University and the Lawrence Berkeley National Laboratory, bringing together expertise in physics, astrophysics, mathematics, artificial intelligence and neuroscience. Polymathic AI's project aims to learn using data from diverse sources across physics and astrophysics (and eventually, its creators hope, fields such as chemistry and genomics) and apply that multidisciplinary savvy to a wide range of scientific problems.

This involves tackling significant challenges. The research team needs to build AI models which can leverage information from heterogeneous datasets across different scientific fields. Unlike areas like natural language processing, where datasets are typically plain text, science faces the issue of learning from many diverse types of data and measurements which are represented in different ways. In addition, simply using a model exactly like ChatGPT may not work, as it has well-known limitations when it comes to mathematical precision (for instance, the chatbot says 2,023 times 1,234 is 2,497,582 rather than the correct answer of 2,496,382). 
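The contrast with conventional software is worth making explicit: exact integer arithmetic is trivial for an ordinary program, which is why a language model's mistake on it stands out. A one-line check confirms the correct product quoted above.

```python
# Python integers have arbitrary precision, so this product is exact.
print(2023 * 1234)  # 2496382
```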

"It's been difficult to carry out academic research on full-scale foundation models due to the scale of computing power required," said Miles Cranmer. "Our collaboration with the Simons Foundation has provided us with unique resources to start prototyping these models for use in basic science, which researchers around the world will be able to build from – it's exciting."

Unlocking potential

Current scientific AI tools have primarily been purpose-built and trained using relevant data. "Despite rapid progress of machine learning in recent years in various scientific fields, in almost all cases, machine learning solutions are developed for specific use cases and trained on some very specific data," said co-investigator Francois Lanusse, a cosmologist at the Centre national de la recherche scientifique (CNRS) in France. "This creates boundaries both within and between disciplines, meaning that scientists using AI for their research do not benefit from information that may exist, but in a different format, or in a different field entirely."

"Polymathic AI can show us commonalities and connections between different fields that might have been missed," said co-investigator Siavash Golkar, a guest researcher at the Flatiron Institute's Center for Computational Astrophysics and former postdoctoral researcher at DAMTP. "In previous centuries, some of the most influential scientists were polymaths with a wide-ranging grasp of different fields. This allowed them to see connections that helped them get inspiration for their work. With each scientific domain becoming more and more specialised, it is increasingly challenging to stay at the forefront of multiple fields. I think this is a place where AI can help us, by aggregating information from many disciplines."

"How far we can make these jumps between disciplines is unclear," said Ho. "That's what we want to do — to try and make it happen."

Building a foundation model for science is an ambitious task, but the team is excited by the opportunity it presents. As an initial step, the collaboration has published research on key architectural components, including adapting language models for numerical data and exploring the transferability of surrogate models. In a recent paper, members of the Polymathic AI team demonstrated how a broadly pre-trained AI can match or outperform an AI trained specifically for the complex task of replicating the physics of turbulent fluid flow.

Transparency and openness are a big part of the project. "We want to make everything public," said Ho. "We want to democratise AI for science in such a way that, in a few years, we'll be able to serve a pre-trained model to the community that can help improve scientific analyses across a wide variety of problems and domains."


This article is partly adapted from news items from the University of Cambridge, the Simons Foundation and the Polymathic AI initiative.