‘Machine Scientists’ Distill the Laws of Physics from Raw Data

In 2017, Roger Guimerà and Marta Sales-Pardo discovered a cause of cell division, the process driving the growth of living beings. But they couldn’t immediately reveal how they learned the answer. The researchers hadn’t spotted the crucial pattern in their data themselves. Rather, an unpublished invention of theirs — a digital assistant they called the “machine scientist” — had handed it to them. When writing up the result, Guimerà recalls thinking, “We can’t just say we fed it to an algorithm and this is the answer. No reviewer is going to accept that.”

The duo, who are partners in life as well as research, had teamed up with the biophysicist Xavier Trepat of the Institute for Bioengineering of Catalonia, a former classmate, to identify which factors might trigger cell division. Many biologists believed that division ensues when a cell simply exceeds a certain size, but Trepat suspected there was more to the story. His group specialized in deciphering the nanoscale imprints that herds of cells leave on a soft surface as they jostle for position. Trepat’s team had amassed an exhaustive data set chronicling shapes, forces, and a dozen other cellular characteristics. But testing all the ways these attributes might influence cell division would have taken a lifetime.

Instead, they collaborated with Guimerà and Sales-Pardo to feed the data to the machine scientist. Within minutes it returned a concise equation that predicted when a cell would divide 10 times more accurately than an equation that used only a cell’s size or any other single characteristic. What matters, according to the machine scientist, is the size multiplied by how hard a cell is getting squeezed by its neighbors — a quantity that has units of energy.

“It was able to pick up something that we were not,” said Trepat, who, along with Guimerà, is a member of ICREA, the Catalan Institution for Research and Advanced Studies.

Because the researchers hadn’t yet published anything about the machine scientist, they did a second analysis to cover its tracks. They manually tested hundreds of pairs of variables, “irrespective of … their physical or biological meaning,” as they would later write. By design, this recovered the machine scientist’s answer, which they reported in 2018 in Nature Cell Biology.

Four years later, the approach they went out of their way to conceal is quickly becoming an accepted method of scientific discovery. Sales-Pardo and Guimerà are among a handful of researchers developing the latest generation of tools capable of a process known as symbolic regression.

Symbolic regression algorithms are distinct from deep neural networks, the famous artificial intelligence algorithms that may take in thousands of pixels, let them percolate through a labyrinth of millions of nodes, and output the word “dog” through opaque mechanisms. Symbolic regression similarly identifies relationships in complicated data sets, but it reports the findings in a format human researchers can understand: a short equation. These algorithms resemble supercharged versions of Excel’s curve-fitting function, except they look not just for lines or parabolas to fit a set of data points, but billions of formulas of all sorts. In this way, the machine scientist could give the humans insight into why cells divide, whereas a neural network could only predict when they do.

Researchers have tinkered with such machine scientists for decades, carefully coaxing them into rediscovering textbook laws of nature from crisp data sets arranged to make the patterns pop out. But in recent years the algorithms have grown mature enough to ferret out undiscovered relationships in real data — from how turbulence affects the atmosphere to how dark matter clusters. “No doubt about it,” said Hod Lipson, a roboticist at Columbia University who jump-started the study of symbolic regression 13 years ago. “The whole field is moving forward.”

Rise of the Machine Scientists
Occasionally physicists arrive at grand truths through pure reasoning, as when Albert Einstein intuited the pliability of space and time by imagining a light beam from another light beam’s perspective. More often, though, theories are born from marathon data-crunching sessions. After the 16th-century astronomer Tycho Brahe passed away, Johannes Kepler got his hands on the celestial observations in Brahe’s notebooks. It took Kepler four years to determine that Mars traces an ellipse through space rather than the dozens of other egglike shapes he considered. He followed up this “first law” with two more relationships uncovered through brute-force calculations. These regularities would later point Isaac Newton toward his law of universal gravitation.

The goal of symbolic regression is to speed up such Keplerian trial and error, scanning the countless ways of linking variables with basic mathematical operations to find the equation that most accurately predicts a system’s behavior.

The first program to make significant headway at this, called BACON, was developed in the late 1970s by Patrick Langley, a cognitive scientist and AI researcher then at Carnegie Mellon University. BACON would take in, say, a column of orbital periods and a column of orbital distances for different planets. It would then systematically combine the data in different ways: period divided by distance, period squared times distance, etc. It might stop if it found a constant value, for instance if period squared over distance cubed always gave the same number, which is Kepler’s third law. A constant implied that it had identified two proportional quantities — in this case, period squared and distance cubed. In other words, it stopped when it found an equation.
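To get a feel for that search, here is a minimal BACON-style loop in Python. The planetary values are rounded, and the search is limited to ratios of small integer powers, but it rediscovers the same relationship:

```python
import itertools

# Approximate orbital periods (in years) and mean distances from the sun
# (in astronomical units) for the six classical planets.
periods = [0.241, 0.615, 1.0, 1.881, 11.86, 29.46]
distances = [0.387, 0.723, 1.0, 1.524, 5.203, 9.537]

def is_constant(values, tolerance=0.01):
    """True if every value sits within 1% of the mean."""
    mean = sum(values) / len(values)
    return all(abs(v - mean) / mean < tolerance for v in values)

# Systematically combine the two columns, BACON-style: try ratios
# of small integer powers and stop when one comes out constant.
for p_exp, d_exp in itertools.product(range(1, 4), repeat=2):
    combined = [p**p_exp / d**d_exp for p, d in zip(periods, distances)]
    if is_constant(combined):
        print(f"period^{p_exp} / distance^{d_exp} is constant")
        break
# Prints "period^2 / distance^3 is constant": Kepler's third law.
```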

Despite rediscovering Kepler’s third law and other textbook classics, BACON remained something of a curiosity in an era of limited computing power. Researchers still had to analyze most data sets by hand, or eventually with Excel-like software that found the best fit for a simple data set when given a specific class of equation. The notion that an algorithm could find the correct model for describing any data set lay dormant until 2009, when Lipson and Michael Schmidt, roboticists then at Cornell University, developed an algorithm called Eureqa.

Their main goal had been to build a machine that could boil down expansive data sets with column after column of variables to an equation involving the few variables that actually matter. “The equation might end up having four variables, but you don’t know in advance which ones,” Lipson said. “You throw at it everything and the kitchen sink. Maybe the weather is important. Maybe the number of dentists per square mile is important.”

One persistent hurdle to wrangling numerous variables has been finding an efficient way to guess new equations over and over. Researchers say you also need the flexibility to try out (and recover from) potential dead ends. When the algorithm can jump from a line to a parabola, or add a sinusoidal ripple, its ability to hit as many data points as possible might get worse before it gets better. To overcome this and other challenges, in 1992 the computer scientist John Koza proposed using “genetic algorithms,” which introduce random “mutations” into equations and test the mutant equations against the data. Over many trials, initially useless features either evolve potent functionality or wither away.
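A stripped-down Python sketch of that mutation loop might look like the following. Real genetic programming systems such as Koza’s also maintain whole populations and recombine equations through “crossover”; this toy keeps a single equation and accepts any mutation that fits the data at least as well:

```python
import random

# Toy data secretly generated by y = x**2 + 2*x.
data = [(x, x**2 + 2 * x) for x in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0)]

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

def random_expr(depth=0):
    """Grow a random expression tree over x and a few constants."""
    if depth > 2 or random.random() < 0.3:
        return "x" if random.random() < 0.6 else float(random.choice([1, 2, 3]))
    op = random.choice(list(OPS))
    return (op, random_expr(depth + 1), random_expr(depth + 1))

def evaluate(expr, x):
    if expr == "x":
        return x
    if isinstance(expr, float):
        return expr
    op, left, right = expr
    return OPS[op](evaluate(left, x), evaluate(right, x))

def mutate(expr):
    """Replace a random subtree: the 'mutation' that drives the search."""
    if isinstance(expr, tuple) and random.random() < 0.7:
        op, left, right = expr
        if random.random() < 0.5:
            return (op, mutate(left), right)
        return (op, left, mutate(right))
    return random_expr()

def error(expr):
    return sum((evaluate(expr, x) - y) ** 2 for x, y in data)

best = random_expr()
best_err = error(best)
for _ in range(20000):
    challenger = mutate(best)
    err = error(challenger)
    if err <= best_err:  # keep mutants that fit at least as well
        best, best_err = challenger, err
print(best, best_err)
# With luck this lands on a tree equivalent to x**2 + 2*x; simple hill
# climbing can stall, which is why real systems evolve whole populations.
```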

Lipson and Schmidt took the technique to the next level, ratcheting up the Darwinian pressure by building head-to-head competition into Eureqa. On one side, they bred equations. On the other, they randomized which data points to test the equations on — with the “fittest” points being those which most challenged the equations. “In order to get an arms race, you have to set up two evolving things, not just one,” Lipson said.

The Eureqa algorithm could crunch data sets involving more than a dozen variables. It could successfully recover advanced equations, like those describing the motion of one pendulum hanging from another.

Meanwhile, other researchers were finding tricks for training deep neural networks. By 2011, these were becoming wildly successful at learning to tell dogs from cats and performing countless other complex tasks. But a trained neural network consists of millions of numerically valued “neurons,” which don’t say anything about which features they’ve learned to recognize. For its part, Eureqa could communicate its findings in human-speak: mathematical operations of physical variables.

When Sales-Pardo played with Eureqa for the first time, she was amazed. “I thought it was impossible,” she said. “This is magic. How could these people do it?” She and Guimerà soon began to use Eureqa to build models for their own research on networks, but they felt simultaneously impressed with its power and frustrated with its inconsistency. The algorithm would evolve predictive equations, but then it might overshoot and land on an equation that was too complicated. Or the researchers would slightly tweak their data, and Eureqa would return a completely different formula. Sales-Pardo and Guimerà set out to engineer a new machine scientist from the ground up.

A Degree of Compression
The problem with genetic algorithms, as they saw it, was that they relied too much on the tastes of their creators. Developers need to instruct the algorithm to balance simplicity with accuracy. An equation can always hit more points in a data set by having additional terms. But some outlying points are simply noisy and best ignored. One might define simplicity as the length of the equation, say, and accuracy as how close the curve gets to each point in the data set, but those are just two definitions from a smorgasbord of options.

Sales-Pardo and Guimerà, along with collaborators, drew on expertise in physics and statistics to recast the evolutionary process in terms of a probability framework known as Bayesian theory. They started by downloading all the equations in Wikipedia. They then statistically analyzed those equations to see what types are most common. This allowed them to ensure that the algorithm’s initial guesses would be straightforward — making it more likely to try out a plus sign than a hyperbolic cosine, for instance. The algorithm then generated variations of the equations using a random sampling method that is mathematically proven to explore every nook and cranny in the mathematical landscape.

At each step, the algorithm evaluated candidate equations in terms of how well they could compress a data set. A random smattering of points, for example, can’t be compressed at all; you need to know the position of every dot. But if 1,000 dots fall along a straight line, they can be compressed into just two numbers (the line’s slope and height). The degree of compression, the couple found, gave a unique and unassailable way to compare candidate equations. “You can prove that the correct model is the one that compresses the data the most,” Guimerà said. “There is no arbitrariness here.”
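Their published scoring rule is more sophisticated, but a BIC-style score, a standard statistical proxy for description length, captures the idea: charge each candidate equation for the bits needed to encode its errors plus the bits needed to encode its parameters, then prefer the cheapest total. In the Python sketch below (with invented data), a straight line beats both a constant and a needlessly flexible polynomial:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 1000)
y = 3.0 * x + 1.0 + rng.normal(0, 0.5, x.size)  # 1,000 noisy points on a line

def description_length(residuals, n_params):
    """BIC-style proxy (in nats): cost of encoding what the model gets
    wrong, plus cost of encoding the model's own parameters."""
    n = residuals.size
    return 0.5 * n * np.log(np.mean(residuals**2)) + 0.5 * n_params * np.log(n)

candidates = {
    "constant (1 parameter)": (np.full_like(y, y.mean()), 1),
    "line (2 parameters)": (np.polyval(np.polyfit(x, y, 1), x), 2),
    "degree-5 polynomial (6 parameters)": (np.polyval(np.polyfit(x, y, 5), x), 6),
}
for name, (fit, k) in candidates.items():
    print(name, round(float(description_length(y - fit, k)), 1))
# The line scores lowest: it compresses the data best, so it wins. The
# degree-5 polynomial hugs the noise slightly better but pays more for
# its extra parameters than it earns back.
```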

After years of development — and covert use of their algorithm to figure out what triggers cell division — they and their colleagues described their “Bayesian machine scientist” in Science Advances in 2020.

Oceans of Data
Since then, the researchers have employed the Bayesian machine scientist to improve on the state-of-the-art equation for predicting a country’s energy consumption, while another group has used it to help model percolation through a network. But developers expect that these kinds of algorithms will play an outsize role in biological research like Trepat’s, where scientists are increasingly drowning in data.

Machine scientists are also helping physicists understand systems that span many scales. Physicists typically use one set of equations for atoms and a completely different set for billiard balls, but this piecemeal approach doesn’t work for researchers in a discipline like climate science, where small-scale currents around Manhattan feed into the Atlantic Ocean’s Gulf Stream.

One such researcher is Laure Zanna of New York University. In her work modeling oceanic turbulence, she often finds herself caught between two extremes: Supercomputers can simulate either city-size eddies or intercontinental currents, but not both scales at once. Her job is to help the computers generate a global picture that includes the effects of smaller whirlpools without simulating them directly. Initially, she turned to deep neural networks to extract the overall effect of high-resolution simulations and update coarser simulations accordingly. “They were amazing,” she said. “But I’m a climate physicist” — meaning she wants to understand how the climate works based on a handful of physical principles like pressure and temperature — “so it’s very hard to buy in and be happy with thousands of parameters.”

Then she came across a machine scientist algorithm devised by Steven Brunton, Joshua Proctor and Nathan Kutz, applied mathematicians at the University of Washington. Their algorithm takes an approach known as sparse regression, which is similar in spirit to symbolic regression. Instead of setting up a battle royale among mutating equations, it starts with a library of perhaps a thousand functions like x², x/(x − 1) and sin(x). The algorithm searches the library for a combination of terms that gives the most accurate predictions, deletes the least useful terms, and continues until it’s down to just a handful of terms. The lightning-fast procedure can handle more data than symbolic regression algorithms, at the cost of having less room to explore, since the final equation must be built from library terms.
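The core of that procedure, a loop known as sequentially thresholded least squares, fits the whole library, zeroes out the smallest coefficients and refits on the survivors. A minimal NumPy sketch on an invented data set recovers a hidden two-term law:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 200)
y = 1.5 * x**2 - 0.5 * np.sin(x) + rng.normal(0, 0.01, x.size)  # hidden law

# A small "library" of candidate terms evaluated on the data.
library = {
    "x": x, "x^2": x**2, "x^3": x**3,
    "sin(x)": np.sin(x), "cos(x)": np.cos(x), "exp(x)": np.exp(x),
}
names = list(library)
Theta = np.column_stack([library[n] for n in names])

# Fit everything, delete small coefficients, refit on what's left, repeat.
coefs = np.linalg.lstsq(Theta, y, rcond=None)[0]
for _ in range(10):
    small = np.abs(coefs) < 0.1
    coefs[small] = 0.0
    big = ~small
    if big.any():
        coefs[big] = np.linalg.lstsq(Theta[:, big], y, rcond=None)[0]

print({n: round(float(c), 3) for n, c in zip(names, coefs) if c != 0.0})
# Prints roughly {'x^2': 1.5, 'sin(x)': -0.5}: the two terms that matter.
```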

Zanna re-created the sparse regression algorithm from scratch to get a feel for how it worked, and then applied a modified version to ocean models. When she fed in high-resolution movies and asked the algorithm to look for accurate zoomed-out sketches, it returned a succinct equation involving vorticity and how fluids stretch and shear. When she fed this into her model of large-scale fluid flow, she saw the flow change as a function of energy much more realistically than before.

“The algorithm picked up on additional terms,” Zanna said, producing a “beautiful” equation that “really represents some of the key properties of ocean currents, which are stretching, shearing and [rotating].”

Smarter Together
Other groups are giving machine scientists a boost by melding their strengths with those of deep neural networks.

Miles Cranmer, an astrophysics graduate student at Princeton University, has developed PySR, an open-source symbolic regression algorithm similar to Eureqa. It sets up different populations of equations on digital “islands” and lets the equations that best fit the data periodically migrate and compete with the residents of other islands. Cranmer worked with computer scientists at DeepMind and NYU and astrophysicists at the Flatiron Institute to come up with a hybrid scheme where they first train a neural network to accomplish a task, then ask PySR to find an equation describing what certain parts of the neural network have learned to do.
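PySR is open source, and its Python interface looks roughly like this (the data here are invented, and option names may vary between versions, so consult the PySR documentation):

```python
import numpy as np
from pysr import PySRRegressor

# Toy data: the "unknown" law is y = 2.54 * x0**2 - 0.5 * sin(x1).
X = np.random.uniform(-3, 3, (200, 2))
y = 2.54 * X[:, 0] ** 2 - 0.5 * np.sin(X[:, 1])

model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["sin", "cos", "exp"],
)
model.fit(X, y)       # evolves populations of equations on parallel "islands"
print(model.sympy())  # the best equation found, as a symbolic expression
```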

As an early proof of concept, the group applied the procedure to a dark matter simulation and generated a formula giving the density at the center of a dark matter cloud based on the properties of neighboring clouds. The equation fit the data better than the existing human-designed equation.

In February, they fed their system 30 years’ worth of real positions of the solar system’s planets and moons in the sky. The algorithm skipped Kepler’s laws altogether, directly inferring Newton’s law of gravitation and the masses of the planets and moons to boot. Other groups have recently used PySR to discover equations describing features of particle collisions, an approximation of the volume of a knot, and the way clouds of dark matter sculpt the galaxies at their centers.

Of the growing band of machine scientists (another notable example is “AI Feynman,” created by Max Tegmark and Silviu-Marian Udrescu, physicists at the Massachusetts Institute of Technology), human researchers say the more the merrier. “We really need all these techniques,” Kutz said. “There’s not a single one that’s a magic bullet.”

Kutz believes machine scientists are bringing the field to the cusp of what he calls “GoPro physics,” where researchers will simply point a camera at an event and get back an equation capturing the essence of what’s going on. (Current algorithms still need humans to feed them a laundry list of potentially relevant variables like positions and angles.)

That’s what Lipson has been working on lately. In a December preprint, he and his collaborators described a procedure in which they first trained a deep neural network to take in a few frames of a video and predict the next few frames. The team then reduced how many variables the neural network was allowed to use until its predictions started to fail.

The algorithm was able to figure out how many variables were needed to model both simple systems like a pendulum and complicated setups like the flickering of a campfire — tongues of flames with no obvious variables to track.

“We don’t have names for them,” Lipson said. “They’re like the flaminess of the flame.”
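The PyTorch sketch below shows the variable-counting idea in miniature. It substitutes plain reconstruction for the team’s next-frame prediction and synthetic 64-pixel “frames” for real video, but the principle is the same: shrink the bottleneck until the error jumps.

```python
import torch
import torch.nn as nn

# Synthetic "frames" that secretly depend on just two hidden variables.
torch.manual_seed(0)
state = torch.rand(1000, 2)          # the true state: two numbers per frame
frames = state @ torch.randn(2, 64)  # what the "camera" sees

for bottleneck in [4, 3, 2, 1]:
    model = nn.Sequential(
        nn.Linear(64, 32), nn.ReLU(),
        nn.Linear(32, bottleneck),   # all information must squeeze through here
        nn.Linear(bottleneck, 32), nn.ReLU(),
        nn.Linear(32, 64),
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(3000):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(frames), frames)
        loss.backward()
        opt.step()
    print(f"bottleneck {bottleneck}: error {loss.item():.5f}")

# The error typically stays near zero down to a two-unit bottleneck, then
# jumps once the network is squeezed below the system's true variable count.
```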

The Edge of (Machine) Science
Machine scientists are not about to supplant deep neural networks, which shine in systems that are chaotic or extremely complicated. No one expects to find an equation for catness and dogness.

Yet when it comes to orbiting planets, sloshing fluids and dividing cells, concise equations drawing on a handful of operations are bafflingly accurate. It’s a fact that the Nobel laureate Eugene Wigner called “a wonderful gift we neither understand nor deserve” in his 1960 essay “The Unreasonable Effectiveness of Mathematics in the Natural Sciences.” As Cranmer put it, “If you look at any cheat sheet of equations for a physics exam, they are all extremely simple algebraic expressions, but they perform extremely well.”

Cranmer and colleagues speculate that elementary operations are such overachievers because they represent basic geometric actions in space, making them a natural language for describing reality. Addition moves an object down a number line, and multiplication turns a flat area into a 3D volume. For that reason, they suspect, betting on simplicity makes sense when we’re guessing equations.

The universe’s underlying simplicity can’t guarantee success, though.

Guimerà and Sales-Pardo originally built their mathematically rigorous algorithm because Eureqa would sometimes find wildly different equations for similar inputs. To their dismay, however, they found that even their Bayesian machine scientist sometimes returned multiple equally good models for a given data set.

The reason, the pair recently showed, is baked into the data itself. Using their machine scientist, they explored various data sets and found that they fell into two categories: clean and noisy. In cleaner data, the machine scientist could always find the equation that generated the data. But above a certain noise threshold, it never could. In other words, noisy data could match any number of equations equally well (or badly). And because the researchers have proved probabilistically that their algorithm always finds the best equation, they know that where it fails, no other scientist — be it human or machine — can succeed.

“We’ve discovered that that is a fundamental limitation,” Guimerà said. “For that, we needed the machine scientist.”

Editor’s note: The Flatiron Institute is funded by the Simons Foundation, which also funds this editorially independent publication.

Correction: May 10, 2022

A previous version of this article omitted the names of two coauthors of a sparse regression algorithm developed at the University of Washington.

Correction: May 11, 2022

A word was added to the article to clarify that John Koza proposed using genetic algorithms to generate new equations, rather than inventing genetic algorithms himself.
