Scientists have created a contest that invites anyone to use machine learning—the kind of computing that allows Facebook to spot your friends in photos and Netflix to recommend your next film—to search for the Higgs boson.
More than 1000 individuals have already joined the race. They’re vying for prizes up to $7000, but according to contest organizers, the real winner might be the particle physics community, whose new connections with the world of data science could push them toward new methods of discovery.
The contest works like this: Participants receive data from 800,000 simulated particle collisions from the ATLAS experiment at the Large Hadron Collider. The collisions can be sorted into two groups: those with a Higgs boson and those without.
The data for each collision contains 30 details—including variables such as the energy and direction of the particles coming out of it. Contestants receive all of these details, but only 250,000 of the collisions are labeled “Higgs” or “non-Higgs.”
They must use this labeled fraction to train their algorithms to find patterns that point to the Higgs boson. When they’re ready, they unleash the algorithms on the unlabeled collision data and try to figure out where the Higgs is hiding.
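For readers who want to see the train-then-predict routine in miniature, here is a rough sketch. It is not the contest's data or any contestant's method: the events are synthetic, each has two made-up features instead of 30, and plain logistic regression stands in for whatever algorithm a participant might choose. The function names (`make_events`, `train`, `predict`) are hypothetical.

```python
import math
import random


def make_events(n, rng):
    """Generate toy 'collisions': two features each, plus a true label.

    Assumption: signal and background are Gaussian blobs with different
    centers. Real collision data is far messier.
    """
    events = []
    for _ in range(n):
        is_signal = rng.random() < 0.3
        center = 1.5 if is_signal else -0.5
        features = [rng.gauss(center, 1.0), rng.gauss(center, 1.0)]
        events.append((features, 1 if is_signal else 0))
    return events


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(labeled, epochs=200, lr=0.1):
    """Fit logistic regression by batch gradient descent."""
    n_features = len(labeled[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in labeled:
            score = bias + sum(w * xi for w, xi in zip(weights, x))
            err = sigmoid(score) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / len(labeled) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(labeled)
    return weights, bias


def predict(model, x):
    """Return 1 ('Higgs') or 0 ('non-Higgs') for one event."""
    weights, bias = model
    score = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if sigmoid(score) >= 0.5 else 0


rng = random.Random(0)
labeled = make_events(500, rng)     # the labeled fraction used for training
unlabeled = make_events(1000, rng)  # labels exist but are held back

model = train(labeled)
predictions = [predict(model, x) for x, _ in unlabeled]

# Only the holder of the answer key can compute this.
accuracy = sum(p == y for p, (_, y) in zip(predictions, unlabeled)) / len(unlabeled)
print(f"accuracy on held-back events: {accuracy:.2f}")
```

The point of the sketch is the division of labor: the model only ever sees labels during training, and its guesses on the remaining events are checked against labels it never saw.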
Contestants submit their answers online to Kaggle, a company that holds the answer key. When Kaggle receives a submission, it grades, in real time, just a portion of it—to prevent people from gaming the system—and then places the contestant on its public leaderboard.
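Grading only a portion of each submission can be sketched in a few lines. The example below is an illustration under stated assumptions, not Kaggle's actual procedure: the 20 percent public share, the fixed random seed and the use of simple accuracy as the score are all invented for the example, as are the function names.

```python
import random


def split_indices(n, public_fraction, seed):
    """Split event indices once into a public part and a private part."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    cut = int(n * public_fraction)
    return indices[:cut], indices[cut:]


def accuracy(answer_key, submission, indices):
    """Fraction of the chosen events that the submission labels correctly."""
    return sum(answer_key[i] == submission[i] for i in indices) / len(indices)


rng = random.Random(1)
answer_key = [rng.randint(0, 1) for _ in range(1000)]
# A hypothetical submission that gets roughly 80 percent of events right.
submission = [a if rng.random() < 0.8 else 1 - a for a in answer_key]

public_idx, private_idx = split_indices(len(answer_key), public_fraction=0.2, seed=42)

# Shown on the leaderboard while the contest is running.
public_score = accuracy(answer_key, submission, public_idx)
# Revealed only at the end, computed on events never graded in public.
private_score = accuracy(answer_key, submission, private_idx)

print(f"public: {public_score:.2f}  private: {private_score:.2f}")
```

Because contestants never learn which events count toward the public score, tuning an algorithm to climb the leaderboard does not guarantee a good result on the hidden remainder.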
At the end of the Higgs contest, Kaggle will reveal whose algorithm did the best job analyzing the full dataset. The top three teams will win $7000, $4000 and $3000, respectively. In addition, whoever has the most usable algorithm will be invited to CERN to see the ATLAS detector and discuss machine learning with LHC scientists.
The contest was conceived by a six-person group led by two senior researchers at France’s national scientific research center, CNRS: physicist David Rousseau, who served from 2010 to 2012 as software coordinator for the ATLAS experiment, and machine-learning expert Balázs Kégl, who since 2007 has been looking for ways to bring machine learning into particle physics.
The company running the contest, Kaggle, based in San Francisco, holds such challenges for research institutions and for businesses such as Liberty Mutual, Allstate, Merck, MasterCard and General Electric. Past contests have asked data scientists to foresee the creditworthiness of loan applicants, to predict the toxicity of molecular compounds and to determine the sentiment of lines from movie reviews on the film-rating site Rotten Tomatoes.
Kaggle contests attract a mixed crowd of professional data scientists looking for fresh challenges, grad students and postdocs looking to test their skills, and newbies looking to get their feet wet, says Joyce Noah-Vanhoucke, Kaggle data scientist and head of competitions.
“We’re trying to be the home of data science on the internet,” she says.
Often contestants play for cash, but they have also competed for the chance to interview for data scientist positions at Facebook, Yelp and Walmart.
Kaggle is currently running about 20 contests on its site. Most of them will attract between 300 and 500 teams, Noah-Vanhoucke says. But the Higgs contest, which does not end until September, has already drawn almost 970 teams. Names appear on and drop off the leaderboard every day.
“People love this type of problem,” Noah-Vanhoucke says. “It captures their imagination.”
A couple of the top contenders are physicists, but most come from outside the particle physics community. The team spent about 18 months organizing the contest in the hope that it would create just this kind of crossover, Rousseau says.
“If due to this challenge physicists of the collaboration discover they have a friendly machine learning expert in the lab next door and they try to work together, that’s even better than just getting a new algorithm.”
Machine learning, known in physics circles as multivariate analysis, played a small role in the 2012 discovery of the Higgs. But physics is still about 15 years behind the cutting edge in this area, Kégl says. And machine learning could be just what the field needs.
Until a couple of years ago, the Higgs was the last undiscovered particle of the Standard Model of particle physics.
“Physics is getting to a place where they’ve discovered everything they were looking for,” Kégl says.
Questions still remain, of course. What is dark matter? What is dark energy? Why is gravity so weak? Why is the Higgs so light?
“But the Higgs is a very specific, predicted thing,” Kégl says. “Physicists knew if it had this mass, it would decay in this way.
“Now they’re looking for stuff they don’t know. I’m really interested in methods that can find things that are not modeled yet.”
In 2012, the Google X laboratory programmed 1000 computers to look through 10 million randomly selected thumbnail images from YouTube videos. This was an example of unsupervised machine learning: The computers had no answer key; they weren’t given any goal other than to search for patterns.
And they found them. They grouped photos by categories such as human faces, human bodies—and cats. People teased that Google had created the world’s most complicated cat video detector. But joking aside, it was an impressive example of the ability of a machine to quickly organize massive amounts of data.
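The idea of finding groups without an answer key can be shown on a much smaller scale. The sketch below uses k-means clustering, a far simpler technique swapped in purely for illustration; it is not a description of how the Google system worked. The data points are synthetic, and the function names are hypothetical.

```python
import random


def kmeans(points, k, iterations, rng):
    """Group 2-D points into k clusters without using any labels."""
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Step 1: assign each point to its nearest centroid.
        for idx, (x, y) in enumerate(points):
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            assignments[idx] = distances.index(min(distances))
        # Step 2: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignments


rng = random.Random(0)
# Two unlabeled blobs; the algorithm is never told which point came from which.
points = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(100)]
points += [(rng.gauss(6.0, 1.0), rng.gauss(6.0, 1.0)) for _ in range(100)]
rng.shuffle(points)

centroids, assignments = kmeans(points, k=2, iterations=20, rng=rng)
print(sorted(centroids))
```

Nothing in the input says there are two kinds of points; the grouping emerges from the data alone, which is the sense in which the computers were searching for patterns on their own.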
Physicists already work in a similar way, sorting through huge amounts of information in search of patterns. The clue to their next revolutionary discovery could lie in an almost unnoticeable deviation from the expected. Machine learning could be an important tool in finding it.
Physicists shouldn’t consider this a threat to job security, though. In the case of the Higgs contest, scientists needed to greatly simplify their data to make it possible for algorithms to handle it.
“A new algorithm would be a small piece of a full physics analysis,” Rousseau says. “In the future, physics will not be done by robots.”
He hopes they might help, though. The team is already planning the next competition.