A joint Fermilab/SLAC publication

Big data


Big science takes both big data and big cooperation. For the Large Hadron Collider at CERN, storing, analyzing and accessing 25 petabytes of data each year requires a worldwide effort that spans more than 100 institutions in 36 countries. Here’s how it works.

The Large Hadron Collider, the world’s largest particle accelerator, produces a million gigabytes of data every single second. It’s an incredible amount of information—too much for any single institution or computing center to handle.

Fortunately, out of the billions of collisions produced, only a fraction of the data is scientifically interesting enough to keep. Imagine searching for a needle in a haystack. Now imagine searching for a couple of needles in a football field full of haystacks. It takes a while to find the needles, but once you do, there’s no need to keep the hay.

In all, CERN keeps about 25 petabytes (roughly 26,000,000 gigabytes) of data per year for physicists to analyze. Even that is more than a thousand times all of the text available in the U.S. Library of Congress.
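To put those two figures side by side, here is a minimal arithmetic sketch. It uses only the numbers quoted above (a million gigabytes per second produced, 25 petabytes per year kept). The function name and the choice of decimal versus binary units are assumptions made for illustration; the 26,000,000-gigabyte figure is consistent with counting in binary units, where a petabyte is 1,024 × 1,024 gigabytes.

```python
PB_KEPT_PER_YEAR = 25            # figure quoted in the article
RAW_GB_PER_SECOND = 1_000_000    # figure quoted in the article


def petabytes_to_gigabytes(pb, base=1000):
    """Convert petabytes to gigabytes.

    base=1000 gives decimal (SI) units; base=1024 gives binary units.
    """
    return pb * base ** 2


decimal_gb = petabytes_to_gigabytes(PB_KEPT_PER_YEAR)        # 25,000,000
binary_gb = petabytes_to_gigabytes(PB_KEPT_PER_YEAR, 1024)   # 26,214,400

# How many seconds of raw output, at the quoted rate, match a year of kept data?
seconds_equivalent = decimal_gb / RAW_GB_PER_SECOND          # 25.0

print(f"{decimal_gb:,} GB (decimal), {binary_gb:,} GB (binary)")
print(f"A year of kept data equals about {seconds_equivalent:.0f} seconds of raw output")
```

In other words, taking the quoted rates at face value, everything CERN stores in a year amounts to roughly 25 seconds of what the collider produces.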

It’s just not realistic for one facility to house and analyze that much information, so to share the load, CERN outsources some of the data storage and processing to more than 150 computing centers all around the world via the Worldwide LHC Computing Grid.

Deconstruction: Big data

Once each experiment at the LHC decides which collisions are interesting enough to keep, it stores one complete copy of those raw data at CERN, while also dividing the same data among 11 “Tier 1” centers in Asia, Europe and North America. At CERN (“Tier 0”) and at these Tier 1 centers, collision events are reconstructed from the raw data. The reconstructed events are then stored at both the Tier 0 and Tier 1 centers.

The United States is home to two Tier 1 computing centers, with ATLAS experiment data making its way to Brookhaven National Laboratory in Upton, New York, and CMS experiment data to Fermi National Accelerator Laboratory in Batavia, Illinois.

These Tier 1 centers—and the nine others like them around the world—then make the reconstructed data available to about 140 Tier 2 centers, most of which are located at universities.

Researchers at Tier 2 centers can request specific data from their designated Tier 1 center; they then host these smaller amounts of data and provide storage and computing space for physicists who want to analyze that data.

Researchers at the University of Wisconsin, for example, might want to study specific data from the CMS experiment. At the researchers’ request, the data is streamed from Fermilab to the university, where it’s saved and analyzed.
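The flow described above can be sketched as a toy model. This is an illustration only, not the grid's actual software: the `Center` class, the `distribute_raw` and `request_subset` functions, the dataset name, the second Tier 1 site and the round-robin split are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Center:
    """A computing center in the grid; tier is 0, 1 or 2."""
    name: str
    tier: int
    datasets: dict = field(default_factory=dict)  # dataset name -> list of event IDs


def distribute_raw(tier0, tier1_centers, dataset, events):
    """Keep one complete copy at Tier 0 and divide the same events among Tier 1 centers."""
    tier0.datasets[dataset] = list(events)
    n = len(tier1_centers)
    for i, center in enumerate(tier1_centers):
        # Round-robin split: an illustrative assumption, not the grid's real policy.
        center.datasets[dataset] = list(events[i::n])


def request_subset(tier2, tier1, dataset, wanted):
    """Copy only the events a Tier 2 site asks for from its designated Tier 1 center."""
    subset = [e for e in tier1.datasets.get(dataset, []) if wanted(e)]
    tier2.datasets[dataset] = subset
    return subset


cern = Center("CERN", tier=0)
fermilab = Center("Fermilab", tier=1)
other_tier1 = Center("Another Tier 1 (hypothetical)", tier=1)
wisconsin = Center("University of Wisconsin", tier=2)

events = list(range(10))  # stand-in event IDs
distribute_raw(cern, [fermilab, other_tier1], "cms_example", events)

# The university asks Fermilab for just the events it wants to study.
picked = request_subset(wisconsin, fermilab, "cms_example", lambda e: e % 4 == 0)
print(picked)
```

The point of the design is that the full copy never has to leave CERN and the Tier 1 centers; each university holds only the slice its researchers asked for.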

In addition, any scientist who collaborates on an LHC experiment can access the data needed for his or her analysis, and authorize students to access the data as well.

In this way, thousands of researchers—from preeminent physicists to high school students—participate in the data analysis for experiments at the largest scientific laboratory in the world.