
LHC data at your fingertips

The CMS collaboration has released 300 terabytes of research data.

Photo of LHC data at your fingertips
Courtesy of CERN

Today the CMS collaboration at CERN released more than 300 terabytes (TB) of high-quality open data. These include more than 100 TB of data from proton collisions at 7 TeV, making up half the data collected at the LHC by the CMS detector in 2011. This release follows a previous one from November 2014, which made available around 27 TB of research data collected in 2010.

The data are available on the CERN Open Data Portal and come in two types. The primary datasets are in the same format the collaboration uses for its own research. The derived datasets, on the other hand, require far less computing power and can be readily analyzed by university or high school students.

CMS is also providing simulated data, generated with the same software version that should be used to analyze the primary datasets. Simulations play a crucial role in particle physics research. The data release is accompanied by analysis tools and code examples tailored to the datasets. A virtual machine image based on CernVM, preloaded with the software environment needed to analyze the CMS data, can also be downloaded from the portal.


“Once we’ve exhausted our exploration of the data, we see no reason not to make them available publicly,” says Kati Lassila-Perini, a CMS physicist who leads these data preservation efforts. “The benefits are numerous, from inspiring high school students to the training of the particle physicists of tomorrow. And personally, as CMS’s data preservation coordinator, this is a crucial part of ensuring the long-term availability of our research data.”

The potential of open LHC data has already been demonstrated with the previous release of research data. A group of theorists at MIT wanted to study the substructure of jets (showers of hadron clusters recorded in the CMS detector). Since CMS had not performed this particular research, the theorists got in touch with the CMS scientists for advice on how to proceed. This blossomed into a fruitful collaboration between the theorists and CMS.

“As scientists, we should take the release of data from publicly funded research very seriously,” says Salvatore Rappoccio, a CMS physicist who worked with the MIT theorists. “In addition to showing good stewardship of the funding we have received, it also provides a scientific benefit to our field as a whole. While it is a difficult and daunting task with much left to do, the release of CMS data is a giant step in the right direction.”

Further, a CMS physicist in Germany tasked two undergraduates with validating the CMS Open Data by reproducing key plots from some highly cited CMS papers that used data collected in 2010. Using openly available documentation about CMS’s analysis software and with some guidance from the physicist, the students were able to recreate plots that look nearly identical to those from CMS, demonstrating what can be achieved with these data.

“We are very pleased that we can make all these data publicly available,” adds Lassila-Perini. “We look forward to how they are utilized outside our collaboration, for research as well as for building educational tools.”


A version of this article was originally published on the CMS website.