The Linac Coherent Light Source X-ray laser, located at SLAC, produces a beam of ultrabright X-rays powerful enough to illuminate single molecules, to selectively knock out electrons from the core of atoms, to create superhot and superdense plasmas, and to explore delicate nanocrystals of biological samples.
The LCLS’s high-speed, high-resolution detectors, which collect information on these atomic- and molecular-scale processes and structures, require data storage at an extremely fast rate and enormous scale. An experiment at one LCLS instrument produces an average of about 10 million X-ray images in 48 hours. Larger LCLS experiments generate 150 to 200 terabytes (about 154,000 to 205,000 gigabytes) of data in the same timeframe.
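To put those totals in per-second terms, the figures above can be turned into rough average rates. The sketch below assumes data flows evenly over the full 48 hours, which real experiments will not do, and it uses the 1,024-gigabytes-per-terabyte convention implied by the article's own conversion.

```python
# Rough average rates implied by the figures quoted above.
# Assumption: data arrives evenly over the whole 48-hour window.
SECONDS_PER_HOUR = 3600
GB_PER_TB = 1024  # binary convention; matches "150 TB is about 154,000 GB"


def average_rate(total: float, hours: float) -> float:
    """Return the average amount per second over the given number of hours."""
    return total / (hours * SECONDS_PER_HOUR)


images_per_second = average_rate(10_000_000, 48)
low_gb_per_second = average_rate(150 * GB_PER_TB, 48)
high_gb_per_second = average_rate(200 * GB_PER_TB, 48)

print(f"{images_per_second:.1f} images per second")
print(f"{low_gb_per_second:.2f} to {high_gb_per_second:.2f} GB per second")
```

Under these assumptions, a typical experiment averages close to 58 images per second, and the larger experiments average roughly 0.9 to 1.2 gigabytes per second. Peak rates while the beam is on would be higher than these averages.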
Luckily for LCLS scientists, the particle physics community has been building expertise in extreme data collection for years.
From particle physics to photon science
The LCLS data acquisition system was built by the same group of experts who worked on the data acquisition system for BaBar, a particle physics experiment that took data at SLAC between 1999 and 2008.
The particle physics community is “where we come from,” says Amedeo Perazzo, who leads the Photon Controls and Data Systems Department at SLAC, which manages LCLS data. As a member of the BaBar core data-flow group, Perazzo built a diagnostic tool for the experiment’s parallel-processor computer architecture and was responsible for other enhancements to the BaBar trigger and data acquisition systems.
Igor Gaponenko, a research software developer for LCLS data systems who came to SLAC as a visiting scientist in 1997 to work on the BaBar collaboration, says, “Those of us who came with a heavy background in high-energy physics and its information culture—for us [the LCLS’s unprecedented data rates are] not so surprising.
“That was the whole idea, to transfer the experience of high-energy physics to photon science, because it was anticipated that LCLS would be producing a lot of data.”
LCLS stores its raw data in the same format the BaBar experiment used, called eXtended Tagged Container. The format is fast and easy to extend, and it allows read-back even as a data file is being written, Perazzo says.
Needles and haystacks
Many common threads exist between data demands at LCLS and at particle physics experiments.
Among them are the need to analyze data on the fly and the need to store and analyze multiple petabytes of data.
In particle physics, scientists often look for specific, rare events set against a backdrop of large numbers of background events. It’s like spotting a needle in a haystack. Physicists have devised data acquisition systems that keep only the bales of hay most likely to contain a needle and discard the rest. These automated data storage systems rely on sophisticated software triggers that rapidly and automatically identify the most promising bales.
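The idea of a software trigger can be sketched in a few lines. This is only an illustration of the keep-or-discard principle, not code from BaBar or LCLS: the `Event` class, its `signal_score` field and the threshold value are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Event:
    """Hypothetical detector event; real events carry far more information."""
    event_id: int
    signal_score: float  # assumed measure of how "needle-like" the event looks


def software_trigger(events: Iterable[Event], threshold: float) -> Iterator[Event]:
    """Yield only the events whose score reaches the threshold; drop the rest."""
    for event in events:
        if event.signal_score >= threshold:
            yield event


events = [Event(0, 0.02), Event(1, 0.91), Event(2, 0.15), Event(3, 0.77)]
kept = list(software_trigger(events, threshold=0.5))
print([event.event_id for event in kept])
```

Because the filter processes events one at a time as they arrive, the discarded "hay" never needs to be stored. A real trigger would base its decision on detector quantities rather than a single precomputed score.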
Typically, the data is then processed further so that a subset of information about the hay deemed most interesting is immediately accessible to the community of researchers, while hay bales deemed less interesting may be stored on less-costly tape.
Data for the potentially needle-containing hay bales may be sorted based on a variety of characteristics and distributed across high-speed networks to computer centers across the globe. This ensures that researchers are not all trying to access the same set of data from the same place at the same time. It also allows research centers to grab different portions of the data, sharing the load rather than separately analyzing the same data in its entirety.
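One simple way to picture this kind of distribution is to group records by a characteristic and hand each group to a different center. The sketch below is a toy illustration under assumed names (`category`, `center_1`, `center_2`), not a description of how any particular experiment distributes its data.

```python
from collections import defaultdict


def partition_by_characteristic(records, key):
    """Group records by the value of one characteristic (hypothetical field)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


def assign_to_centers(groups, centers):
    """Spread whole groups across centers in round-robin order."""
    assignment = {center: [] for center in centers}
    for index, name in enumerate(sorted(groups)):
        center = centers[index % len(centers)]
        assignment[center].extend(groups[name])
    return assignment


records = [
    {"id": 1, "category": "A"},
    {"id": 2, "category": "B"},
    {"id": 3, "category": "A"},
    {"id": 4, "category": "C"},
]
centers = ["center_1", "center_2"]  # placeholder names
groups = partition_by_characteristic(records, "category")
assignment = assign_to_centers(groups, centers)

for center, held in assignment.items():
    print(center, [record["id"] for record in held])
```

Each center ends up holding a different portion of the data, so no single site has to serve every request or analyze the full set.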
Perazzo says that data demands will greatly increase at LCLS over the next few years as LCLS moves to more sensitive detectors, more complex experiments, multiple simultaneous experiments and other planned upgrades.
A likely solution will be a data distribution strategy similar to particle physics’ “needle in a haystack” systems, whereby researchers can routinely initiate high-speed transfers of data to other data storehouses around the world, where it can remain accessible for longer periods of time, freeing up LCLS to accept data from new experiments.
“Pretty soon we will be taking a factor of 20 more data than we are taking today,” Perazzo says. “At that point we will not be able to operate in the same way we do now.”
New machine, new demands
While LCLS computing has grown out of particle physics, it’s also breaking new ground.
In high-energy physics, scientists have a common vocabulary and standardized data systems, and individual experiments can run for years. But at LCLS, experiments typically run for only a few days, and scientists need immediate access to their data so they can decide whether to change samples or X-ray energies in the middle of an experiment.
LCLS users want “reliability, flexibility and immediacy,” says Sebastian Carron Montero, an engineering physicist who works on data systems for the Atomic, Molecular and Optical Science instrument at LCLS. “To have all of them at the same time is very demanding. And each one of them is using different tools.”
There is a push to improve the user interface so that LCLS data tools are more accessible to scientists, to offer more real-time data during experiments, to train staff to work more closely with users on learning the data systems, and to continue working toward common data standards.
Perazzo says his department’s system for handling LCLS data, and that of a planned expansion called LCLS-II, is scalable to meet these challenges.
“I do believe we are in the right spot,” Perazzo says. “Other X-ray laser facilities are taking note of what we’ve accomplished.”