Building the biggest scientific databases
This story first appeared in SLAC Today on September 16, 2009.
The third annual Extremely Large Databases Workshop took place in Lyon, France, in late August 2009—the first year that the workshop has taken place away from its roots at SLAC National Accelerator Laboratory. SLAC database engineer Jacek Becla and his team members Kian-Tat Lim and Daniel Wang organized the event, co-locating it with the 35th Very Large Database research conference. This year's XLDB Workshop focused on reaching out to non-US communities and to scientific communities, such as geoscience, radio astronomy, and biology, that have been underrepresented at past workshops.
Two years ago, Becla established the invitation-only Extremely Large Databases Workshop series to bring together scientific and commercial users of extremely large databases, two groups that Becla said previously had very little contact on this issue. He also invited members of academia and database manufacturers to discuss advances in database technology and, most importantly, let them hear about problems that users are having or features they'd like to see in future database engines.
Previous XLDB workshops led to the creation of a project to build a new open-source database engine, called SciDB, geared specifically toward complex scientific analysis at extremely large scales. When SciDB is released, Becla said, it will revolutionize the way scientific analyses are done. The SciDB project has already attracted more than 20 database professors and engineers world-wide who are collaboratively designing and building the software. These include database giants Michael Stonebraker and David DeWitt, who pioneered database research and helped create technologies such as those used in today's automatic teller machines. The group demonstrated an early prototype of SciDB to several hundred people at the Very Large Database conference and again at the XLDB workshop.
"The system we are building is very different [than commercially manufactured databases]," Becla said. "We finally understand well what science's needs are, and we are building an engine that will fully address these needs, taking advantage of numerous commonalities between how different science domains want to ultimately analyze their data sets."
Today's extremely large systems are measured in petabytes. One petabyte, or one million billion bytes, is non-trivial to manage. As one blogger explained, if one letter of text represented one byte, a petabyte-long line of letters in a typical font size would stretch from the Earth to the Sun ten times. As unfathomable as that number is, science experiments such as BaBar and internet companies such as Google or eBay are already producing petabytes of data, and the software must keep up. Becla's team, which built the BaBar database and is now responsible for the design of the Large Synoptic Survey Telescope database, estimates that the LSST will generate more than one hundred petabytes of data.
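The arithmetic behind the blogger's analogy is easy to check. A minimal sketch in Python (the mean Earth-Sun distance is a standard astronomical value, not a figure from the article):

```python
# If each byte were one printed letter, how wide would each letter
# need to be for a petabyte-long line of text to stretch from the
# Earth to the Sun ten times?

PETABYTE = 10**15          # bytes, i.e. letters in the analogy
EARTH_SUN_M = 1.496e11     # mean Earth-Sun distance in meters

line_length_m = 10 * EARTH_SUN_M          # ten round trips' worth of line
letter_width_mm = line_length_m / PETABYTE * 1000

print(f"{letter_width_mm:.2f} mm per letter")
```

The result is roughly 1.5 mm per letter, which is indeed a typical printed font size, so the analogy holds.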
More and more scientific disciplines are acquiring such massive amounts of data and need software to store, share, and process it, but commercial database manufacturers don't produce systems that can handle these extremely large datasets at reasonable cost and performance levels. As a result, XLDBs are typically built in-house using custom software, making them more expensive to develop and maintain. Becla began the workshop to provide a forum for sharing solutions to these problems.
The XLDB workshops include very few presentations. Instead, attendees come prepared to discuss specific issues in the field in an open forum. The results of this kind of discussion have been tremendously positive, Becla said, and, as the open-source SciDB database system shows, highly productive.