Introduction to the BABAR Dataset Bookkeeping

The purpose of the BABAR Dataset Bookkeeping is to locate and manage the available data used for analysis within the BABAR community. This includes:


Concept of the Dataset Bookkeeping and the Table Schema

The Dataset Bookkeeping is organized around data collections, and most database tables are used to describe various features of a collection.

Data collections in BABAR are, as the name suggests, collections of either interactions recorded at the BABAR detector or simulated interactions (so-called MC data). To emphasize that each data collection is unique, and to distinguish the term from the more generic use of the word "collection", the bookkeeping introduced the term dsentity (dataset entity) to describe a data collection. In the database table layout, all information directly related to a dsentity is stored in the bbk_dsentites table. The actual data corresponding to a collection (dsentity) is stored in one or more data files in the so-called event store, and the summary information on these files is stored in the bbk_files table.
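The relationship between a dsentity and its files can be sketched with a small in-memory SQLite database. This is only an illustration: the two table names are taken from the text above, while every column name, path and value is an assumption made for the example and does not reflect the real BABAR schema.

```python
import sqlite3

# Table names follow the text (spelled as given there); all columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bbk_dsentites (
    id    INTEGER PRIMARY KEY,
    name  TEXT UNIQUE NOT NULL,   -- one row per data collection (dsentity)
    is_mc INTEGER NOT NULL        -- 0 = detector data, 1 = simulated (MC)
);
CREATE TABLE bbk_files (
    id       INTEGER PRIMARY KEY,
    dse_id   INTEGER NOT NULL REFERENCES bbk_dsentites(id),
    path     TEXT NOT NULL,       -- location of the file in the event store
    n_events INTEGER NOT NULL     -- summary information about the file
);
""")

# Made-up example rows: one collection stored in two files.
conn.execute(
    "INSERT INTO bbk_dsentites (id, name, is_mc) VALUES (1, '/example/collection/A', 0)"
)
conn.executemany(
    "INSERT INTO bbk_files (dse_id, path, n_events) VALUES (?, ?, ?)",
    [(1, "/store/example/A.01", 1200), (1, "/store/example/A.02", 800)],
)


def collection_summary(conn, name):
    """Return (name, number of files, total events) or None if unknown."""
    return conn.execute(
        """
        SELECT d.name, COUNT(f.id), COALESCE(SUM(f.n_events), 0)
        FROM bbk_dsentites d
        LEFT JOIN bbk_files f ON f.dse_id = d.id
        WHERE d.name = ?
        GROUP BY d.id
        """,
        (name,),
    ).fetchone()


summary = collection_summary(conn, "/example/collection/A")
print(summary)
```

The one-to-many link from bbk_files back to the collection is the only structural point the sketch is meant to convey.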

Database Table Schema

There are several different types of data collections. Any collection is the result of processing data, often in the form of another collection. For any type of collection, the software release used for processing is an important attribute, since the use of a collection in an analysis requires a compatible release of the analysis software. Data files associated with a collection are not required to be backward compatible between major releases, i.e. one may not be able to read new data with old software. It is possible, however, to read old data with new software, though this is strongly discouraged. Information on the software release used to create a data collection is stored in the bbk_releases table. If the collection is based on the reprocessing of other data collections, information on these collections is stored in the bbk_dsetodse table. In addition, several other tables describe the data collections in further detail and are used to manage and select datasets.
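As a rough illustration of how a collection-to-collection table can record what a collection was derived from, the following sketch walks the chain of parent collections with a recursive query. The table names come from the text; the column names (child_id, parent_id) and the collection names are assumptions invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bbk_dsentites (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
-- Hypothetical layout: one row per (derived collection, input collection) pair.
CREATE TABLE bbk_dsetodse (
    child_id  INTEGER NOT NULL REFERENCES bbk_dsentites(id),
    parent_id INTEGER NOT NULL REFERENCES bbk_dsentites(id),
    PRIMARY KEY (child_id, parent_id)
);
""")

# Made-up chain: raw -> reco -> skim.
conn.executemany(
    "INSERT INTO bbk_dsentites (id, name) VALUES (?, ?)",
    [(1, "/example/raw"), (2, "/example/reco"), (3, "/example/skim")],
)
conn.executemany(
    "INSERT INTO bbk_dsetodse (child_id, parent_id) VALUES (?, ?)",
    [(2, 1), (3, 2)],
)


def ancestors(conn, name):
    """Return the names of all collections the given one was derived from."""
    rows = conn.execute(
        """
        WITH RECURSIVE lineage(id) AS (
            SELECT parent_id FROM bbk_dsetodse
            WHERE child_id = (SELECT id FROM bbk_dsentites WHERE name = ?)
            UNION
            SELECT l.parent_id
            FROM bbk_dsetodse l JOIN lineage ON l.child_id = lineage.id
        )
        SELECT d.name
        FROM lineage JOIN bbk_dsentites d ON d.id = lineage.id
        ORDER BY d.name
        """,
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


skim_parents = ancestors(conn, "/example/skim")
print(skim_parents)
```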

A truly innovative feature of the BABAR bookkeeping is that it combines the conventional use of a relational database with the organization of the data inside the same table schema. In simple terms, the BABAR bookkeeping combines the collecting of data and, if one wants to call it that, data mining in one entity. This arose out of the shortcomings of the old bookkeeping schema, which consisted only of the relational database. At some point user queries to find data for analyses started to fail because of the combination of complex queries and the sheer volume of data stored in the database, and users often had to wait several hours to get a list of data collections. It was found, however, that most of these queries were very similar: each asked for a unique list of data collections processed with a software release that was compatible with an analysis release. So why not organize the data by running such a query once and storing the results in the database, instead of having each user run an individual query? That led to the introduction of the so-called datasets.

Datasets are simply unique lists of data collections (dsentities). In the database schema they consist of two tables: the dataset table (bbk_dataset) and an association table (bbk_dstodse), which keeps track of the collections belonging to each dataset. Datasets are dynamic objects, i.e. the list of collections stored in a dataset changes as new data collections are added and existing data collections are replaced or found to be faulty. Users looking for data for their analyses now select collections through these datasets. Not only does this relieve the user of the 'art' of keeping in mind all the attributes needed to find their data, it also reduces the load on the database significantly.
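The idea of running the selection once and storing its result can be sketched as follows. The four table names appear in the text; the columns, the release and collection names, and the refresh_dataset helper are all assumptions made for this example and are not the actual bookkeeping implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bbk_releases (
    id INTEGER PRIMARY KEY, name TEXT NOT NULL, major INTEGER NOT NULL
);
CREATE TABLE bbk_dsentites (
    id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL,
    release_id INTEGER NOT NULL REFERENCES bbk_releases(id),
    is_good INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE bbk_dataset (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE bbk_dstodse (
    ds_id  INTEGER NOT NULL REFERENCES bbk_dataset(id),
    dse_id INTEGER NOT NULL REFERENCES bbk_dsentites(id),
    PRIMARY KEY (ds_id, dse_id)
);
INSERT INTO bbk_releases VALUES (1, 'release-A', 1), (2, 'release-B', 2);
INSERT INTO bbk_dsentites (name, release_id) VALUES
    ('/example/c1', 1), ('/example/c2', 2), ('/example/c3', 2);
""")


def refresh_dataset(conn, dataset_name, major):
    """Run the selection once and store the matching collections."""
    with conn:  # one transaction: readers never see a half-built list
        conn.execute(
            "INSERT OR IGNORE INTO bbk_dataset (name) VALUES (?)", (dataset_name,)
        )
        (ds_id,) = conn.execute(
            "SELECT id FROM bbk_dataset WHERE name = ?", (dataset_name,)
        ).fetchone()
        conn.execute("DELETE FROM bbk_dstodse WHERE ds_id = ?", (ds_id,))
        conn.execute(
            """
            INSERT INTO bbk_dstodse (ds_id, dse_id)
            SELECT ?, d.id
            FROM bbk_dsentites d JOIN bbk_releases r ON r.id = d.release_id
            WHERE r.major = ? AND d.is_good = 1
            """,
            (ds_id, major),
        )


def collections_in(conn, dataset_name):
    """What a user does: a cheap lookup over the stored list."""
    rows = conn.execute(
        """
        SELECT d.name
        FROM bbk_dataset s
        JOIN bbk_dstodse a ON a.ds_id = s.id
        JOIN bbk_dsentites d ON d.id = a.dse_id
        WHERE s.name = ?
        ORDER BY d.name
        """,
        (dataset_name,),
    ).fetchall()
    return [row[0] for row in rows]


refresh_dataset(conn, "example-dataset", 2)
before = collections_in(conn, "example-dataset")

# The dataset is dynamic: one collection turns out faulty, a new one arrives.
conn.execute("UPDATE bbk_dsentites SET is_good = 0 WHERE name = '/example/c3'")
conn.execute("INSERT INTO bbk_dsentites (name, release_id) VALUES ('/example/c4', 2)")
refresh_dataset(conn, "example-dataset", 2)
after = collections_in(conn, "example-dataset")
print(before, after)
```

In this sketch the dataset is rebuilt from scratch on each refresh for simplicity; an incremental update would serve equally well to illustrate the point that users only ever read the stored association.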

Finally, the bookkeeping database is used for distributing data to the various BABAR Tier Sites, with the master database located at SLAC (Stanford Linear Accelerator Center). Local mirrors of the database are fully supported, and software for synchronizing the mirror databases is provided. Additional software is provided for data distribution and data management; it allows the import of individual data collections as well as complete datasets without exposing the underlying details of the data transfer.

The BABAR Bookkeeping Tools

The bookkeeping tools consist of a large set of utilities written in OO-Perl to access and manage the information in the bookkeeping database.

The raw information stored in the bookkeeping database is often not very useful for users. Sure, it is important to know just how much data is available, but the names of the individual data collections and the number of events in each collection are not very interesting in themselves. Instead, the bookkeeping prepares configuration files for each analysis job to be run. These files specify the collections, and the events within these collections, that the job is meant to process.
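How a list of collections might be turned into per-job configuration files can be sketched like this. Both the splitting rule (a fixed maximum number of events per job) and the file format are assumptions invented for the illustration; the text does not describe the real format produced by the bookkeeping tools.

```python
def split_into_jobs(collections, max_events_per_job):
    """Split (name, total_events) pairs into jobs of at most max_events_per_job.

    Each job is a list of (collection name, first event, event count) slices.
    """
    if max_events_per_job <= 0:
        raise ValueError("max_events_per_job must be positive")
    jobs, current, room = [], [], max_events_per_job
    for name, total in collections:
        first = 0
        while first < total:
            take = min(room, total - first)
            current.append((name, first, take))
            first += take
            room -= take
            if room == 0:  # job is full, start the next one
                jobs.append(current)
                current, room = [], max_events_per_job
    if current:
        jobs.append(current)
    return jobs


def render_config(job_slices):
    """Render one job in a made-up plain-text configuration format."""
    lines = ["# hypothetical job configuration"]
    for name, first, count in job_slices:
        lines.append(f"input {name} first={first} count={count}")
    return "\n".join(lines) + "\n"


# Made-up collections and event counts.
jobs = split_into_jobs(
    [("/example/collection/A", 2000), ("/example/collection/B", 500)],
    max_events_per_job=1500,
)
for number, job in enumerate(jobs, start=1):
    print(f"--- job {number} ---")
    print(render_config(job), end="")
```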

In addition, the bookkeeping tools manage the creation and updating of the datasets, the insertion of new collections into the database, and the synchronization with data storage, to name just a few tasks.

For more on the BABAR bookkeeping see also the contribution to Computing in High Energy and Nuclear Physics (CHEP), 2004.