The BABAR Dataset Bookkeeping
Introduction to the BABAR Dataset Bookkeeping
The purpose of the BABAR Dataset Bookkeeping is to locate
and manage the available data used for analysis within the BABAR community.
- Find available data to be used for analysis.
- Manage data:
  - Update datasets to include data added by data production.
  - Synchronize the data inventory with the local event store (data storage).
  - Import data to remote tier sites.
Concept of the Dataset Bookkeeping and the Table Schema
The Dataset Bookkeeping is organized around data collections, and most database
tables are used to describe various features of a collection.
Data collections in BABAR are, as the name suggests, either collections of
interactions recorded at the BABAR detector or collections of simulated
interactions (so-called MC data). To stress that each data collection is unique,
and to distinguish the term from the more generic use of the word "collection",
the bookkeeping introduced the term dsentity (dataset entity) to describe
a data collection. In the database table layout, all information directly related
to a dsentity is stored in the bbk_dsentites table. The actual data corresponding
to a collection (dsentity) is stored in one or more data files in
the so-called event store, and the summary information on these files
is stored in the bbk_files table.
[Figure: Database Table Schema]
There are several different types of data collections. Any collection is
the result of processing of data, often in the form of another collection.
For any type of collection, the software release used for processing is an important
attribute, since using a collection in an analysis requires
a compatible release of the analysis software. Data files associated with a
collection are not required to be backward compatible between major releases,
i.e. one may not be able to read new data with old software. It is possible,
however, to read old data with new software, though this is strongly discouraged.
Information on the software release used to create a data collection is stored
in the bbk_releases table. If the collection is based on the reprocessing
of other data collections, information on these collections is stored in
the bbk_dsetodse table. In addition, several other tables describe the
data collections in further detail; they are used to manage datasets and to
let users select the datasets they need.
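
The relationships described above can be sketched with an in-memory SQLite database. This is only an illustration: the four table names are written as they appear in the text, but every column name, the sample rows, and the file paths are assumptions made for the example and do not reflect the real BABAR schema.

```python
import sqlite3

# Table names follow the text; all column names are hypothetical placeholders.
SCHEMA = """
CREATE TABLE bbk_releases (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL UNIQUE
);
CREATE TABLE bbk_dsentites (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE,
    release_id  INTEGER NOT NULL REFERENCES bbk_releases(id)
);
CREATE TABLE bbk_files (
    id           INTEGER PRIMARY KEY,
    dsentity_id  INTEGER NOT NULL REFERENCES bbk_dsentites(id),
    path         TEXT NOT NULL,
    events       INTEGER NOT NULL
);
CREATE TABLE bbk_dsetodse (
    child_id   INTEGER NOT NULL REFERENCES bbk_dsentites(id),
    parent_id  INTEGER NOT NULL REFERENCES bbk_dsentites(id),
    PRIMARY KEY (child_id, parent_id)
);
"""


def build_demo_db():
    """Create the sketch schema and fill it with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO bbk_releases (id, name) VALUES (1, 'release-A')")
    conn.executemany(
        "INSERT INTO bbk_dsentites (id, name, release_id) VALUES (?, ?, ?)",
        [(1, "raw-run-1", 1), (2, "skim-run-1", 1)],
    )
    conn.executemany(
        "INSERT INTO bbk_files (dsentity_id, path, events) VALUES (?, ?, ?)",
        [(2, "/store/skim-run-1/file0", 1000),
         (2, "/store/skim-run-1/file1", 500)],
    )
    # skim-run-1 was produced by reprocessing raw-run-1.
    conn.execute("INSERT INTO bbk_dsetodse (child_id, parent_id) VALUES (2, 1)")
    conn.commit()
    return conn


def describe(conn, name):
    """Return (release, file count, total events, parent names) for a dsentity."""
    release, nfiles, events = conn.execute(
        """SELECT r.name, COUNT(f.id), COALESCE(SUM(f.events), 0)
           FROM bbk_dsentites d
           JOIN bbk_releases r ON r.id = d.release_id
           LEFT JOIN bbk_files f ON f.dsentity_id = d.id
           WHERE d.name = ?
           GROUP BY d.id, r.name""",
        (name,),
    ).fetchone()
    parents = [row[0] for row in conn.execute(
        """SELECT p.name
           FROM bbk_dsetodse l
           JOIN bbk_dsentites c ON c.id = l.child_id
           JOIN bbk_dsentites p ON p.id = l.parent_id
           WHERE c.name = ?
           ORDER BY p.name""",
        (name,),
    )]
    return release, nfiles, events, parents
```

Calling `describe(build_demo_db(), "skim-run-1")` gathers in one place what the text spreads over four tables: the release, the file summary, and the collections the dsentity was derived from.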
A truly innovative feature of the BABAR bookkeeping is that it combines
the conventional use of a relational database with organizing the data
in the same table schema. In simple terms, the BABAR bookkeeping combines the
collecting of data and what one might call data mining in one
entity. This arose out of the shortcomings of the old bookkeeping schema, which
consisted only of the relational
database. At some point user queries to find data for analyses began to fail
because of the combination of complex queries and the sheer volume of data stored in
the database, and users often had to wait several hours to get a list of data
collections. It was found, however, that most of these queries were very similar:
each asked for a unique list of data collections processed with a software release
that was compatible with an analysis release. So why not organize the data
by running a query once and storing the results in the database, instead
of having each user run an individual query? That led to the introduction
of the so-called datasets.
Datasets are simply unique lists of data collections (dsentities). In
the database schema they consist of two tables, the dataset table (bbk_dataset)
and an association table (bbk_dstodse) which keeps track of the collections
belonging to each dataset. Datasets are dynamic objects, i.e. the list of
collections stored in a dataset changes as new data collections are added and
existing data collections are replaced or found to be faulty. Users looking for
data for their analyses now select collections through these datasets. Not only
does this relieve the user of the 'art' of having to keep in mind all the
attributes needed to find their data, it also reduces the load on the database.
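
The idea of running a selection once and storing its result can be sketched as follows. This is a minimal illustration under stated assumptions, not the bookkeeping's actual implementation: bbk_dataset and bbk_dstodse are the table names from the text, while the simplified `collections` table, all column names, the `good` flag, and the sample values are hypothetical.

```python
import sqlite3


def setup():
    """Build a tiny in-memory database with made-up collections."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE collections (
            id INTEGER PRIMARY KEY, name TEXT UNIQUE,
            sw_release TEXT, good INTEGER);
        CREATE TABLE bbk_dataset (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE bbk_dstodse (
            dataset_id INTEGER, collection_id INTEGER,
            PRIMARY KEY (dataset_id, collection_id));
    """)
    conn.executemany(
        "INSERT INTO collections (name, sw_release, good) VALUES (?, ?, 1)",
        [("run1", "rel-A"), ("run2", "rel-A"), ("run3", "rel-B")],
    )
    conn.commit()
    return conn


def refresh_dataset(conn, dataset, sw_release):
    """Run the selection once and store its result as the dataset's members."""
    with conn:  # one transaction: readers never see a half-updated dataset
        conn.execute(
            "INSERT OR IGNORE INTO bbk_dataset (name) VALUES (?)", (dataset,))
        (ds_id,) = conn.execute(
            "SELECT id FROM bbk_dataset WHERE name = ?", (dataset,)).fetchone()
        conn.execute("DELETE FROM bbk_dstodse WHERE dataset_id = ?", (ds_id,))
        conn.execute(
            """INSERT INTO bbk_dstodse (dataset_id, collection_id)
               SELECT ?, id FROM collections
               WHERE sw_release = ? AND good = 1""",
            (ds_id, sw_release),
        )


def collections_in(conn, dataset):
    """Cheap user-side lookup: read the stored list instead of re-selecting."""
    rows = conn.execute(
        """SELECT c.name
           FROM bbk_dataset d
           JOIN bbk_dstodse m ON m.dataset_id = d.id
           JOIN collections c ON c.id = m.collection_id
           WHERE d.name = ?
           ORDER BY c.name""",
        (dataset,),
    )
    return [r[0] for r in rows]
```

In this sketch users only ever call `collections_in`, which is a plain lookup by dataset name; the selection logic runs in `refresh_dataset` whenever collections are added or marked faulty, which is what makes the dataset a dynamic object.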
Finally, the bookkeeping database is used for distributing data to the various
BABAR Tier Sites, with the master database located at SLAC (Stanford Linear
Accelerator Center). Local mirrors of the database are fully supported, and software
for synchronizing the mirror databases is provided. Additional software is
provided for data distribution and data management; it allows the import of
individual data collections as well as complete datasets without exposing the
underlying details of the data transfer.
The BABAR Bookkeeping Tools
The bookkeeping tools consist of a large set of utilities written in OO-Perl
to access and manage the information in the bookkeeping database.
The raw information stored in the bookkeeping database is often not very
useful for users. Sure, it is important to know just how much data is available,
but the names of the individual data collections and the number of
events in each collection are not very interesting. Instead, the bookkeeping
prepares configuration files for each analysis job to be run. These files specify
the collections, and the events within these collections, that are to be processed.
In addition, the bookkeeping tools manage the creation and updating of the datasets,
the insertion of new collections into the database, and the synchronization with
data storage, to name a few tasks.
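
As a rough illustration of how a list of collections might be turned into per-job configurations, the sketch below splits collections into jobs with a fixed maximum number of events. The function name `plan_jobs`, the splitting policy, and the (collection, first event, number of events) layout are all assumptions made for this example; the text does not describe the actual format of the configuration files.

```python
def plan_jobs(collections, events_per_job):
    """Split (name, n_events) pairs into jobs of at most events_per_job events.

    Each job is a list of (collection, first_event, n_events) slices, i.e. the
    collections and the events within them that the job should process.
    """
    if events_per_job <= 0:
        raise ValueError("events_per_job must be positive")

    jobs = []
    current = []           # slices assigned to the job being filled
    room = events_per_job  # events that still fit into the current job

    for name, total in collections:
        first = 0
        while first < total:
            take = min(room, total - first)
            current.append((name, first, take))
            first += take
            room -= take
            if room == 0:  # job is full, start a new one
                jobs.append(current)
                current, room = [], events_per_job

    if current:            # keep the last, partially filled job
        jobs.append(current)
    return jobs
```

For example, `plan_jobs([("A", 250), ("B", 100)], 200)` produces two jobs: the first covers events 0 to 199 of A, the second covers the remaining 50 events of A followed by all 100 events of B.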
For more on the BABAR bookkeeping see also the contribution
to Computing in High Energy and Nuclear Physics (CHEP), 2004.