IceCube
IceCube: Cracking the Cosmic Code
PDD - System Elements

Preliminary Design Document

By Section | Whole Paper | PDF (233 pages, ~6.62mb)

8 Data Handling

8.1 System Elements

IceCube data is similar to that of other high-energy and astrophysics experiments. In particular, the concept of uncorrelated triggered events allows us to reuse many of the tools and concepts of other projects. More generally, the IceCube data handling system can be built with current technology. The main elements of the Data Handling system are:

These elements are discussed in the following subsections.

8.1.1 Software Management

The complexity of IceCube software and the dispersed nature of the IceCube collaboration strongly suggest the use of standard software management practices to organize communication, manpower, money, and project reviews. Industry software project management practices will be followed to implement the IceCube software.

Project-wide software management has several broad areas including oversight of:

A common development environment and software architecture will be most efficient for both implementation and long-term maintenance.

Interfaces will be defined between subsystems and testing procedures established. Software professionals will be employed to build the database and network programs. Physicists will likely implement many of the procedures. To ensure robust operation, software consultants will be involved at every stage–the design phase, a construction or development phase, and finally a maintenance phase–and perform regular reviews of the code. The data will be stored for many years, and the analysis is anticipated to continue for up to 15 years. Long-term software maintenance must be considered.

8.1.2 System Engineering

System engineering involves identifying the most logical set of subsystems and specifying completely all interfaces between them. In broad strokes this defines the structure of the software project. Subsystem specification will then occur in parallel by people with appropriate software skills. The specifications will be reviewed to ensure overall system integrity.

A unified architecture minimizes the number of people needed for development and more importantly, long term maintenance. The main cost of architecture is in the initial stage of system engineering. Common interfaces will be used by all subsystems. Single implementations are easier to maintain and more flexible to replacement. In long-term projects, flexible application program interfaces (APIs) can be re-implemented as commercial products become old and better options become available.

8.1.3 Development Environment

A development environment is used to support the software development. Design tools, debuggers, etc. are needed in all stages of software development. In addition, both code and documentation need to be added, removed, and modified in an organized fashion by all developers. It is especially important for developers in different subsystems to be able to discuss the meaning of the specification and modify it as needed in a rational system. To save in overhead a common development environment will be used to implement IceCube software.

Design packages such as Rational Rose [162] or SourceForge [163] provide visual modeling tools which are useful for creating a blueprint of the software architecture, a critical first step in the design of a complex software system. We will evaluate such tools to determine which, if any, might suit our needs. If we opt for a tool that the collaboration is not familiar with, it will be important to provide training in its use. The freeware GNU debugger GDB, the GNU compilers for FORTRAN, C and C++, the source code manager CVS and the release management and distribution system based on RPMs are all standard in the physics community, currently in use in AMANDA, and meet our requirements for ease of use and effectiveness. Tracking and distribution of important design and technical documents can be accomplished with existing packages, such as the commercial EDMS system used at CERN [164].

8.1.4 Analysis Framework

The framework is used to define how users interact with pieces of data. It is a set of rules and methods for selecting data, applying algorithms and adding new information to the dataset. Several frameworks are under consideration including the CERN packages Gaudi, LHC++ and Root [165]. Importantly, these packages will be supported over the lifetime of IceCube.

Implementation of the framework for IceCube will include a flexible API to event, calibration and monitor data. Time correlation of these types of data will also be provided by the framework. For example, this allows an analyzer to correlate measurements of the South Pole environment with physics data. Data objects are stored in memory so that all algorithms act on the same data. The input data is protected from corruption by making copies of it as deemed necessary.

The framework also supports an arbitrary number of modular algorithms through a standard API. Algorithms may add or remove data from the data stream. A standard set of options is made available to select needed operational modes.

A users manual including documentation to the APIs is provided to ensure optimal user access to the data. Except for visualization components, all components support execution via batch mode.

8.1.5 Database

Organizational elements in the data sample are supported by one or several databases. Calibration, file/stream organization, and possibly event data are organized using database tools. The calibration and monitoring database contains all the information needed for reconstruction, simulation and analysis (like geometry, PMT gain, and OM status) but also slow control information, hardware configuration, trigger status, and an electronic logbook. This database requires and is optimized for fast access through APIs by the software on the basis of the event time stamp. Quantities needed, for instance to define an OM status and calibration, are defined by a starting time and a validity time range. Its size, which depends on system stability and our chosen time granularity, is small enough to work on local copies regularly distributed to slave nodes. Access to the master copy is, of course, write-restricted. Multiple calibrations can be performed and checked with possibility for rollback.

A similar approach has been successfully developed and implemented for AMANDA-II using MySQL. Relational databases are well-suited for this kind of conditional (temporal) database. Many database managers using SQL exist and are well-tested. Commercial products like Oracle provide in addition a high level of security, advanced developing and managing tools and permit object-oriented interfaces to user software.

The file/stream organization database contains the complex data sample/stream organization. The large number of chunks into which data are sliced requires a database (most probably relational) to speed up random access of events. Placing the full sample of events in a large database is under consideration. For this kind of application, object oriented database management systems (ODBMS) like Objectivity are probably best suited and are used by large HEP experiments in the US (Babar) and Europe (the current choice of all LHC experiments at CERN). However, one must keep in mind that IceCube has a large number of events of small size and a simple event and detector structure. As a consequence, requirements are quite different from those needed for accelerator/general purpose detectors. ANTARES, which is quite similar to IceCube, uses Oracle, a relational database management system (RDBMS), instead of Objectivity. Yet another possible option is one of the object-relational database management systems (ORDBMS), like Oracle 9i.

The strong points of the ODBMS approach are: the database mirrors the structure of events and objects in the software, links and relations between objects are easier to modify, a lot of tools have already been developed by the HEP community, the user does not need to know where the data are physically located, and physics is done by event tagging instead of the usual miniDST/Ntuple strategy. On the other hand, there are also strong drawbacks to an approach that relies heavily on network performance and introduces large overheads which do not scale with the database size. Large-scale realistic tests are necessary to check that for IceCube where simple, well-tested solutions exist for event handling, this kind of ODBMS solution would confer a true benefit.

The database is designed so that it can be interfaced to and used for online, offline and Monte Carlo reconstruction and analysis. In any case, the physical organization of the database (internal tables, physical location on disk/tape) is designed to optimize data access. Graphical user interfaces and histogramming facilities to display and study database contents are provided. An adapter layer of software is used to plug the database into the analysis framework. Improvements in commercial databases and the support of a few carefully chosen file formats are expected to result in several database implementations and adapters. The framework and user interface are isolated from changes in the details of the database implementation via the adapters.

Benchmark tests were performed (using Linux on a 700 MHz PII and using TrueUnix on a 500 MHz Alpha) using a simple relational data base manager like MySQL with a 300 MB conditional (temporal) data base (corresponding to about six years of IceCube calibration/monitoring). Mean access time, based on a user key like "instant of validity" ranged from 0.5 to 5 ms for random search on randomly generated data depending on the cache state, demonstrating the feasibility of this approach. If it is decided that the full event sample is also stored in a database, the size of the sample may reach many terabytes (even after selection of a given physics stream). Even if careful organization of the data structure can keep the space overhead small, time overhead and data transfer time can become very large. Since database managers like Objectivity do not scale simply with the size of the data sample, large scale tests have to be performed before adopting such a solution.

Projected sizes of the individual database components are as follows:

8.1.6 Visualization

Software tools will be provided to display the data in several forms:

8.1.7 Development Interfaces

Collaborators are anticipated to interact with the software at some level either as algorithm/filter developers or as users of the visualization packages. Developer physicists will have access to user manuals and local experts to aid in their work. Visualization packages will have user manuals, and commercial products will be used whenever possible to minimize the amount of specialized knowledge needed to work with the data.

8.1.8 Integration at Pole

Operation of the detector at the pole requires all processes be directed by an experimental control process. The online filters, trigger, disk, tape and satellite bandwidth need to be implemented within this system. Embedding the framework modules and data access (not necessarily natively) into the South Pole system is crucial to the smooth operation of the detector.

8.1.9 Hardware

Computers and storage are needed both at the Pole and in the northern hemisphere. A cluster of roughly 50 CPUs (with 1 GHz processors) is needed to satisfy the peak demand for quick turnaround post-deployment calibrations, and the steady-state demand for reconstructing and filtering 1.5 kHz of events. This cluster provides sufficient peak processing power to keep pace with deployments, and allows an ample 33 ms/event of processing to reduce the data by a factor of 30 down to 50 Hz. Several additional machines are needed at the Pole for satellite transfer, tape backup and disk cleanup. Four terabytes of disk is needed at the Pole for storage of thirty days' worth of raw files. A total of 500 tapes (at 100 GB/tape) is needed to store the full raw data sample each year, and 50 additional tapes for backing up the filtered data. In the northern hemisphere, the filtered and raw data are stored for > 15 years at 2 TB/year and 50 TB/year, respectively. A disk archive of 60 TB will be needed to cache 15 years' worth of filtered and simulated data. Several clusters totaling 300 CPUs (with 1 GHz processors) are needed to support the offline analysis. This cluster is used for data reorganization for the archive long-term access, additional reconstructions and filters, and simulations of background and signals.

The data from the Digital Optical Modules (DOMs) is transferred to standard 100BaseT network hardware once reaching the surface. From this point on requirements for computing, networking, and storage can be met by commercial hardware. The design of data flow in the hardware is modular, allowing for scalability and maximizing robustness. It is possible to make estimates of upper limits on data flow rates, and it is found that presently available commercial hardware will be adequate.

Once the data signals from the OMs are being transferred to a standard network, a cascading design of switches with redundant paths will be used. At 10 Tb/day transfer rate current 1 Gb Ethernet has orders of magnitude higher capacity than what we anticipate IceCube will require on average (tested using the NFS protocol).

The operating conditions at the South Pole dictate that certain measures are taken to ensure performance equal to a more standard working environment. The IceCube system is operated on UPS (Uninteruptable Power Supply) systems. This protects against transient power anomalies, and provides a grace period for controlled system shutdown if a power outage lasts longer than a user-specified period of time. Due to the high altitude at the South Pole, the effectiveness of air cooling systems is reduced. Most computer components are not specified to operate at this altitude. Also the very dry environment results in large static build up. Experience has shown that by using more aggressive cooling systems, careful anti-static handling techniques, and routine discharging by numerous grounding strips, the same level of performance can be achieved as in a normal working environment.

Because of the physical isolation of South Pole for almost 3/4 of the year, and the limited personnel available during this period, the standardization of hardware with the support contractor to NSF is essential to obtaining the maximum level of mutual technical support, and efficiency in stocking spare components. This attitude has already been adopted by AMANDA and has been shown to provide many advantages not only to the science project, but South Pole support crew.

Storage The choices for large data storage are expanding rapidly with options such as Snap file servers, IDE RAID, FireWire disks arrays, and the more traditional SCSI RAID disk arrays. The system most suitable for IceCube is the SCSI RAID array. This system provides adequate storage size, reliability, and transfer speeds. A major advantage is that the main components, the SCSI disks, are standard throughout the whole systems. With the introduction of the Ultra 160 72 GB disks, arrays of 8 disks giving an overall size of 500 GB (access rates greater than 10 MB per sec through an NFS mount), or 4 days of uncompressed raw data. Two such arrays would be adequate for storage of raw and filtered data. Arrays such as these are highly reliable with hot swappable disks and power supplies resulting in operation being continuous even through hardware failures. Another advantage of using SCSI is that the tape drives have access to the data on the same bus, resulting in very high data transfer rates. Similar units are used as caches along the data flow chain to guard against data loss if the data flow is interrupted. All Data Handling hardware, including CPU, disk and tapedrives, fits into three standard-size racks.

Due to the unreliable nature of the satellites available for use at the South Pole, all filtered data is backed up on tape in the unlikely event that this data will need to be physically transferred north. Also, to guard against the possibility that a future technique requires re-filtering, tape copies of the raw data are made. Currently available tape technology such as SDLT (Quantum SDLT, 110 GB raw data at 22 MB/sec) is adequate for the maximum expected data rate of 130 GB per day. Using an array of 4 tape drives gives roughly one week between operator interventions. The filtered data and raw data are stored on separate tapes.

CPU Cluster Once built events are on disk storage they need to be filtered. This requires large amounts of CPU power and a means of controlling the priority of the filtering and other important tasks. This is a problem that the High Performance Computing community has been addressing for many years and a number of batching and load equalization systems are available. A system that is widely used, and which suits this situation well, is PBS (Portable Batch System). This is a very flexible system with a very large user base. It will give priority control over jobs, as tasks such as post-deployment calibrations and monitoring are given high priority, and ensures optimal use of the CPU resources. A large cluster such as this is accessed via one control node, with the rest of the system being easily expanded as required by the expansion of the detector, and or the use of more advanced filtering techniques. Very powerful, compact and inexpensive clusters are now available using dual CPU 1U rack mount machines.

The long period over which IceCube is installed and operated means the consequences of technology advances need to be considered. Such advances in the past have resulted in increased system performance at reduced cost, and size. More recently there has been a strong focus on increased energy efficiency and robustness. Thus, all future advances in technology work in the favor of IceCube, and restraints imposed by current hardware specification will only be relaxed and limited resources such as power and space will not be greater than what is estimated now. By standardizing on specific vendors early, upgrading hardware will be feasible and as straightforward as possible.

Hardware Monitoring Monitoring of the performance of the detector can be divided into two sections, the hardware health, and the state of physical characteristics of each device. The hardware layout of the detector is very modular, with only a few basic units being used. The characteristics of these basic units depends on their function. Disk storage has characteristics such as total storage used and available, and rate of change. The performance of networking devices can be measured by connectivity and response times. CPU nodes have characteristics of both a storage and a network device.

Modern hardware has the ability to monitor the physical health of a device, usually through a system monitoring bus. The Linux OS has software written to access this information, making it possible to quickly monitor CPU loads, temperatures, cooling systems, and power supply voltages. Other devices such as network switches use messaging protocols such as the traditional UNIX syslog system to make this information available.

As with other monitor information, the hardware status information will be entered into the data stream. A hardware monitoring process, as part of the overall monitoring package for the detector, will extract these packets of monitoring information from the data stream and display them in a graphical manner. Many such graphical systems already exist with features such as visual highlighting of system states and problems. During the hours of network connectivity to the South Pole, users will be able to examine the state of the detector as a function of time. This information will be regularly synchronized with a server which is always available for detailed examination both at the Pole and elsewhere. On-site users will be immediately notified of problems through an email pager system.

With the implementation of IceCube being over a period of years, it will be possible to test performance estimates of the system as it is scaled up. This will allow for adjustments and corrections before full operational mode is reached.

8.1.10 Data Distribution

Filtered data is transferred via satellite to a single location in the northern hemisphere. From there it is automaticaly copied via the internet to a designated IceCube institution in the US, where one or more backup copies are made and a standard set of data quality checks are performed. Checks are made to ensure that all the data from the Pole has been transferred completely and correctly, e.g., through use of checksums. Further tests investigate the quality of the data itself. Some of these tests are exactly the same as those performed at the Pole to further check the integrity of the filtered data transfer. Other more detailed tests are also performed (these tests have not yet been defined).

The data is then copied via the internet to a few designated distribution centers in the US and Europe and additional tape backup copies are made. Ultimately, all the filtered data, at 2 TB/yr, reside on disk at each institution (or at centralized facilities shared by a small group of institutions).

The tapes of raw data are shipped north from the Pole at the beginning of each season to a location where at least one full copy is made. Each copy is stored at a different locations in Europe or the US for added safety.