IceCube
IceCube Neutrino Observatory

PDD - Offline Data Flow


8.2 Offline Data Flow

The DAQ system depicted in fig. 73 deposits data from several different sources into a disk cache at the South Pole. Triggered events with a size of about 1.0 kB are collected at roughly 1.5 kHz and are intermingled with monitor and calibration data. Triggered events are dominated by downgoing atmospheric muons. From these figures, the total data volume from the detector is estimated at 130 GB/day, or about 50 TB/year. Since the satellite bandwidth is 12 GB/day, we must reduce the data volume by roughly an order of magnitude at the Pole. This reduction will be accomplished with specialized filters. The hardware required to implement these filters is described in sec. 8.1.9. In addition to the filtered data, we will also send monitor data and calibration data over the satellite. A sketch of the Data Handling system is provided in fig. 75.
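The volume estimate above can be reproduced with a short calculation. This is only a back-of-envelope sketch: it assumes decimal units (1 kB = 1000 bytes) and a 365-day year, and the variable names are illustrative rather than taken from any IceCube software.

```python
# Back-of-envelope check of the raw data volume quoted in the text.
# Assumptions: decimal units (1 kB = 1e3 bytes), 365-day year.
EVENT_SIZE_BYTES = 1.0e3       # ~1.0 kB per triggered event
TRIGGER_RATE_HZ = 1.5e3        # ~1.5 kHz trigger rate
SATELLITE_GB_PER_DAY = 12.0    # available satellite bandwidth
SECONDS_PER_DAY = 86_400


def raw_volume_gb_per_day(event_size_bytes, rate_hz):
    """Daily raw data volume in GB for a given event size and rate."""
    return event_size_bytes * rate_hz * SECONDS_PER_DAY / 1e9


daily_gb = raw_volume_gb_per_day(EVENT_SIZE_BYTES, TRIGGER_RATE_HZ)
yearly_tb = daily_gb * 365 / 1e3
reduction_factor = daily_gb / SATELLITE_GB_PER_DAY

print(f"raw volume: {daily_gb:.1f} GB/day, {yearly_tb:.1f} TB/year")
print(f"required reduction factor: {reduction_factor:.1f}")
```

The result, close to 130 GB/day and a reduction factor of about 11, is consistent with the rounded figures in the text.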

Note that during the six-string first year, all raw data will be delivered to the northern hemisphere over the satellite for analysis and filter development, and taping at the Pole will only be used for data backup. The required bandwidth is estimated from the fraction of strings relative to the full IceCube detector and the estimated IceCube unfiltered raw data rate: (6/80) × 130 GB/day ≅ 10 GB/day. (AMANDA-II will also require satellite bandwidth of about 2 GB/day; the total required bandwidth remains below 12 GB/day.) Transferring all of the unfiltered raw data in the first year gives the software developers additional time to develop a coherent system, and in particular to develop the robust online filters which will be required in all subsequent years.
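The first-year bandwidth budget can be checked in the same way. The sketch below assumes, as the text does, that the raw rate scales linearly with the number of deployed strings; the function and constant names are hypothetical.

```python
# First-year (six-string) satellite bandwidth budget.
# Assumption from the text: raw rate scales linearly with string count.
FULL_DETECTOR_STRINGS = 80
FIRST_YEAR_STRINGS = 6
FULL_RAW_GB_PER_DAY = 130.0
AMANDA_GB_PER_DAY = 2.0
SATELLITE_GB_PER_DAY = 12.0


def scaled_raw_rate(strings, full_strings, full_rate_gb_per_day):
    """Raw data rate for a partially deployed detector, in GB/day."""
    return strings / full_strings * full_rate_gb_per_day


icecube_gb_per_day = scaled_raw_rate(
    FIRST_YEAR_STRINGS, FULL_DETECTOR_STRINGS, FULL_RAW_GB_PER_DAY
)
total_gb_per_day = icecube_gb_per_day + AMANDA_GB_PER_DAY
headroom_gb_per_day = SATELLITE_GB_PER_DAY - total_gb_per_day

print(f"IceCube (6 strings): {icecube_gb_per_day:.2f} GB/day")
print(f"total with AMANDA-II: {total_gb_per_day:.2f} GB/day")
print(f"headroom: {headroom_gb_per_day:.2f} GB/day")
```

With these inputs the combined load stays just under the 12 GB/day limit, which leaves very little margin in the first year.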

A pre-filter raw data cache will hold 2 days' data so that correlations with GRBs and other external triggers can be studied. The post-filter cache needs to hold at least a week of data to handle failed transfers and satellite downtimes smoothly. We will avoid saturating the bandwidth every day, so that the transfer process can catch up with the data taking in a reasonable amount of time after an interruption. Prioritizing data for transfer may be useful during a recovery period.
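To illustrate why spare bandwidth matters for recovery, the sketch below estimates how long it takes to clear a backlog after an outage. It is a simplified model, not a description of the actual transfer software: it counts only the filter stream (50 kB/s, as listed in Table 11), ignores monitor and calibration data, and uses a hypothetical three-day outage.

```python
import math


def catch_up_days(outage_days, produced_gb_per_day, link_gb_per_day):
    """Days needed to clear an outage backlog while new data keeps arriving.

    Returns math.inf if the link has no spare capacity.
    """
    backlog_gb = outage_days * produced_gb_per_day
    spare_gb_per_day = link_gb_per_day - produced_gb_per_day
    if spare_gb_per_day <= 0:
        return math.inf
    return backlog_gb / spare_gb_per_day


FILTER_GB_PER_DAY = 50e3 * 86_400 / 1e9   # 50 kB/s filter output, in GB/day
LINK_GB_PER_DAY = 12.0                    # satellite bandwidth
OUTAGE_DAYS = 3                           # hypothetical outage length

recovery = catch_up_days(OUTAGE_DAYS, FILTER_GB_PER_DAY, LINK_GB_PER_DAY)
saturated = catch_up_days(OUTAGE_DAYS, LINK_GB_PER_DAY, LINK_GB_PER_DAY)

print(f"recovery with headroom: {recovery:.2f} days")
print(f"recovery with a saturated link: {saturated}")
```

When daily production equals the link capacity, the backlog never clears, which is the situation the text aims to avoid.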

After transfer, the filtered data will be stored indefinitely in a mass archive. Expected data size is 2 TB/year plus a comparable amount of simulated data. Additional processing will be provided as necessary to support the analysis working groups' access needs and use of the data. The CPU requirements for IceCube simulations are estimated from the desire to simulate a reasonable fraction of the background (e.g., 20%), and from an estimated 1 s/event IceCube simulation time. Table 11 lists bandwidth, CPU and disk space requirements for IceCube.

Data which is not transferred over the satellite will take over a year to receive from the Pole, archive and filter. Since it is highly desirable and certainly feasible to perform filtering at the Pole, we are concentrating our efforts on such a system. There are several benefits which accrue from this decision. First, the data analysis can begin essentially right after data is taken, more than a year sooner than otherwise. Second, priority is placed on looking at the data immediately so that problems are detected and fixed promptly. Finally, we are concentrating our efforts on a single system from the start. To avoid having to refilter this data, it is imperative that this system works reliably. Note that the scenario in which filtering occurs only in the northern hemisphere has its own serious downside beyond its inherently longer latency: computing needs grow overwhelmingly large if we demand reasonably fast filtering turnaround times.
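The simulation CPU estimate mentioned above can be sketched as follows. The text gives the simulated fraction (20%) and the time per event (1 s); the sketch additionally assumes that the background rate equals the roughly 1.5 kHz trigger rate and that simulation must keep pace with data taking in real time. Both assumptions are ours, although they do reproduce the 300 CPUs listed for simulation in Table 11.

```python
# Rough estimate of the number of CPUs needed for simulation.
# Assumptions (not stated explicitly in the text): the background rate
# equals the ~1.5 kHz trigger rate, and simulation keeps up in real time.
BACKGROUND_RATE_HZ = 1.5e3     # assumed equal to the trigger rate
SIMULATED_FRACTION = 0.20      # fraction of background to simulate
SECONDS_PER_EVENT = 1.0        # estimated simulation time per event


def simulation_cpus(rate_hz, fraction, seconds_per_event):
    """CPUs needed to simulate `fraction` of the events in real time."""
    return rate_hz * fraction * seconds_per_event


cpus = simulation_cpus(BACKGROUND_RATE_HZ, SIMULATED_FRACTION, SECONDS_PER_EVENT)
print(f"simulation CPUs required: {cpus:.0f}")
```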

Process               Approx. output bandwidth    1-GHz CPUs required
event builder         1.0 MB/s                    1
Pole filter           50 kB/s                     45
online calibration    25 MB/yr                    5
satellite transfer    12 GB/day                   1
simulation            n/a                         300
offline processing    <50 kB/s                    50

Cache (cache type)                      Cache size
30-day raw (FIFO)                       4 TB
30-day filtered (FIFO)                  330 GB
filtered data archive (permanent)       2 TB/year
simulated data archive (permanent)      2 TB/year

Table 11: Bandwidth, cache and CPU requirements for IceCube data handling. Note that the online calibration uses 5 CPUs only for the deployment period.
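As a cross-check of the cache sizes in Table 11, the sketch below relates the 30-day FIFO caches to daily data rates. It assumes decimal units and that each cache is sized as retention time multiplied by a constant daily volume; the names are illustrative.

```python
# Cross-check of the 30-day FIFO cache sizes in Table 11.
# Assumption: cache size = retention days x constant daily volume.
RETENTION_DAYS = 30
RAW_GB_PER_DAY = 130.0          # raw data rate from the text
FILTERED_CACHE_GB = 330.0       # filtered cache size from Table 11
SATELLITE_GB_PER_DAY = 12.0     # satellite bandwidth

raw_cache_tb = RAW_GB_PER_DAY * RETENTION_DAYS / 1e3
implied_filtered_gb_per_day = FILTERED_CACHE_GB / RETENTION_DAYS

print(f"30-day raw cache: {raw_cache_tb:.1f} TB")
print(f"implied filtered rate: {implied_filtered_gb_per_day:.1f} GB/day")
```

The raw cache comes out near the 4 TB listed, and the filtered cache corresponds to a daily volume slightly below the satellite bandwidth.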

Figure 75: Diagram of the IceCube Data Handling system at the Pole. Arrows indicate direction of primary data flow, with data flow volume as indicated. The raw data is cached for filter processing and to permit extraction of raw data coincident with GRB (and other external) triggers. External triggers may arrive at any time if Iridium access is available. Raw data is permanently recorded to tape. Filter, monitor, GRB and calibration (FMGC) data are buffered to a small disk cache and transferred via satellite link to the northern hemisphere. With Iridium access, monitor data can also be uploaded to the northern hemisphere. FMGC data is also backed up to tape at the Pole. (See fig. 73 for a diagram of the upstream DAQ system.)

Tapes are used at the Pole to back up filtered data in order to deal with possible long-term satellite outages. Tapes are also used at the Pole to make copies of all the unfiltered raw data. All data written to tape is checksummed to help ensure data integrity.
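The text does not say which checksum algorithm is used. As a minimal sketch of the idea, the example below computes a SHA-256 digest over a data stream in fixed-size chunks and verifies it later; the choice of SHA-256, the chunk size, and the function names are all assumptions made for illustration.

```python
import hashlib
import io

CHUNK_SIZE = 1 << 20  # read 1 MiB at a time; an arbitrary choice


def stream_checksum(fileobj, chunk_size=CHUNK_SIZE):
    """Return the SHA-256 hex digest of a binary file-like object."""
    digest = hashlib.sha256()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()


def verify(fileobj, expected_hex):
    """True if the stream's checksum matches the recorded value."""
    return stream_checksum(fileobj) == expected_hex


# Usage with an in-memory stand-in for a data file written to tape.
payload = b"example event data" * 1000
recorded = stream_checksum(io.BytesIO(payload))

intact = verify(io.BytesIO(payload), recorded)
corrupted = verify(io.BytesIO(payload[:-1] + b"\x00"), recorded)
print(f"intact copy verifies: {intact}; corrupted copy verifies: {corrupted}")
```

Recording the digest alongside each file when it is written allows a later read-back from tape to be checked against the original.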

In the northern hemisphere, the computing constraints are similar to those for any large computing project. Data samples of about 2 TB/yr are archived in a mass storage system for access. Distributed data needs to fit on users' facilities at their home institutions. Processing power needs to be somewhat centralized for big projects and otherwise distributed so that it is accessible for analysis. Expensive software tools or licenses need to be purchased in such a way that all collaborating institutions can participate fully in the data analysis.