Brief mention of the MONARC model and its evolution towards LCG
HEP data analysis and management, HEP requirements, and processing patterns
Testbed 1 and 2 validation: what has already been done on the testbeds?
Current GRID-based distributed computing model of the HEP experiments
Earth Observation
Mission and plans
What do typical Earth Observation applications do?
Biology
dgBLAST
Common Applications Issues
Applications are the end users of the GRID: they are the ones that ultimately make the difference
All applications started modelling their usage of the GRID through USE CASES: a standard technique for gathering requirements in software development methodologies
Use Cases are narrative documents that describe the sequence of events of an actor using a system [...] to complete processes
What Use Cases are NOT:
the description of an architecture
the representation of an implementation
The LHC challenge
HEP is carried out by a community of more than 10,000 users spread all over the world
The Large Hadron Collider (LHC) at CERN is the most challenging goal for the whole HEP community in the coming years
Test the Standard Model and the models beyond it (SUSY, GUTs) at an energy scale (7+7 TeV p-p collisions) corresponding to the very first instants of the universe after the Big Bang (< 10^-13 s), allowing the study of the quark-gluon plasma
LHC experiments will produce an unprecedented amount of data to be acquired, stored and analysed:
10^10 collision events/year (plus the same amount from simulation)
This corresponds to 3-4 PB of data/year/experiment (ALICE, ATLAS, CMS, LHCb)
Data rate (input to the data storage center): up to 1.2 GB/s per experiment
Collision event records are large: up to 25 MB (real data) and 2 GB (simulation)
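As a quick sanity check of these figures, the short Python sketch below converts the quoted yearly volume into an average stored size per event and an average ingest rate; the ~10^7 s of effective data taking per year is an assumption made here, not a number from the slides.

```python
# Back-of-envelope check of the LHC data volumes quoted above.
# Assumption (not from the slides): ~10^7 s of effective data taking per year.
events_per_year = 1e10        # collision events / year / experiment
volume_pb_per_year = 3.5      # middle of the quoted 3-4 PB / year / experiment
effective_seconds = 1e7       # assumed effective running time per year

avg_event_size_kb = volume_pb_per_year * 1e12 / events_per_year  # 1 PB = 1e12 kB
avg_rate_gb_s = volume_pb_per_year * 1e6 / effective_seconds     # 1 PB = 1e6 GB

print(f"average stored size per event ~ {avg_event_size_kb:.0f} kB")
print(f"average ingest rate ~ {avg_rate_gb_s:.2f} GB/s (quoted peak: 1.2 GB/s)")
```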
HEP Data Analysis and Datasets
Raw data (RAW) ~ 1 MB
hits, pulse heights
Reconstructed data (ESD) ~ 100 kB
tracks, clusters, …
Analysis Objects (AOD) ~ 10 kB
physics objects: summarized, organized by physics topic
Reduced AODs (TAGs) ~ 1 kB
histograms, statistical data on collections of events
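Purely as an illustration of the tiered event-data model above (not the experiments' actual classes), a minimal sketch:

```python
# Illustrative sketch of the event-data tiers and their approximate per-event sizes.
from dataclasses import dataclass

@dataclass
class EventTier:
    name: str
    approx_size_bytes: int
    content: str

DATA_TIERS = [
    EventTier("RAW", 1_000_000, "hits, pulse heights"),
    EventTier("ESD", 100_000,   "tracks, clusters"),
    EventTier("AOD", 10_000,    "physics objects, organized by physics topic"),
    EventTier("TAG", 1_000,     "histograms, statistics on collections of events"),
]

# Each tier is derived from the one above it, reducing the per-event size
# by roughly an order of magnitude at every step.
for tier in DATA_TIERS:
    print(f"{tier.name:>3}: ~{tier.approx_size_bytes:>9,d} B  ({tier.content})")
```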
HEP Data Analysis –processing patterns
Processing is fundamentally independent (embarrassingly parallel) due to the independent nature of ‘events’
Hence the concepts of splitting and merging (see the sketch after this list)
Processing is organised into ‘jobs’, each of which processes N events
(e.g. a simulation job is organised in groups of ~500 events and takes about a day to complete on one node)
A processing pass over 10^6 events would then involve 2,000 jobs, merging into a total data set of ~2 TB
Production processing is planned by experiment and physics group data managers (this will vary from experiment to experiment)
Reconstruction processing (1-3 times a year, over 10^9 events)
Physics group processing (perhaps 1/month); produces ~10^7 AOD+TAG events
This may be distributed over several centres
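A minimal sketch of the split/merge pattern just described, assuming ~500 events per job; the function names are illustrative, not the experiments' production tools:

```python
# Split a processing pass over N independent events into jobs of ~500 events,
# run each job independently, and merge the per-job outputs.
EVENTS_PER_JOB = 500

def split(total_events: int, events_per_job: int = EVENTS_PER_JOB):
    """Yield (first_event, n_events) ranges, one per job."""
    for first in range(0, total_events, events_per_job):
        yield first, min(events_per_job, total_events - first)

def process(first_event: int, n_events: int) -> list[str]:
    """Stand-in for one simulation/reconstruction job over its event range."""
    return [f"event-{first_event + i}" for i in range(n_events)]

def merge(outputs: list[list[str]]) -> list[str]:
    """Concatenate the per-job outputs into the final data set."""
    return [record for job_output in outputs for record in job_output]

jobs = list(split(1_000_000))                        # 10^6 events -> 2,000 jobs
dataset = merge([process(*job) for job in jobs[:3]]) # run a few jobs as a demo
print(len(jobs), "jobs;", len(dataset), "records from the first 3 jobs")
```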
Processing Patterns (2)
Individual physics analysis is by definition ‘chaotic’ (it follows the work patterns of individuals)
Hundreds of physicists distributed across an experiment may each want to access the central AOD+TAG and run their own selections; they will need very selective access to ESD+RAW data (for tuning algorithms, checking occasional events)
This will require replication of AOD+TAG within the experiment, and selective replication of RAW+ESD
This will be a function of processing and physics group organisation in the experiment
ALICE: AliEn-EDG integration
What have the HEP experiments already done on the EDG testbeds 1.0 and 2.0?
The EDG User Community has actively contributed to the validation of the first and second EDG testbeds (Feb 2002 – Feb 2003)
All four LHC experiments have run their software (initially in preliminary versions) to perform the basic operations supported by the EDG middleware features of testbed 1
Validation included job submission (JDL), output retrieval, job status queries, basic data management operations (file replication, registration into replica catalogs), and checks for possible software dependency or incompatibility problems (e.g. missing libraries or RPMs); a minimal submission example is sketched below
ATLAS, CMS and ALICE have run intensive production data challenges and stress tests during 2002 and 2003
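For illustration, a minimal sketch of the kind of job-submission test run during validation; the JDL attributes and the edg-job-submit command name are recalled from EDG 2.x usage and may differ between middleware releases, so treat them as assumptions:

```python
# Sketch of a basic validation job: write a trivial JDL file and submit it.
# Requires an EDG User Interface machine; command names are assumptions (EDG 2.x).
import subprocess
import textwrap

jdl = textwrap.dedent("""\
    Executable    = "test.sh";
    StdOutput     = "std.out";
    StdError      = "std.err";
    InputSandbox  = {"test.sh"};
    OutputSandbox = {"std.out", "std.err"};
""")

with open("test.jdl", "w") as f:
    f.write(jdl)

# Submit the job; the returned job identifier would then be passed to
# edg-job-status (status query) and edg-job-get-output (output retrieval).
subprocess.run(["edg-job-submit", "test.jdl"], check=True)
```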
The CMS Stress Test
CMS Monte Carlo production using the BOSS and Impala tools
Originally designed for submitting and monitoring jobs on a ‘local’ farm (e.g. PBS)
Modified to treat the Grid as a ‘local farm’ (see the sketch at the end of this section)
December 2002 to January 2003
250,000 events generated by job submission from 4 separate UIs (User Interfaces)
2,147 event files produced
500 GB of data transferred using automated Grid tools during production, including transfers to and from the mass storage systems at CERN and Lyon
Efficiency of 83% for (small) CMKIN jobs, 70% for (large) CMSIM jobs
It was possible to quickly add new sites to provide extra resources
Fast turnaround in bug fixing and in installing new software
The test was labour intensive (since the software was still developing and the overall system was initially fragile)
The new release, EDG 2.0, should fix the major problems, providing a system suitable for full integration into distributed production
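The "treat the Grid as a local farm" point above is essentially an adapter design; the sketch below illustrates the idea under assumed names (it is not the actual BOSS/Impala code):

```python
# The production tools talk to an abstract scheduler interface, so the Grid can
# be plugged in as if it were just another 'local farm' back-end.
from abc import ABC, abstractmethod

class Scheduler(ABC):
    """What the production tools expect from any batch back-end."""
    @abstractmethod
    def submit(self, job_script: str) -> str: ...
    @abstractmethod
    def status(self, job_id: str) -> str: ...

class LocalFarmScheduler(Scheduler):
    """Original target: a local farm batch system such as PBS."""
    def submit(self, job_script: str) -> str:
        return f"pbs-{hash(job_script) & 0xffff}"   # stand-in for a real submission
    def status(self, job_id: str) -> str:
        return "RUNNING"                            # stand-in for a real status query

class GridScheduler(Scheduler):
    """Added back-end: submit through the Grid middleware instead."""
    def submit(self, job_script: str) -> str:
        return f"grid-{hash(job_script) & 0xffff}"  # stand-in for a Grid submission
    def status(self, job_id: str) -> str:
        return "SCHEDULED"                          # stand-in for a Grid status query

def run_production(scheduler: Scheduler, job_scripts: list[str]) -> list[str]:
    """The production loop is unchanged whichever back-end is used."""
    return [scheduler.submit(script) for script in job_scripts]

print(run_production(GridScheduler(), ["job_001.sh", "job_002.sh"]))
```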
Earth Observation applications (WP9)
Global Ozone Monitoring Experiment (GOME) Satellite Data Processing and Validation by KNMI, IPSL and ESA
The DataGrid testbed provides a collaborative processing environment for 3 geographically distributed EO sites (Holland, France, Italy)
Earth Observation
Two different GOME processing techniques will be investigated:
OPERA (Holland) - tightly coupled - using MPI (see the sketch after this list)
NOPREGO (Italy) - loosely coupled - using Neural Networks
The results are checked by VALIDATION (France): satellite observations are compared against ground-based LIDAR measurements coincident in area and time
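To make the tightly-coupled/loosely-coupled distinction concrete, here is a minimal MPI-style sketch using mpi4py; the orbit size and the round-robin split are assumptions for illustration, not details of the OPERA code:

```python
# Tightly coupled decomposition: every rank works on a slice of one orbit and
# the partial results are combined with a collective reduction.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_pixels = 10_000                          # pixels in one orbit (made-up number)
my_pixels = range(rank, n_pixels, size)    # round-robin split across ranks
partial = sum(1 for _ in my_pixels)        # stand-in for per-pixel retrieval work

total = comm.reduce(partial, op=MPI.SUM, root=0)
if rank == 0:
    print(f"processed {total} pixels on {size} ranks")
```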
GOME OZONE Data Processing Model
Level-1 data (raw satellite measurements) are analysed to retrieve actual physical quantities: Level-2 data
Level-2 data provide measurements of ozone within a vertical column of atmosphere at a given lat/lon location above the Earth’s surface
Coincident data consist of Level-2 data co-registered with LIDAR data (ground-based observations) and compared using statistical methods (see the sketch below)
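A minimal sketch of the coincidence and comparison step, with assumed thresholds and field names (not the WP9 validation code):

```python
# Pair Level-2 ozone columns with ground-based LIDAR measurements that are
# coincident in area and time, then compare them statistically.
from dataclasses import dataclass
from statistics import mean

@dataclass
class Measurement:
    lat: float
    lon: float
    time_h: float      # hours since some reference epoch
    ozone_du: float    # ozone column in Dobson units

def coincident(sat: Measurement, ground: Measurement,
               max_deg: float = 1.0, max_hours: float = 3.0) -> bool:
    """Crude co-registration criterion: close in space and time (assumed limits)."""
    return (abs(sat.lat - ground.lat) <= max_deg
            and abs(sat.lon - ground.lon) <= max_deg
            and abs(sat.time_h - ground.time_h) <= max_hours)

def compare(level2: list[Measurement], lidar: list[Measurement]) -> float:
    """Mean satellite-minus-ground difference over coincident pairs."""
    diffs = [s.ozone_du - g.ozone_du
             for s in level2 for g in lidar if coincident(s, g)]
    return mean(diffs) if diffs else float("nan")
```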
GOME Processing Steps (1-2)
GOME Processing Steps (3-4)
GOME Processing Steps (5-6)
Biomedical requirements
Large user community (thousands of users)
anonymous/group login
Data management
data updates and data versioning
Large volume management (a hospital can accumulate TBs of images in a year)