Brief mention of the MONARC model and its evolution towards LCG
HEP data analysis and management, HEP requirements, and processing patterns
Testbed 1 and 2 validation: what has already been done on the testbeds?
Current GRID-based distributed computing model of the HEP experiments
Earth Observation
Mission and plans
What do typical Earth Observation applications do?
Biology
dgBLAST
Common Applications Issues
Applications are the end users of the GRID: they are the ones that ultimately make the difference
All applications started modelling their usage of the GRID through USE CASES: a standard technique for gathering requirements in software development methodologies
Use Cases are narrative documents that describe the sequence of events of an actor using a system [...] to complete processes
What Use Cases are NOT:
the description of an architecture
the representation of an implementation
The LHC challenge
HEP is carried out by a community of more than 10,000 users spread all over the world
The Large Hadron Collider (LHC) at CERN is the most challenging goal for the whole HEP community in the coming years
Test the Standard Model and the models beyond it (SUSY, GUTs) at an energy scale (7+7 TeV p-p collisions) corresponding to the very first instants of the universe after the Big Bang (< 10^-13 s), allowing the study of the quark-gluon plasma
LHC experiments will produce an unprecedented amount of data to be acquired, stored and analysed:
10^10 collision events/year (plus the same amount from simulation)
This corresponds to 3-4 PB of data/year/experiment (ALICE, ATLAS, CMS, LHCb)
Data rate (input to the data storage center): up to 1.2 GB/s per experiment
Collision event records are large: up to 25 MB (real data) and 2 GB (simulation)
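As a quick sanity check of these figures, the short Python sketch below converts the quoted yearly volume into an average stored size per event and an average ingest rate; the ~10^7 s of effective data taking per year is an assumption made here, not a number from the slides.

```python
# Back-of-envelope check of the LHC data volumes quoted above.
# Assumption (not from the slides): ~10^7 s of effective data taking per year.
events_per_year = 1e10        # collision events / year / experiment
volume_pb_per_year = 3.5      # middle of the quoted 3-4 PB / year / experiment
effective_seconds = 1e7       # assumed effective running time per year

avg_event_size_kb = volume_pb_per_year * 1e12 / events_per_year  # 1 PB = 1e12 kB
avg_rate_gb_s = volume_pb_per_year * 1e6 / effective_seconds     # 1 PB = 1e6 GB

print(f"average stored size per event ~ {avg_event_size_kb:.0f} kB")
print(f"average ingest rate ~ {avg_rate_gb_s:.2f} GB/s (quoted peak: 1.2 GB/s)")
```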
HEP Data Analysis and Datasets
Raw data (RAW) ~ 1 MB
hits, pulse heights
Reconstructed data (ESD) ~ 100 kB
tracks, clusters, …
Analysis Objects (AOD) ~ 10 kB
physics objects: summarized, organized by physics topic
Reduced AODs (TAGs) ~ 1 kB
histograms, statistical data on collections of events
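Purely as an illustration of the tiered event-data model above (not the experiments' actual classes), a minimal sketch:

```python
# Illustrative sketch of the event-data tiers and their approximate per-event sizes.
from dataclasses import dataclass

@dataclass
class EventTier:
    name: str
    approx_size_bytes: int
    content: str

DATA_TIERS = [
    EventTier("RAW", 1_000_000, "hits, pulse heights"),
    EventTier("ESD", 100_000,   "tracks, clusters"),
    EventTier("AOD", 10_000,    "physics objects, organized by physics topic"),
    EventTier("TAG", 1_000,     "histograms, statistics on collections of events"),
]

# Each tier is derived from the one above it, reducing the per-event size
# by roughly an order of magnitude at every step.
for tier in DATA_TIERS:
    print(f"{tier.name:>3}: ~{tier.approx_size_bytes:>9,d} B  ({tier.content})")
```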
HEP Data Analysis –processing patterns
Processing is fundamentally independent (embarrassingly parallel) due to the independent nature of ‘events’
Hence the concepts of splitting and merging (see the sketch after this list)
Processing is organised into ‘jobs’, each of which processes N events
(e.g. a simulation job is organised in groups of ~500 events and takes about a day to complete on one node)
A processing pass over 10^6 events would then involve 2,000 jobs, merging into a total data set of ~2 TB
Production processing is planned by experiment and physics group data managers (this will vary from experiment to experiment)
Reconstruction processing (1-3 times a year, over 10^9 events)
Physics group processing (perhaps 1/month); produces ~10^7 AOD+TAG events
This may be distributed over several centres
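A minimal sketch of the split/merge pattern just described, assuming ~500 events per job; the function names are illustrative, not the experiments' production tools:

```python
# Split a processing pass over N independent events into jobs of ~500 events,
# run each job independently, and merge the per-job outputs.
EVENTS_PER_JOB = 500

def split(total_events: int, events_per_job: int = EVENTS_PER_JOB):
    """Yield (first_event, n_events) ranges, one per job."""
    for first in range(0, total_events, events_per_job):
        yield first, min(events_per_job, total_events - first)

def process(first_event: int, n_events: int) -> list[str]:
    """Stand-in for one simulation/reconstruction job over its event range."""
    return [f"event-{first_event + i}" for i in range(n_events)]

def merge(outputs: list[list[str]]) -> list[str]:
    """Concatenate the per-job outputs into the final data set."""
    return [record for job_output in outputs for record in job_output]

jobs = list(split(1_000_000))                        # 10^6 events -> 2,000 jobs
dataset = merge([process(*job) for job in jobs[:3]]) # run a few jobs as a demo
print(len(jobs), "jobs;", len(dataset), "records from the first 3 jobs")
```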
Processing Patterns (2)
Individual physics analysis is by definition ‘chaotic’ (it follows the work patterns of individuals)
Hundreds of physicists distributed across an experiment may each want to access the central AOD+TAG and run their own selections; they will need very selective access to ESD+RAW data (for tuning algorithms, checking occasional events)
This will require replication of AOD+TAG within the experiment, and selective replication of RAW+ESD
This will be a function of processing and physics group organisation in the experiment
ALICE: AliEn-EDG integration
What have the HEP experiments already done on the EDG testbeds 1.0 and 2.0?
The EDG User Community has actively contributed to the validation of the first and second EDG testbeds (Feb 2002 – Feb 2003)
All four LHC experiments have run their software (initially in preliminary versions) to perform the basic operations supported by the EDG middleware features of testbed 1
Validation included job submission (JDL), output retrieval, job status queries, basic data management operations (file replication, registration into replica catalogs), and checks for possible software dependency or incompatibility problems (e.g. missing libraries or RPMs); a minimal submission example is sketched below
ATLAS, CMS and ALICE have run intensive production data challenges and stress tests during 2002 and 2003
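For illustration, a minimal sketch of the kind of job-submission test run during validation; the JDL attributes and the edg-job-submit command name are recalled from EDG 2.x usage and may differ between middleware releases, so treat them as assumptions:

```python
# Sketch of a basic validation job: write a trivial JDL file and submit it.
# Requires an EDG User Interface machine; command names are assumptions (EDG 2.x).
import subprocess
import textwrap

jdl = textwrap.dedent("""\
    Executable    = "test.sh";
    StdOutput     = "std.out";
    StdError      = "std.err";
    InputSandbox  = {"test.sh"};
    OutputSandbox = {"std.out", "std.err"};
""")

with open("test.jdl", "w") as f:
    f.write(jdl)

# Submit the job; the returned job identifier would then be passed to
# edg-job-status (status query) and edg-job-get-output (output retrieval).
subprocess.run(["edg-job-submit", "test.jdl"], check=True)
```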
The CMS Stress Test
CMS Monte Carlo production using the BOSS and Impala tools
Originally designed for submitting and monitoring jobs on a ‘local’ farm (e.g. PBS)
Modified to treat the Grid as a ‘local farm’ (see the sketch at the end of this section)
December 2002 to January 2003
250,000 events generated by job submission from 4 separate UIs (User Interfaces)
2,147 event files produced
500 GB of data transferred using automated Grid tools during production, including transfers to and from the mass storage systems at CERN and Lyon
Efficiency of 83% for (small) CMKIN jobs, 70% for (large) CMSIM jobs
It was possible to quickly add new sites to provide extra resources
Fast turnaround in bug fixing and in installing new software
The test was labour intensive (since the software was still developing and the overall system was initially fragile)
The new release, EDG 2.0, should fix the major problems, providing a system suitable for full integration into distributed production
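The "treat the Grid as a local farm" point above is essentially an adapter design; the sketch below illustrates the idea under assumed names (it is not the actual BOSS/Impala code):

```python
# The production tools talk to an abstract scheduler interface, so the Grid can
# be plugged in as if it were just another 'local farm' back-end.
from abc import ABC, abstractmethod

class Scheduler(ABC):
    """What the production tools expect from any batch back-end."""
    @abstractmethod
    def submit(self, job_script: str) -> str: ...
    @abstractmethod
    def status(self, job_id: str) -> str: ...

class LocalFarmScheduler(Scheduler):
    """Original target: a local farm batch system such as PBS."""
    def submit(self, job_script: str) -> str:
        return f"pbs-{hash(job_script) & 0xffff}"   # stand-in for a real submission
    def status(self, job_id: str) -> str:
        return "RUNNING"                            # stand-in for a real status query

class GridScheduler(Scheduler):
    """Added back-end: submit through the Grid middleware instead."""
    def submit(self, job_script: str) -> str:
        return f"grid-{hash(job_script) & 0xffff}"  # stand-in for a Grid submission
    def status(self, job_id: str) -> str:
        return "SCHEDULED"                          # stand-in for a Grid status query

def run_production(scheduler: Scheduler, job_scripts: list[str]) -> list[str]:
    """The production loop is unchanged whichever back-end is used."""
    return [scheduler.submit(script) for script in job_scripts]

print(run_production(GridScheduler(), ["job_001.sh", "job_002.sh"]))
```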
Earth Observation applications (WP9)
Global Ozone Monitoring Experiment (GOME) Satellite Data Processing and Validation by KNMI, IPSL and ESA
The DataGrid testbed provides a collaborative processing environment for 3 geographically distributed EO sites (Holland, France, Italy)
Earth Observation
Two different GOME processing techniques will be investigated:
OPERA (Holland) - tightly coupled - using MPI (see the sketch after this list)
NOPREGO (Italy) - loosely coupled - using Neural Networks
The results are checked by VALIDATION (France): satellite observations are compared against ground-based LIDAR measurements coincident in area and time
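To make the tightly-coupled/loosely-coupled distinction concrete, here is a minimal MPI-style sketch using mpi4py; the orbit size and the round-robin split are assumptions for illustration, not details of the OPERA code:

```python
# Tightly coupled decomposition: every rank works on a slice of one orbit and
# the partial results are combined with a collective reduction.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_pixels = 10_000                          # pixels in one orbit (made-up number)
my_pixels = range(rank, n_pixels, size)    # round-robin split across ranks
partial = sum(1 for _ in my_pixels)        # stand-in for per-pixel retrieval work

total = comm.reduce(partial, op=MPI.SUM, root=0)
if rank == 0:
    print(f"processed {total} pixels on {size} ranks")
```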
GOME OZONE Data Processing Model
Level-1 data (raw satellite measurements) are analysed to retrieve actual physical quantities: Level-2 data
Level-2 data provide measurements of ozone within a vertical column of atmosphere at a given lat/lon location above the Earth’s surface
Coincident data consist of Level-2 data co-registered with LIDAR data (ground-based observations) and compared using statistical methods (see the sketch below)
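A minimal sketch of the coincidence and comparison step, with assumed thresholds and field names (not the WP9 validation code):

```python
# Pair Level-2 ozone columns with ground-based LIDAR measurements that are
# coincident in area and time, then compare them statistically.
from dataclasses import dataclass
from statistics import mean

@dataclass
class Measurement:
    lat: float
    lon: float
    time_h: float      # hours since some reference epoch
    ozone_du: float    # ozone column in Dobson units

def coincident(sat: Measurement, ground: Measurement,
               max_deg: float = 1.0, max_hours: float = 3.0) -> bool:
    """Crude co-registration criterion: close in space and time (assumed limits)."""
    return (abs(sat.lat - ground.lat) <= max_deg
            and abs(sat.lon - ground.lon) <= max_deg
            and abs(sat.time_h - ground.time_h) <= max_hours)

def compare(level2: list[Measurement], lidar: list[Measurement]) -> float:
    """Mean satellite-minus-ground difference over coincident pairs."""
    diffs = [s.ozone_du - g.ozone_du
             for s in level2 for g in lidar if coincident(s, g)]
    return mean(diffs) if diffs else float("nan")
```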
GOME Processing Steps (1-2)
GOME Processing Steps (3-4)
GOME Processing Steps (5-6)
Biomedical requirements
Large user community (thousands of users)
anonymous/group login
Data management
data updates and data versioning
Large volume management (a hospital can accumulate TBs of images in a year)