6 Status of application deployment at PM6
This chapter briefly summarizes the status of application deployment on the GILDA and EGEE-0/LCG-2 infrastructures at project month 6. As a reminder, the GILDA infrastructure comprises 12 academic and 2 industrial sites, while EGEE-0/LCG-2 comprises more than 80 sites and 7000 CPUs. A more detailed description of application deployment will be provided in the next NA4 deliverable (DNA4.3).
Besides applications already deployed or under deployment, contacts have been established with national and/or European projects that have expressed interest in collaborating with EGEE at different levels: joint deployment of applications, early testing and usage of EGEE middleware, deployment of virtual laboratories on EGEE, etc. We present some of these contacts to illustrate the variety of requests that must be handled to implement the project vision successfully.
6.1 High Energy Physics applications
LCG-2 has been heavily used during 2004 by the four LHC experiments for large distributed data challenges. These data challenges are a significant achievement, representing the largest-scale computational efforts on generic grid infrastructures to date. The largest previous efforts were the participation of the EDG testbed in the CMS Data Challenge (2002) and the DØ re-reconstruction (2003). A significant step forward has thus been made in the use of grid technology for large-scale production.
LCG/EGEE operations cover over 70 sites distributed worldwide, each of which has provided resources for at least one of the HEP experiment VOs. The reliable running of the international networking infrastructures has also been crucial to this highly successful operation.
Before summarising the technical achievements of the HEP experiments, we must emphasise that all of them have acknowledged the very high quality of support provided by LCG through its EIS (Experiment Integration Support) team. This team has worked closely with the experiments to interface their systems to the grid middleware and to test systems before moving to production. We look forward to setting up support of similar quality within EGEE for Biomedicine and other applications.
A general characteristic of experiment computing has been the use of multiple grids, for example ATLAS using LCG, NorduGrid and the US Grid3. This has worked very well, and this mode of working will continue for the foreseeable future. Each experiment implemented its own systems for workload definition, bookkeeping and data management. Jobs were defined and submitted to the experiment-specific production database (ATLAS, LHCb) or task queue (ALICE). A translator facility retrieved the jobs from the central database, translated them to LCG JDL, and submitted them via the regular LCG job-submission tools. For ATLAS and ALICE, the job sent to the Workload Management System (WMS) was completely specified at submission time. For LHCb, the job sent to the WMS was an ‘agent’ job which, upon starting execution, contacted the LHCb production database and retrieved the ‘real’ job description to be executed. In the following we give an overview of each experiment's use of LCG.
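As an illustration of the translator pattern described above, the following minimal sketch (in Python) shows how an experiment-specific job record might be turned into JDL and handed to the regular LCG-2 submission tools. The database layout, table and file names are hypothetical; only the standard JDL attributes and the edg-job-submit command are taken from LCG-2.

    # Minimal sketch of a "translator facility": pull job records from an
    # experiment production database, write LCG-2 JDL and submit it.
    # The database schema and file names are hypothetical illustrations.
    import sqlite3
    import subprocess
    import textwrap

    def job_to_jdl(executable, arguments, sandbox_files):
        # Standard LCG-2 JDL attributes: Executable, Arguments, sandboxes.
        sandbox = ", ".join('"%s"' % f for f in sandbox_files)
        return textwrap.dedent("""\
            Executable    = "%s";
            Arguments     = "%s";
            StdOutput     = "std.out";
            StdError      = "std.err";
            InputSandbox  = {%s};
            OutputSandbox = {"std.out", "std.err"};
            """) % (executable, arguments, sandbox)

    def main():
        db = sqlite3.connect("production.db")        # hypothetical central database
        rows = db.execute("SELECT id, executable, arguments FROM jobs "
                          "WHERE state = 'defined'").fetchall()
        for job_id, exe, args in rows:
            jdl_path = "job_%d.jdl" % job_id
            with open(jdl_path, "w") as f:
                f.write(job_to_jdl(exe, args, [exe]))
            # Submit through the regular LCG-2 command-line tool.
            subprocess.run(["edg-job-submit", jdl_path], check=True)
            db.execute("UPDATE jobs SET state = 'submitted' WHERE id = ?", (job_id,))
        db.commit()

    if __name__ == "__main__":
        main()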
ALICE commenced Phase 1 of their data challenge in early 2004 and finished it in April. In total they ran 56K jobs across the AliEn and LCG grids; 11K of these jobs, each running for 7.5 hours, were executed on LCG sites. Overall they registered about 2 million files and generated 26 TB of data. They commenced Phase 2 of their data challenge in August, with significantly more jobs going to LCG and more use being made of LCG data management services.
LCG-2 is being used in the LHCb production Data Challenge 2004 (DC04) through the LHCb production system DIRAC. DIRAC is also used on so-called DIRAC-native sites, where LCG-2 is not installed, and in parallel with LCG resources. The LHCb DC started in May 2004, running mostly on the DIRAC-native, non-LCG sites. Progressively, more and more LCG sites were commissioned (using the LHCb testing procedure) and added to the LHCb DC pool of sites. The LCG share of the total volume of produced data grew from 11% in May to 73% in August. By the end of this period up to 3000 jobs were being executed concurrently on LCG sites. A total of 43 LCG sites executed at least one LHCb job, with major contributions from CERN, RAL, CNAF, NIKHEF, PIC and FZK. 113K jobs were run successfully on LCG, each running for more than 24 hours and accomplishing event simulation, digitisation and reconstruction. Some 60 TB of data were generated and stored at CERN, with some 30 TB of derived datasets being stored at several national centres.
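The LHCb 'agent' job pattern can be illustrated with a minimal, hypothetical pull-style sketch; this is not the actual DIRAC code, and the service URL and payload format are assumptions.

    # Sketch of a pull-style 'agent' job: the job submitted to the grid
    # carries no workload of its own, but fetches the real job description
    # from the central production service once it starts on a worker node.
    # The service URL and payload format below are hypothetical.
    import json
    import subprocess
    import urllib.request

    PRODUCTION_SERVICE = "https://example.org/production/matcher"   # hypothetical

    def fetch_payload():
        # Ask the central service for a job suited to this worker node.
        with urllib.request.urlopen(PRODUCTION_SERVICE) as resp:
            return json.load(resp)      # e.g. {"executable": "...", "args": [...]}

    def main():
        payload = fetch_payload()
        if not payload:
            return                      # nothing queued for this site right now
        # Run the 'real' job exactly as described by the production database.
        subprocess.run([payload["executable"]] + payload["args"], check=True)

    if __name__ == "__main__":
        main()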
ATLAS started their data challenge DC2, involving large-scale event simulation, in late July. By early September they had run 40K jobs on LCG, each running for about a day. This constitutes some 40% of their grid-based production, the rest being accomplished on NorduGrid and the US Grid3. They have benefited from the experiences of ALICE and LHCb, achieving efficiencies of 80% on LCG in recent large-scale running.
CMS did not make large-scale use of LCG computational resources in their data challenge (DC04), but they performed extremely important work in confirming their computing model from the point of view of data storage and data transfers involving their Tier-0, Tier-1 and Tier-2 sites. The Spanish and Italian Tier-1 and Tier-2 centres were configured as LCG-2 sites. The full DC04 chain, except the Tier-0 reconstruction, was tested using LCG-2 components. The total number of files registered in the RLS during DC04 was ~570K Logical File Names, each typically with 5 to 10 Physical File Names and 9 metadata attributes. Some performance issues were identified and are currently being addressed. Data distribution from Tier-0 to Tier-1 to Tier-2 was established using LCG Storage Elements and LCG transfer tools. Over 6 TB of data were distributed to the PIC and CNAF Tier-1 centres, reaching sustained transfer rates of 30 MB/s. The total network throughput was limited by the small size of the files being pushed through the system.
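The effect of small files on throughput can be illustrated with a back-of-the-envelope calculation: each transfer pays a roughly fixed per-file overhead (catalogue lookups, protocol setup), so the link is idle between payloads. The numbers below are illustrative assumptions, not DC04 measurements.

    # Back-of-the-envelope illustration of why small files limit throughput.
    # All numbers are illustrative assumptions, not DC04 measurements.
    LINK_RATE_MB_S = 100.0       # assumed usable link bandwidth
    PER_FILE_OVERHEAD_S = 5.0    # assumed fixed cost per file transfer

    def effective_rate(file_size_mb):
        transfer_time = file_size_mb / LINK_RATE_MB_S + PER_FILE_OVERHEAD_S
        return file_size_mb / transfer_time

    for size in (10, 100, 1000):     # file sizes in MB
        print("%5d MB files -> %6.1f MB/s sustained" % (size, effective_rate(size)))
    # With these assumptions: 10 MB files ~2 MB/s, 1000 MB files ~67 MB/s.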
In parallel with this very large-scale production, the experiments and LCG have learned many lessons regarding grid operation and the performance of both application software and grid middleware. As a result, performance has gradually improved, and valuable information has been fed back to the EGEE middleware re-engineering group; this will support the next major step in grid utilisation by the experiments, namely user analysis by many physicists and engineers throughout the community. In addition to the lessons learned about middleware performance and functionality, we have seen that the instability of sites can be a major factor in overall system performance. Both LCG and the experiments have therefore worked on ongoing site-validation procedures. These procedures must run continuously and address individual VOs as well as central site facilities.
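As an illustration of what continuously running, per-VO site validation means in practice, the sketch below shows a hypothetical validation loop; the site list, VO names and test script are placeholders and do not correspond to the actual LCG procedures.

    # Sketch of a continuously running, per-VO site-validation loop.
    # Site list, VO names and test script are hypothetical placeholders,
    # not the actual LCG validation procedures.
    import subprocess
    import time

    SITES = ["ce.site-a.example.org", "ce.site-b.example.org"]   # hypothetical
    VOS = ["alice", "atlas", "cms", "lhcb"]

    def test_site(site, vo):
        # Placeholder test: submit a trivial job targeted at this site for
        # this VO; a real procedure would also check storage access,
        # catalogue entries, experiment software tags, etc.
        result = subprocess.run(["./submit_test_job.sh", site, vo],   # hypothetical script
                                capture_output=True)
        return result.returncode == 0

    def main():
        while True:                  # the procedures must run continuously
            for site in SITES:
                for vo in VOS:
                    status = "OK" if test_site(site, vo) else "FAILING"
                    print("%s / %s: %s" % (site, vo, status))
            time.sleep(3600)         # re-validate every hour (arbitrary choice)

    if __name__ == "__main__":
        main()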
6.1.2 Internal HEP applications
We discuss below progress in the so-called ‘internal’ HEP application area, which covers non-LHC experiments that run on the LCG/EGEE infrastructure and receive their support from sites closely involved in the experiment.
The DØ experiment at the Fermilab Tevatron is collecting some 100 Terabytes of data each year and has very large computing requirements for the various parts of its physics programme. DØ meets these demands by establishing a worldwide distributed computing infrastructure, increasingly based on grid technologies. During the lifetime of the European DataGrid project, use of the EDG Applications Testbed was interfaced to DØ's general use of the US SAMGrid.
After the European DataGrid project ended at the end of March 2004, DØ continued this work in the context of LCG. A DØ VO has been set up on LCG/EGEE, and the principal user support is provided by NIKHEF, a key site in both DØ and LCG. Basic tests have been performed and DØ will be making use of LCG resources.
The BaBar experiment, running at SLAC in the USA, is also collecting many Terabytes of data each year. It has a mature distributed computing system, parts of which are moving to grid technology. The BaBar VO is managed in the UK and currently supported in Italy, the UK and Germany. In the UK, BaBarGrid is migrating its non-LCG grid sites to standard LCG sites, so that a large part of the experiment's simulation production can be managed via LCG-2. BaBar has started to study the integration of grid storage and location services into their experiment production; following this, an important next development will be to study data analysis in the grid environment.