The Service activities in EGEE-III build upon the experience gained and the infrastructure deployed in the predecessor projects EGEE and EGEE-II. This infrastructure is a leading global Grid, in terms both of the scale of resources provided and the number of user communities supported.
The service activities within EGEE-III are aimed at ensuring that the Grid infrastructure delivers a service that focuses on enabling and supporting science in diverse research communities while taking appropriate steps towards a sustainable infrastructure in Europe. This will be achieved through provision of a production infrastructure (SA1), coordination of networking support with GÉANT2 and the NRENs (SA2), and provision of a middleware distribution to the production infrastructure as well as to related efforts worldwide (SA3). The EGEE middleware distribution (gLite) combines components from different providers, most importantly the EGEE middleware engineering activity JRA1, but also the Virtual Data Toolkit (VDT) distribution of OSG and application projects such as LCG. The components are chosen to satisfy the requirements of the EGEE user communities and operations. Interoperability is one of the drivers of the gLite distribution, as allowing access to a diverse set of research infrastructures is a major goal of EGEE-III. These infrastructures comprise in particular related efforts such as SEE-Grid, BalticGrid, NorduGrid, and DEISA in Europe, OSG and TeraGrid in the US, and NAREGI in Japan.
The service activities are closely related and work with the Technical Management Board to ensure the middleware and services deployed and operated by SA1 are as robust, effective, and reliable as possible. SA3 provides essential second level support for the deployed services (on the production or pre-production systems), acting as triage for third level support by the middleware developers or external projects. The SA1 and SA3 activities cleanly separate the problems associated with middleware development and integration from the issues associated with deploying and maintaining a production-quality service. SA2 fulfils the role of interface between the EGEE infrastructure and the NRENs and GÉANT2.
Based on the existing EGEE-II procedures, EGEE-III will provide the following infrastructures in order to carry out its mission:
A Production Service infrastructure, with incremental growth anticipated within the existing structure, and expanded through collaborating infrastructure projects listed in section 3. Interoperability with other Grid infrastructures will evolve at all levels from campus to international.
A Pre-Production Service (PPS), which will demonstrate new services, or new versions of existing services, before they move to production. This will provide an environment for applications to test new services and to integrate their software with Grid services. The PPS has also proved invaluable for deployment testing in the Regional Operations Centres (ROCs) before full distribution of new or updated services.
The EGEE Network Operations Centre (ENOC), which handles network operational coordination between EGEE and the network providers (GÉANT2 and the NRENs).
This is complemented by the training infrastructure and the certification test-beds, as well as the necessary support structures and policy groups, as shown in Table 3, which also lists the responsible activity.
A general strategy of the Service Activities in EGEE-III will be to optimise the activities in order to reduce the overall level of effort required in the future to manage a sustainable infrastructure. In an era of National Grid Infrastructures and European-level coordination, it is clear that the present model of grid management oversight is not suitable for the long term. Experience in the first two phases of the EGEE project has shown the value of distributed operations teams at several levels. The concept of Regional Operations Centres as a distributed management team has worked well, both providing effort to cover the operational oversight and ensuring the dissemination of knowledge and expertise. At a finer-grained level, several of the ROCs are themselves distributed teams, with similar benefits. From the experience in managing the sites, it has also become clear that grid-level monitoring, while vital in the earlier stages of the project, must evolve so that resource site managers receive monitoring results directly rather than waiting for trouble tickets to be opened by the grid-level operators. In addition, improvements in grid service management and monitoring must help the site and service managers to provide a robust and reliable service with the minimum of external intervention. This strategy is important for ensuring site reliability, and is necessary both for increasing the scale of the infrastructure without additional staff and for eventually reducing the level of effort at the grid oversight level.
There are a number of specific actions that will be taken during EGEE-III to address these issues. The most important of these is to set out a clear strategy for automation, and to set up a team to manage this strategy. This will be set out in a milestone (MSA1.1) which sets up an Operations Automation Team, whose role will be to provide the strategy and oversee implementation towards operations automation, including coordination of tool developments. The milestone will set out the roadmap. It will include a plan for monitoring tools and requirements for increasing automation of alarms, for tools needed to support operations, for improving reliability, and for verification of Service Level Agreements. A guiding principle will be to push information and responsibility for proactivity to the lowest level of service management, as close to the service itself as possible, to ensure fast response. This will start to reduce the need for higher-level oversight, and is expected to turn that role into one of monitoring service metrics such as accounting, site and service reliability, and performance.
A good set of tools has been developed in the first phases of the project; these will be further developed and coordinated at the level of information gathering and publishing. Standard formats and interfaces, already proposed as prototypes, will be further enhanced to permit data gathered by a variety of different tools to be presented to the service managers in a coherent and consistent way. Such standards will increase the scope for sharing of tools and monitors and avoid duplication of developments and monitoring.
An important adjunct to the tools themselves will be the continued and evolving use of Service Level Agreements to set standards for site and service levels. This mechanism, together with the reporting and publication of the associated metrics, will exert pressure to provide reliable and effective services, which will also help reduce the level of grid oversight needed.
For these proposals to bear fruit it is important that sufficient effort be assigned to the appropriate tool development tasks in this phase of the project.
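As an illustration only, the kind of Service Level Agreement check discussed above might be sketched as follows. This is a minimal sketch under stated assumptions: the record layout, the function names, the use of a simple test pass ratio as the availability metric, and the 90% target are all hypothetical and are not taken from any EGEE tool or agreement.

```python
from dataclasses import dataclass


@dataclass
class SiteReport:
    """Hypothetical summary of monitoring results for one site over a reporting period."""
    site: str
    tests_run: int
    tests_passed: int


def availability(report: SiteReport) -> float:
    """Fraction of monitoring tests passed; a site with no results counts as 0.0."""
    if report.tests_run == 0:
        return 0.0
    return report.tests_passed / report.tests_run


def sla_violations(reports: list[SiteReport], target: float) -> list[str]:
    """Return the names of sites whose availability falls below the agreed target."""
    return [r.site for r in reports if availability(r) < target]


# Assumed example data and an assumed target of 90%.
reports = [
    SiteReport("site-a", tests_run=100, tests_passed=99),
    SiteReport("site-b", tests_run=100, tests_passed=80),
    SiteReport("site-c", tests_run=0, tests_passed=0),
]
print(sla_violations(reports, target=0.90))
```

In this sketch a site that publishes no results is treated as failing the target, on the assumption that missing data should prompt action at the site rather than be silently ignored; a real agreement may define this differently.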
In SA2 there will be a focus on the EGEE Network Operations Centre (ENOC) as an integrated part of the grid operation, providing coordination with GÉANT2 and the NRENs. It will be integrated with GGUS, and the operations procedures will be adapted accordingly. Important network metrics, derived from monitoring systems developed in the network community, will be provided to the grid and site operators to ensure that network problems are managed in the same way as other grid services.
SA3 will continue to evolve its use of automated build systems, working closely with the ETICS-2 project. This is expected to improve the speed and effectiveness of the build and middleware integration processes. Automation of the testing and certification procedures will continue, again bringing increased efficiency, with the eventual goal of reducing the effort associated with the management of the system itself. In a similar vein, the use of virtual machines, which has already greatly simplified the management of the certification test-beds, will continue.
SA3 will enforce more stringent acceptance criteria for middleware so that low-level problems are caught before deployment rather than having to be solved later. These criteria must include appropriate and correct documentation, as well as important functionality that is currently missing, such as management interfaces and appropriate monitoring interfaces. These will also eventually help reduce the overall level of effort in operations.
Progress in these activities will be monitored through the deliverables providing assessments of the operations infrastructure (DSA1.2.1 and DSA1.2.2 at PM 11 and 22), and the mechanisms implemented will be documented in DSA1.5 (the Operations Cookbook).
The interactions and interdependencies between the Service and Joint Research Activities in EGEE-III are shown in the PERT chart below.