Grid computing: An introduction Lionel Brunie National Institute of Applied Science (INSA) LIRIS Laboratory/DRIM Team – UMR CNRS 5205 Lyon, France http://liris.cnrs.fr/lionel.brunie
A Brain is a Lot of Data! (Mark Ellisman, UCSD)
Data Intensive Physical Sciences
High energy & nuclear physics
Simulation
Earth observation, climate modeling
Geophysics, earthquake modeling
Fluids, aerodynamic design
Pollutant dispersal scenarios
Astronomy- Digital sky surveys: modern telescopes produce over 10 Petabytes per year by 2008 !
Molecular genomics
Chemistry and biochemistry
Financial applications
Medical images
Performance evolution of computer components
Network vs. computer performance
Computer speed doubles every 18 months
Network speed doubles every 9 months
Disk capacity doubles every 12 months
1986 to 2000
Computers: x 500
Networks: x 340,000
2001 to 2010
Computers: x 60
Networks: x 4000
Conclusion: invest in networks !
Hansel and Gretel are lost in the forest of definitions
Distributed system
Parallel system
Cluster computing
Meta-computing
Grid computing
Peer to peer computing
Global computing
Internet computing
Network computing
Cloud computing
Distributed system
N autonomous computers (sites): n administrators, n data/control flows
an interconnection network
User view: one single (virtual) system
«A distributed system is a collection of independent computers that appear to the users of the system as a single computer » Distributed Operating Systems, A. Tanenbaum, Prentice Hall, 1994
« Traditional » programmer view: client-server
Parallel System
1 computer, n nodes: one administrator, one scheduler, one power source
memory: it depends
Programmer view: one single machine executing parallel codes. Various programming models (message passing, distributed shared memory, data parallelism…)
Examples of parallel system
Cluster computing
Use of PCs interconnected by a (high performance) network as a parallel (cheap) machine
Two main approaches
dedicated network (based on a high performance network: Myrinet, SCI, Infiniband, Fiber Channel...)
non-dedicated network (based on a (good) LAN)
Where are we today ?
A source for efficient and up-to-date information: www.top500.org
AMD x86_64 Opteron Six Core 2600 MHz (10.4 GFlops)
Rmax = 1759 – Rpeak = 2331
Power: 6,950 MW
http://www.nccs.gov/jaguar/
Network computing
From LAN (cluster) computing to WAN computing
Set of machines distributed over a MAN/WAN that are used to execute parallel loosely coupled codes
Depending on the infrastructure (soft and hard), network computing is derived in Internet computing, P2P, Grid computing, etc.
Meta computing (beginning 90’s)
Definitions become fuzzy...
A meta computer = set of (widely) distributed (high performance) processing resources that can be associated for processing a parallel not so loosely coupled code
A meta computer = parallel
virtual machine over a
distributed system
Internet computing
Use of (idle) computer interconnected by Internet for processing large throughput applications
Ex: SETI@HOME
5M+ users since launching
2009/11: 930k users, 2.4M computers; 190k active users, 278k active computers, 2M years of CPU time
“Coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organisations” (I. Foster)
Grid computing (2)
Information grid
large access to distributed data (the Web)
Data grid
management and processing of very large distributed data sets
Computing grid
meta computer
Parallelism vs grids: some recalls
Grids date back “only” 1996
Parallelism is older ! (first classification in 1972)
Motivations:
need more computing power (weather forecast, atomic simulation, genomics…)
need more storage capacity (Petabytes and more)
in a word: improve performance ! 3 ways ...
Work harder --> Use faster hardware
Work smarter --> Optimize algorithms
Get help --> Use more computers !
The performance ? Ideally it grows linearly
Speed-up:
if TS is the best time to process a problem sequentially,
then the parallel processing time should be TP=TS/P with P processors
speedup = TS/TP
the speedup is limited by Amdhal law: any parallel program has a purely sequential and a parallelizable part TS= F + T//,
thus the speedup is limited: S = (F + T//) / (F + (T///P)) < P
Scale-up:
if TPS is the time to solve a problem of size S with P processors,
then TPS should also be the time to process a problem of size n*S with n*P processors
Grid computing
Starting point
Real need for very high performance infrastructures
Basic idea: share computing resources
“The sharing that the GRID is concerned with is not primarily file exchange but rather direct access to computers, software, data, and other resources, as is required by a range of collaborative problem-solving and resource-brokering strategies emerging in industry, science, and engineering” (I. Foster)
Applications
Distributed supercomputing
High throughput computing
On demand (real time) computing
Data intensive computing
Collaborative computing
An Example Virtual Organization: CERN’s Large Hadron Collider Worldwide LHC Computing Grid (WLCG)
8000 Physicists, 170 Sites, 34 Countries
15 PB of data per year; 100,000 CPUs
Why Grid Computing (CERN opinion) ?
The answer is "money"... In 1999, the "LHC Computing Grid" was merely a concept on the drawing board for a computing system to store, process and analyse data produced from the Large Hadron Collider at CERN. However when work began on the design of the computing system for LHC data analysis, it rapidly became clear that the required computing power was far beyond the funding capacity available at CERN.
On the other hand, most of the laboratories and universities collaborating on the LHC had access to national or regional computing facilities.
The obvious question was: Could these facilities be somehow integrated to provide a single LHC computing service? The rapid evolution of wide area networking—increasing capacity and bandwidth coupled with falling costs—made it look possible. From there, the path to the LHC Computing Grid was set.
Multiple copies of data can be kept in different sites, ensuring access for all scientists involved, independent of geographical location.
Allows optimum use of spare capacity for multiple computer centres, making it more efficient.
Having computer centres in multiple time zones eases round-the-clock monitoring and the availability of expert support.
No single points of failure.
The cost of maintenance and upgrades is distributed, since individual institutes fund local computing resources and retain responsibility for these, while still contributing to the global goal.
Independently managed resources have encouraged novel approaches to computing and analysis.
So-called “brain drain”, where researchers are forced to leave their country to access resources, is reduced when resources are available from their desktop.
The system can be easily reconfigured to face new challenges, making it able to dynamically evolve throughout the life of the LHC, growing in capacity to meet the rising demands as more data is collected each year.
Provides considerable flexibility in deciding how and where to provide future computing resources.
Allows community to take advantage of new technologies that may appear and that offer improved usability, cost effectiveness or energy efficiency.
LCG System Architecture
A 4 layers Computing Model
Tier-0: CERN: accelerator
Data Acquisition and Reconstruction
Data Distribution to Tier-1 (~online)
Tier-1
24x7 Access and Availability,
Quasi-online data Acquisition
Data Service on the Grid
“Heavy” Analysis of the data
~10 countries
Tier-2
Simulation
Final User, Analysis of the data (batch and interactive modes)
~40 Countries
Tier-3
Final User, Scientific analysis
LCG System Architecture (Cont’d)
Back to roots (routes)
Railways, telephone, electricity, roads, bank system
clients (the citizens) are NOT providers (states or companies)
small number of actors/providers
small number of applications
strong supervision/control
Computational grid
“Hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to high-end computational capabilities” (I. Foster)
Performance criteria:
security
reliability
computing power
latency
throughput
scalability
services
Grid characteristics
Large scale
Heterogeneity
Multiple administration domain
Autonomy… and coordination
Dynamicity
Flexibility
Extensibility
Security
Levels of cooperation in a computing grid
End system (computer, disk, sensor…)
multithreading, local I/O
Cluster
synchronous communications, DSM, parallel I/O
parallel processing
Intranet/Organization
heterogeneity, distributed admin, distributed FS and databases
load balancing
access control
Internet/Grid
global supervision
brokers, negotiation, cooperation…
Basic services
Authentication/Authorization/Traceability
Activity control (monitoring)
Resource discovery
Resource brokering
Scheduling
Job submission, data access/migration and execution
Accounting
Layered Grid Architecture (By Analogy to Internet Architecture)
Elements of the Problem
Resource sharing
Computers, storage, sensors, networks, …
Heterogeneity of device, mechanism, policy
Sharing conditional: negotiation, payment, …
Coordinated problem solving
Integration of distributed resources
Compound quality of service requirements
Dynamic, multi-institutional virtual orgs
Dynamic overlays on classic organization structures
Map to underlying control mechanisms
Resources
Description
Advertising
Cataloging
Matching
Claiming
Reserving
Checkpointing
Resource management (1)
Services and protocols depend on the infrastructure
Some parameters
stability of the infrastructure (same set of resources or not)
freshness of the resource availability information
reservation facilities
multiple resource or single resource brokering
Example of request: I need from 10 to 100 CE each with at least 512 MB RAM and a computing power of 150 Mflops
Resource management and scheduling (1)
Levels of scheduling
job scheduling (global level ; perf: throughput)
resource scheduling (perf: fairness, utilization)
application scheduling (perf: response time, speedup, produced data…)
Mapping/Scheduling process
resource discovery and selection
assignment of tasks to computing resources
data distribution
task scheduling on the computing resources
(communication scheduling)
Resource management and scheduling (2)
Individual perfs are not necessarily consistent with the global (system) perf !
Data integration, data warehousing and analysis tools
Knowledge discovery and data mining
Functional View of Grid Data Management
Grid Security (1): Why Grid Security is Hard
Used resources may be extremely valuable & the problems to be solved extremely sensitive
Resources are located in distinct administrative domains
Each resource has its own policies & procedures
Users are diverse
The set of resources used by a single computation may be large, dynamic, and/or unpredictable
Not just client/server
The security service must be broadly available & applicable
Standard, well-tested, well-understood protocols
Integration with wide variety of tools
Grid security (2): Requirements
Authentication
Authorization and Delegation of authority
Assurance
Accounting
Auditing and Monitoring
Traceability
Integrity and Confidentiality (ACID properties)
Access to data and Mediation
Ciel, where are the data ?
Use case: Italian tourist – heart accident in Lyon
Data inside the grid # data at the side of the grid !
Basic idea
use of metadata/indexes. Pb: indexes are (sensitive) information
Alternative
encrypted indexes, use of views, proxies
Mediation
no single view of the world mechanisms for interoperability, ontologies
Negotiation: a key open issue
Motivation:
Motivation:
Collaborative caching is proved to be efficient
Each institution wants to control the access to its data
No standard exists in Grids for caching
Proposal:
on demand caching
a two-level cache: local caches and a global virtual cache
use metadata to collaborate / index data
Query optimization and execution
Old wine in new bottles ?
Yes and no: it seems the problem has not changed but the operational context has so changed that classical heuristics and methods are not more pertinent
Key issues:
Dynamicity
Unpredictability
Adaptability
Very few works have specifically addressed this problem
An application example: GGM Grille Geno-Médicale
An application example: GGM Biomedical grids
Biomedical applications are perfect candidates for gridification:
Huge volumes of data (an hospital = several TB per year)
Dissemination of data
Collaborative work (health networks)
Very hard requirements (e.g. response time)
But
Partially structured semantic data
Very strong privacy issues
→ a perfect play field for researchers !
An application example: GGM Motivation (1)
Dissemination of new “high bandwidth” technologies in genome and proteome research (e.g. micro-arrays)
huge volume of structural (gene localization)
functional (gene expression) data
Generalization of digital patient files and digital medical images
Implementation of (regional and inter-national) health networks
All information is available, people are connected to the network.
The question is: How can we use it ?
An application example: GGM Motivation (2)
Need for an information infrastructure to
index, exchange/share, process all this data
while preserving their privacy at a very large scale
That is... just a good grid!
Application objectives:
correlation of genomic and medical data: fundamental research and later medical decision making process
patient-centered medical data integration: patient’s monitoring in and out-side the hospital
epidemiology
training
An application example: GGM Motivation (3)
References: “Synergy between medical informatics and bioinformatics: facilitating genomic medicines for future healthcare”,
BIOINFOMED Working Group, Jan. 2003, European Commission
Proceedings of Healthgrid conferences (1st edition in Lyon(2003))
“The goal of the GGM project is, on top of a grid infrastructure, to propose a software architecture able to manage heterogeneous and dynamic data stored in distributed warehouses for intensive analysis and processing purposes.”
“The goal of the GGM project is, on top of a grid infrastructure, to propose a software architecture able to manage heterogeneous and dynamic data stored in distributed warehouses for intensive analysis and processing purposes.”
Distributed Data Warehouses
Query Optimization
Data Access [and Control]
Data Mining
An application example: GGM Data
A piece of medical data (age, image, biological result, salient object in an image) has a meaning
It conveys information that can be interpreted (in multiple ways !)
Meta-data can be attached to medical data… or not
pre-processing is necessary
Medical data are often private
privacy/delegation
The medical data of a patient are often disseminated over multiple sites
access rights/authentication problem, collection/integration of data into partial views, identification of data/users
Medical (meta-)data are complex and not yet (fully) standardized
no global structure
Virtual Data Warehouses on the Grid
Virtual Data Warehouses on the Grid (1)
Almost nothing…
Why is it so difficult ?
multiple administrative domains
very sensitive data => security/privacy issues
wide distribution
unpredictability
relationship with data replica
heterogeneity
dynamicity (permanent production of large volumes of data)
Centralized data warehouse ?
Not realistic at a large scale and not acceptable
Virtual Data Warehouses on the Grid (2)
A possible direction of research: virtual data warehouses on the grid
Components:
a federated schema
a set of partial views (“chunks”) materialized at the local system level
Advantages
Flexibility wrt users’ needs
Good use of the storage capacity of the grid and scalability
Security control at the local level
Global view of the disseminated data
Virtual Data Warehouses on the Grid (3)
Drawbacks and open issues
maintenance protocols
indexing tools
access to data and negotiation
query processing
Access to data and collaborative brokers
Access to data and collaborative brokers (1)
Brokers act as interfaces between data, services and applications
Possible locations
at the interface between the grid and the external data repositories