Parallelism is not new ! (first classification in 1972)
Motivations :
need more computing power (weather forecast, atomic simulation, genomics…)
need more storage capacity (petabytes and more)
in a word : improve performance ! 3 ways ...
Work harder --> Use faster hardware
Work smarter --> Optimize algorithms
Get help --> Use more computers !
Parallelism : the old classification
Flynn (1972)
Parallel architectures are classified by the number of instruction streams (single/multiple) and the number of data streams (single/multiple) processed at the same time : SISD, SIMD, MISD, MIMD
Pros :
easy to program : only one address space to exchange data (but the programmer must take care of synchronizing memory accesses : critical sections, see the sketch below)
Cons :
poor scalability : when the number of processors increases, the cost of transferring data becomes too high; more CPUs = more memory accesses over the interconnect = more memory bandwidth needed !
Direct transfer from proc. to proc. (->MPP)
Different interconnection schemes (a full interconnect is impossible : it grows in O(n²) while the number of processors grows in O(n)) : bus, crossbar, multistage crossbar, ...
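A minimal sketch of the critical-section issue mentioned above, assuming POSIX threads (the shared counter, thread count and loop bound are illustrative):

/* Critical section on a shared-memory machine: POSIX threads sketch,
   compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                     /* shared data: one address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);           /* enter the critical section */
        counter++;                           /* without the mutex, updates could be lost */
        pthread_mutex_unlock(&lock);         /* leave the critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);      /* 400000 thanks to the mutex */
    return 0;
}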
Massively Parallel Processors (1/2)
Several hundred nodes with a high speed interconnection network/switch
A shared-nothing architecture
each node owns its memory (distributed memory) and one or more processors, and runs its own copy of the OS
Massively Parallel Processors (2/2)
Pros :
good scalability
Cons :
communication between nodes is slower than in shared memory; improved interconnection schemes : hypercube, (2D or 3D) torus, fat tree, multistage crossbars
harder to program (see the MPI sketch below) :
data and/or tasks have to be explicitly distributed to the nodes
remote procedure calls (RPC, Java RMI)
message passing between nodes (PVM, MPI), with synchronous or asynchronous communications
DSM : Distributed Shared Memory : a virtual shared memory built on top of the distributed memories
upgrade : processors and/or communication ?
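A minimal message-passing sketch between two nodes, assuming MPI (buffer size and tag are illustrative; launch with e.g. mpirun -np 2):

/* Explicit message passing between nodes with MPI. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data[10] = {0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < 10; i++) data[i] = i;
        MPI_Send(data, 10, MPI_INT, 1, 3, MPI_COMM_WORLD);       /* to node 1, tag 3 */
    } else if (rank == 1) {
        MPI_Recv(data, 10, MPI_INT, 0, 3, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                              /* from node 0, tag 3 */
        printf("node 1 received data[9] = %d\n", data[9]);
    }
    MPI_Finalize();
    return 0;
}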
Constellations
a small number of processors (up to 16) clustered in SMP nodes (fast connection)
SMPs are connected through a less costly network with “poorer” performance
With DSM, memory may be addressed globally : each CPU has a global view of memory, and memory and cache coherence are guaranteed (ccNUMA)
Clusters
a collection of workstations (e.g. PCs) interconnected through a high speed network, acting as an MPP/DSM with network RAM and software RAID (redundant storage, parallel I/O)
clusters = a specialized version of NOW : Network Of Workstations
« A distributed system is a collection of independent computers that appear to the users of the system as a single computer » — Distributed Operating Systems, A. Tanenbaum, Prentice Hall, 1994
Where are we today (nov 22, 2004) ?
a source for efficient and up-to-date information : www.top500.org
the 500 fastest machines !
we are heading towards 100 TFlops
1 Flops = 1 floating-point operation per second
1 TFlops = 1000 GFlops = 1 000 000 MFlops = 10^12 Flops = one thousand billion operations per second
Today's best
comparison on the same dense linear algebra benchmark (Linpack) : solving Ax = b
Rank   Tflops   Country   Machine / site                    Nb of procs
1      70.72    USA       IBM BlueGene/L / DOE              32768
2      51.87    USA       SGI Columbia / NASA               10160
3      35.86    Japan     NEC Earth Simulator               5120
4      20.53    Spain     IBM MareNostrum                   3564
5      19.94    USA       California Digital Corporation    4096
41     3.98     France    HP AlphaServer / CEA              2560
NEC Earth Simulator
How does it grow ?
in 1993 (11 years ago!)
n°1 : 59.7 GFlops
n°500 : 0.4 GFlops
Sum = 1.17 TFlops
Problems of parallelism
Two models of parallelism :
driven by data flow : how to distribute data ?
driven by control flow : how to distribute tasks ?
Scheduling :
which task to execute, on which data, when ?
how to make the best use of compute time (overlapping communication and computation ?)
Communication
using shared memory ?
using explicit node to node communication ?
what about the network ?
Concurrent access
to memory (in shared memory systems)
to input/output (parallel Input/Output)
Performance ? Ideally, it grows linearly
Speed-up :
if TS is the best sequential time for a problem, then the time with P processors should be TP = TS/P !
Speedup = TS/TP
limited (Amdahl's law) : any program has a sequential part F and a parallel part T// : TS = F + T//, so the speedup is bounded : S = (F + T//)/(F + T///P), which tends to TS/F as P grows; if f = F/TS is the sequential fraction, then S < 1/f (worked example below)
Scale-up :
if TPS is the time to solve a problem of size S with P processors, then TPS should also be the time to solve a problem of size n*S with n*P processors
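A worked example of the Amdahl bound (the 10% sequential fraction is illustrative):

/* Amdahl's law: speedup for sequential fraction f and P processors. */
#include <stdio.h>

static double amdahl(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / p);        /* S = 1 / (f + (1-f)/P) */
}

int main(void)
{
    double f = 0.10;                         /* 10% of the work is sequential */
    for (int p = 1; p <= 1024; p *= 4)
        printf("P = %4d  speedup = %6.2f\n", p, amdahl(f, p));
    printf("bound 1/f = %.1f\n", 1.0 / f);   /* speedup can never exceed 10 here */
    return 0;
}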
Network performance analysis
scalability : can the network be extended ?
limited wire length, physical problems
fault tolerance : if one node is down ?
for instance in a hypercube
multiple accesses to the medium ?
deadlock ?
The metrics :
latency : time to initiate a communication
bandwidth : measured in MB/s
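A common first-order cost model, not stated on the slide but consistent with these two metrics: the time to send n bytes is roughly latency + n / bandwidth. A minimal sketch with illustrative figures:

/* First-order network cost model: t(n) = latency + n / bandwidth. */
#include <stdio.h>

int main(void)
{
    double latency = 50e-6;                  /* 50 microseconds, illustrative */
    double bandwidth = 100e6;                /* 100 MB/s, illustrative */
    long sizes[] = {8, 1024, 1024 * 1024};   /* message sizes in bytes */

    for (int i = 0; i < 3; i++) {
        double t = latency + sizes[i] / bandwidth;
        printf("%8ld bytes : %.6f s\n", sizes[i], t);
        /* small messages are dominated by latency, large ones by bandwidth */
    }
    return 0;
}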
Tools/environment for parallelism (1/2)
Communication between nodes :
Via global memory ! (if available, physical or virtual)
Otherwise :
low-level communication : sockets (see the sketch after this slide)
s = socket(AF_INET, SOCK_STREAM, 0);   /* create a TCP socket */
mid-level communication library (PVM, MPI)
info = pvm_initsend( PvmDataDefault );   /* initialize the send buffer */
info = pvm_pkint( array, 10, 1 );        /* pack 10 integers, stride 1 */
info = pvm_send( tid, 3 );               /* send to task tid with message tag 3 */
remote service/object call (RPC, RMI, CORBA)
the service runs on a remote node; only its name and its parameters (in, out) need to be known
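To complete the socket line above, a minimal TCP client sketch (the host address 192.168.0.1, port 5000 and message are illustrative):

/* Low-level communication with sockets: minimal TCP client. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);        /* TCP socket */
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5000);                    /* illustrative port */
    inet_pton(AF_INET, "192.168.0.1", &addr.sin_addr);

    if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
        const char *msg = "hello";
        send(s, msg, strlen(msg), 0);               /* explicit data transfer */
    } else {
        perror("connect");
    }
    close(s);
    return 0;
}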
Tools/environment for parallelism (2/2)
Programming tools
threads : lightweight processes sharing the same address space
data parallel languages (for DM architectures) :
HPF (High Performance Fortran)
the programmer says how data (arrays) are placed; the system infers the best placement of computation to minimize total execution time (e.g. by avoiding further communications)
task parallel languages (for SM architectures) :
OpenMP : compiler directives and library routines, based on threads. The parallel program stays close to the sequential one; it is a step by step transformation (see the sketch below)
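A minimal OpenMP sketch of this step-by-step transform: the loop below is the sequential program plus a single directive (array size is illustrative; compile with e.g. -fopenmp):

/* OpenMP: the sequential loop plus one compiler directive. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) b[i] = i;

    #pragma omp parallel for reduction(+:sum)   /* the only change vs. sequential */
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * b[i];
        sum += a[i];
    }

    printf("sum = %.0f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}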