Applications of GPU Computing
Alex Karantza
0306-722 Advanced Computer Architecture, Fall 2011
Outline
• Introduction
• GPU Architecture
▫ Multiprocessing
▫ Vector ISA
• GPUs in Industry
▫ Scientific Computing
▫ Image Processing
▫ Databases
• Examples and Benefits
Introduction
“GPUs have evolved to the point where many real-world applications are easily implemented on them and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs.”
- Prof. Jack Dongarra, director of the Innovative Computing Laboratory at the University of Tennessee, author of LINPACK
(As typified by NVIDIA CUDA)
GPU Architecture
• Parallel coprocessor to conventional CPUs
▫ Implements a SIMD structure: multiple threads running the same code
• Grid of blocks of threads
▫ Thread-local registers
▫ Block-local memory and control
▫ Global memory
Grids, Blocks, and Threads
• Thread → thread processor: contains local registers and memory; a scalar processor
• Thread block → multiprocessor: shared memory and registers; shared control logic
• Grid → device(s): global memory, can be easily distributed across devices
GPU Architecture
• Processors also implement vector instructions
▫ Vectors of length 2, 3, or 4 of any fundamental type: integer, float, bits, predicate
▫ Instructions for conversion between vector and scalar
• To encourage uniform execution, use predicates rather than branching for conditionals
▫ All instructions can be conditionally executed based on predicate registers
Vectors and Predicates
.global .v4 .f32 V;   // a length-4 vector of floats
.shared .v2 .u16 uv;  // a length-2 vector of unsigned shorts
.global .v4 .b8 v;    // a length-4 vector of bytes
.reg .s32 a, b;       // two 32-bit signed ints
.reg .pred p;         // a predicate register

setp.lt.s32 p, a, b;           // set p if a < b
@p add.v4.f32 V, V, {1,0,0,0}; // if p, V.x = V.x + 1
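The effect of predicated execution can be mimicked on a conventional CPU with a branchless sketch (the helper name `predicated_inc` is hypothetical): compute a 0/1 predicate and apply it arithmetically, so every lane executes the same instruction stream.

```c
/* CPU analogue of the predicated add above: compute a 0/1 predicate
 * and use it arithmetically, so no branch is taken. */
int predicated_inc(int x, int bound) {
    int p = (x < bound);  /* like setp.lt: p = 1 if x < bound, else 0 */
    return x + p;         /* like @p add: a no-op when p == 0 */
}
```

For example, `predicated_inc(3, 5)` yields 4, while `predicated_inc(8, 5)` leaves the value at 8.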
NSF Keeneland
• 360 Tesla 20-series GPUs
GPUs in Industry
• Many applications have been developed to use GPUs for supercomputing in various fields
▫ Scientific Computing: CFD, molecular dynamics, genome sequencing, mechanical simulation, quantum electrodynamics
▫ Image Processing: registration, interpolation, feature detection, recognition, filtering
▫ Data Analysis: databases, sorting and searching, data mining
Major Categories of Algorithm
• 2D/3D filtering operations
• n-body simulations
• Parallel tree operations: searching and sorting
• All suited to GPUs because of data-parallel requirements and uniform kernels
Computational Fluid Dynamics
• Simulate fluids in a discrete volume over time
• Involves solving the Navier-Stokes partial differential equations iteratively on a grid
▫ Can be considered a filtering operation
• When parallelized on a GPU using multigrid solvers, 10x speedups have been reported
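The grid update at the heart of such solvers can be sketched as a single Jacobi relaxation sweep (an illustrative sketch, assuming a Laplace-type problem with fixed boundaries, not any specific solver's code). Every interior cell depends only on the previous iterate, so each cell can be one GPU thread.

```c
#define N 4  /* grid side; interior cells are 1..N-2 */

/* One Jacobi relaxation sweep: each interior cell becomes the average
 * of its four neighbors.  All reads come from the previous iterate
 * `in`, so every cell update is independent -- one GPU thread per cell. */
void jacobi_sweep(double in[N][N], double out[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            out[i][j] = in[i][j];                 /* carry boundary over */
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j] +
                                in[i][j-1] + in[i][j+1]);
}
```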
Molecular Dynamics
• Large set of particles with forces between them: protein behavior, material simulation
• Calculating forces between particles can be done in parallel for each particle
• Accumulation of forces can be implemented as multilevel parallel sums
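A multilevel parallel sum can be sketched as a tree reduction (illustrative; assumes a power-of-two element count). At each level, the halves of the array are added pairwise; on a GPU, each addition at a level is performed by a separate thread, leaving only log2(n) sequential steps.

```c
#include <stddef.h>

/* Tree reduction of n partial forces (n must be a power of two).
 * The inner loop at each level has no cross-iteration dependencies,
 * so a GPU runs it with one thread per addition.  Destroys the input. */
double tree_sum(double *vals, size_t n) {
    for (size_t stride = n / 2; stride > 0; stride /= 2)
        for (size_t i = 0; i < stride; i++)   /* parallel on a GPU */
            vals[i] += vals[i + stride];
    return vals[0];
}
```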
Genetics
• Large strings of genome sequences must be searched through to organize and identify samples
• GPUs enable multiple parallel queries to the database to perform string matching
• Again, order-of-magnitude speedups are reported
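A minimal sketch of the per-query work (naive exact matching, not any particular sequencing tool's algorithm): each query is independent, so a GPU can assign one block per query, and the candidate start positions within a query can be checked by separate threads.

```c
#include <string.h>

/* Naive exact substring search: returns the index of the first
 * occurrence of `query` in `genome`, or -1.  Queries are independent
 * (one GPU block per query), and the loop over start positions has no
 * dependencies (one thread per position). */
int find_match(const char *genome, const char *query) {
    size_t n = strlen(genome), m = strlen(query);
    for (size_t i = 0; i + m <= n; i++)
        if (memcmp(genome + i, query, m) == 0)
            return (int)i;
    return -1;
}
```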
Electrodynamics
• Simulation of electric fields and Coulomb forces
• Requires iterative solving of partial differential equations
• Cell phone modeling applications have reported 50x speedups using GPUs
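The Coulomb-force part of such simulations parallelizes like an n-body problem. A one-dimensional sketch (the helper is hypothetical; real codes work with 3D vectors) computes the net force on one particle, and a GPU assigns one thread per particle:

```c
#define COULOMB_K 8.9875517923e9  /* Coulomb constant, N*m^2/C^2 */

/* Net 1D Coulomb force on particle i, with positions x[] in meters and
 * charges q[] in coulombs.  Each particle's accumulation over all other
 * particles is independent -- one GPU thread per particle. */
double coulomb_force(const double *x, const double *q, int n, int i) {
    double f = 0.0;
    for (int j = 0; j < n; j++) {
        if (j == i) continue;
        double r = x[i] - x[j];                  /* signed separation */
        double dir = (r > 0.0) ? 1.0 : -1.0;     /* repulsion direction */
        f += COULOMB_K * q[i] * q[j] * dir / (r * r);
    }
    return f;
}
```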
Image Processing
• Medical imaging was the early adopter
▫ Registration of massive 3D voxel images
▫ Both the cost function for deformable registration and the interpolation of results are filtering operations
• Generic feature detection, recognition, and object extraction are all filters
• For object recognition, one can search a database of objects in parallel
• Offloading these algorithms from the CPU can allow real-time interaction
Data Analysis
• Huge databases for web services require instant results for many simultaneous users
• The data does not fit in main memory, and disk is too slow and does not allow parallel reads
• GPUs can split up the data and perform fast searches, each keeping its section in memory
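The idea can be sketched as a partitioned scan (illustrative; the function name and partitioning scheme are assumptions): the data set is split into contiguous sections, one per device, and every section is searched independently out of its own memory.

```c
#include <stddef.h>

/* Partitioned linear search: data is split into `nparts` contiguous
 * sections (as each GPU would hold one section in its own memory) and
 * each section is scanned independently.  Returns the global index of
 * `key`, or -1 if absent. */
int partitioned_find(const int *data, size_t n, size_t nparts, int key) {
    size_t chunk = (n + nparts - 1) / nparts;   /* section size, rounded up */
    for (size_t p = 0; p < nparts; p++) {       /* one device per iteration */
        size_t lo = p * chunk;
        size_t hi = (lo + chunk < n) ? lo + chunk : n;
        for (size_t i = lo; i < hi; i++)
            if (data[i] == key) return (int)i;
    }
    return -1;
}
```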
Example: Filtering Operation
• Many algorithms can be reduced to a filtering operation. As an example, consider image convolution for blurring:
Kernel = Gaussian2D(size);
for (x,y) in Input {
for (p,q) in Kernel {
Output(x,y) += Input(x+p,y+q) * Kernel(p,q);
}
}
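A runnable version of this pseudocode might look as follows (a sketch assuming a square image, an odd-sized centered kernel, and zero padding at the borders); on a GPU, the two outer loops become one thread per output pixel.

```c
/* Direct 2D convolution: n x n image, odd m x m kernel, zero padding.
 * Each output pixel is computed independently -- one GPU thread each. */
void convolve2d(const double *in, double *out, int n,
                const double *kernel, int m) {
    int r = m / 2;  /* kernel radius */
    for (int y = 0; y < n; y++)
        for (int x = 0; x < n; x++) {
            double acc = 0.0;
            for (int q = -r; q <= r; q++)
                for (int p = -r; p <= r; p++) {
                    int xx = x + p, yy = y + q;
                    if (xx >= 0 && xx < n && yy >= 0 && yy < n)
                        acc += in[yy * n + xx] * kernel[(q + r) * m + (p + r)];
                }
            out[y * n + x] = acc;
        }
}
```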
Example: Filtering Operation
• A quick optimization available for many filters: if the kernel is separable, the convolution can be done in one pass per dimension (note that the second pass must read the first pass's result, not the original input)

Kernel = Gaussian1D(size);
for (x,y) in Input {
for (p) in Kernel {
Temp(x,y) += Input(x+p,y) * Kernel(p);
}
}
for (x,y) in Temp {
for (q) in Kernel {
Output(x,y) += Temp(x,y+q) * Kernel(q);
}
}
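The two-pass version can be sketched in runnable form (same assumptions as before: square image, odd 1D kernel, zero padding; the intermediate buffer is the caller's responsibility):

```c
/* Separable convolution: a horizontal pass writes `tmp`, then a
 * vertical pass reads `tmp` into `out`.  Both passes run one GPU
 * thread per pixel. */
void convolve_separable(const double *in, double *tmp, double *out,
                        int n, const double *k, int m) {
    int r = m / 2;  /* kernel radius */
    for (int y = 0; y < n; y++)              /* horizontal pass */
        for (int x = 0; x < n; x++) {
            double acc = 0.0;
            for (int p = -r; p <= r; p++)
                if (x + p >= 0 && x + p < n)
                    acc += in[y * n + (x + p)] * k[p + r];
            tmp[y * n + x] = acc;
        }
    for (int y = 0; y < n; y++)              /* vertical pass reads tmp */
        for (int x = 0; x < n; x++) {
            double acc = 0.0;
            for (int q = -r; q <= r; q++)
                if (y + q >= 0 && y + q < n)
                    acc += tmp[(y + q) * n + x] * k[q + r];
            out[y * n + x] = acc;
        }
}
```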
Example: Filtering Operation
• This is still O(2·n²·m) on a sequential processor (for an n×n image and a length-m kernel), versus O(n²·m²) for the direct 2D convolution
• Each output pixel is independent, but shares spatially local data and a constant kernel
UploadGPU(Kernel, CONSTANT);
UploadGPU(Input, TEXTURE);
ConvolveColumnsGPU();
ConvolveRowsGPU();
DownloadGPU(Output, TEXTURE);
Example: Filtering Operation
• Complexity remains the same; however, each multiply-accumulate (MAC) instruction can be executed on as many processors as are available, and memory can be accessed quickly because of the assignment of blocks and texture memory
• In practice, the overhead of uploading to and downloading from the GPU is far less than the performance gained in the kernel