E-book, English, 514 pages
Kirk / Hwu Programming Massively Parallel Processors
2nd edition, 2012
ISBN: 978-0-12-391418-7
Publisher: Elsevier Science & Techn.
Format: EPUB
Copy protection: ePub watermark
A Hands-on Approach
David B. Kirk is well recognized for his contributions to graphics hardware and algorithm research. Before beginning his doctoral studies at Caltech, he earned B.S. and M.S. degrees in mechanical engineering from MIT and worked as an engineer for Raster Technologies and Hewlett-Packard's Apollo Systems Division. After receiving his doctorate, he joined Crystal Dynamics, a video-game manufacturing company, as chief scientist and head of technology. In 1997, he took the position of Chief Scientist at NVIDIA, a leader in visual computing technologies, and he is currently an NVIDIA Fellow. At NVIDIA, Kirk led graphics-technology development for some of today's most popular consumer-entertainment platforms, playing a key role in providing mass-market graphics capabilities previously available only on workstations costing hundreds of thousands of dollars. For his role in bringing high-performance graphics to personal computers, Kirk received the 2002 Computer Graphics Achievement Award from the Association for Computing Machinery's Special Interest Group on Graphics and Interactive Techniques (ACM SIGGRAPH) and, in 2006, was elected to the National Academy of Engineering, one of the highest professional distinctions for engineers. Kirk holds 50 patents and patent applications relating to graphics design, has published more than 50 articles on graphics technology, has won several best-paper awards, and edited the book Graphics Gems III. A technology evangelist who cares deeply about education, he has supported new curriculum initiatives at Caltech and has been a frequent university lecturer and conference keynote speaker worldwide.
Authors/Editors
Further Information & Material
1;Front Cover;1
2;Programming Massively Parallel Processors;4
3;Copyright Page;5
4;Contents;6
5;Preface;14
5.1;Target Audience;15
5.2;How to Use the Book;15
5.2.1;A Three-Phased Approach;16
5.2.2;Tying It All Together: The Final Project;16
5.2.2.1;Project Workshop;17
5.2.2.2;Design Document;17
5.2.2.3;Project Report;18
5.3;Online Supplements;18
6;Acknowledgements;20
7;Dedication;22
8;1 Introduction;24
8.1;1.1 Heterogeneous Parallel Computing;25
8.2;1.2 Architecture of a Modern GPU;31
8.3;1.3 Why More Speed or Parallelism?;33
8.4;1.4 Speeding Up Real Applications;35
8.5;1.5 Parallel Programming Languages and Models;37
8.6;1.6 Overarching Goals;39
8.7;1.7 Organization of the Book;40
8.8;References;44
9;2 History of GPU Computing;46
9.1;2.1 Evolution of Graphics Pipelines;46
9.1.1;The Era of Fixed-Function Graphics Pipelines;47
9.1.2;Evolution of Programmable Real-Time Graphics;51
9.1.3;Unified Graphics and Computing Processors;54
9.2;2.2 GPGPU: An Intermediate Step;56
9.3;2.3 GPU Computing;57
9.3.1;Scalable GPUs;58
9.3.2;Recent Developments;59
9.3.3;Future Trends;60
9.4;References and Further Reading;60
10;3 Introduction to Data Parallelism and CUDA C;64
10.1;3.1 Data Parallelism;65
10.2;3.2 CUDA Program Structure;66
10.3;3.3 A Vector Addition Kernel;68
10.4;3.4 Device Global Memory and Data Transfer;71
10.5;3.5 Kernel Functions and Threading;76
10.6;3.6 Summary;82
10.6.1;Function Declarations;82
10.6.2;Kernel Launch;82
10.6.3;Predefined Variables;82
10.6.4;Runtime API;83
10.7;3.7 Exercises;83
10.8;References;85
11;4 Data-Parallel Execution Model;86
11.1;4.1 CUDA Thread Organization;87
11.2;4.2 Mapping Threads to Multidimensional Data;91
11.3;4.3 Matrix-Matrix Multiplication—A More Complex Kernel;97
11.4;4.4 Synchronization and Transparent Scalability;104
11.5;4.5 Assigning Resources to Blocks;106
11.6;4.6 Querying Device Properties;108
11.7;4.7 Thread Scheduling and Latency Tolerance;110
11.8;4.8 Summary;114
11.9;4.9 Exercises;114
12;5 CUDA Memories;118
12.1;5.1 Importance of Memory Access Efficiency;119
12.2;5.2 CUDA Device Memory Types;120
12.3;5.3 A Strategy for Reducing Global Memory Traffic;128
12.4;5.4 A Tiled Matrix–Matrix Multiplication Kernel;132
12.5;5.5 Memory as a Limiting Factor to Parallelism;138
12.6;5.6 Summary;141
12.7;5.7 Exercises;142
13;6 Performance Considerations;146
13.1;6.1 Warps and Thread Execution;147
13.2;6.2 Global Memory Bandwidth;155
13.3;6.3 Dynamic Partitioning of Execution Resources;164
13.4;6.4 Instruction Mix and Thread Granularity;166
13.5;6.5 Summary;168
13.6;6.6 Exercises;168
13.7;References;172
14;7 Floating-Point Considerations;174
14.1;7.1 Floating-Point Format;175
14.1.1;Normalized Representation of M;175
14.1.2;Excess Encoding of E;176
14.2;7.2 Representable Numbers;178
14.3;7.3 Special Bit Patterns and Precision in IEEE Format;183
14.4;7.4 Arithmetic Accuracy and Rounding;184
14.5;7.5 Algorithm Considerations;185
14.6;7.6 Numerical Stability;187
14.7;7.7 Summary;192
14.8;7.8 Exercises;193
14.9;References;194
15;8 Parallel Patterns: Convolution;196
15.1;8.1 Background;197
15.2;8.2 1D Parallel Convolution—A Basic Algorithm;202
15.3;8.3 Constant Memory and Caching;204
15.4;8.4 Tiled 1D Convolution with Halo Elements;208
15.5;8.5 A Simpler Tiled 1D Convolution—General Caching;215
15.6;8.6 Summary;216
15.7;8.7 Exercises;217
16;9 Parallel Patterns: Prefix Sum;220
16.1;9.1 Background;221
16.2;9.2 A Simple Parallel Scan;223
16.3;9.3 Work Efficiency Considerations;227
16.4;9.4 A Work-Efficient Parallel Scan;228
16.5;9.5 Parallel Scan for Arbitrary-Length Inputs;233
16.6;9.6 Summary;237
16.7;9.7 Exercises;238
16.8;Reference;239
17;10 Parallel Patterns: Sparse Matrix–Vector Multiplication;240
17.1;10.1 Background;241
17.2;10.2 Parallel SpMV Using CSR;245
17.3;10.3 Padding and Transposition;247
17.4;10.4 Using Hybrid to Control Padding;249
17.5;10.5 Sorting and Partitioning for Regularization;253
17.6;10.6 Summary;255
17.7;10.7 Exercises;256
17.8;References;257
18;11 Application Case Study: Advanced MRI Reconstruction;258
18.1;11.1 Application Background;259
18.2;11.2 Iterative Reconstruction;262
18.3;11.3 Computing FHD;264
18.3.1;Step 1: Determine the Kernel Parallelism Structure;266
18.3.2;Step 2: Getting Around the Memory Bandwidth Limitation;272
18.3.3;Step 3: Using Hardware Trigonometry Functions;278
18.3.4;Step 4: Experimental Performance Tuning;282
18.4;11.4 Final Evaluation;283
18.5;11.5 Exercises;285
18.6;References;287
19;12 Application Case Study: Molecular Visualization and Analysis;288
19.1;12.1 Application Background;289
19.2;12.2 A Simple Kernel Implementation;291
19.3;12.3 Thread Granularity Adjustment;295
19.4;12.4 Memory Coalescing;297
19.5;12.5 Summary;300
19.6;12.6 Exercises;302
19.7;References;302
20;13 Parallel Programming and Computational Thinking;304
20.1;13.1 Goals of Parallel Computing;305
20.2;13.2 Problem Decomposition;306
20.3;13.3 Algorithm Selection;310
20.4;13.4 Computational Thinking;316
20.5;13.5 Summary;317
20.6;13.6 Exercises;317
20.7;References;318
21;14 An Introduction to OpenCL™;320
21.1;14.1 Background;320
21.2;14.2 Data Parallelism Model;322
21.3;14.3 Device Architecture;324
21.4;14.4 Kernel Functions;326
21.5;14.5 Device Management and Kernel Launch;327
21.6;14.6 Electrostatic Potential Map in OpenCL;330
21.7;14.7 Summary;334
21.8;14.8 Exercises;335
21.9;References;336
22;15 Parallel Programming with OpenACC;338
22.1;15.1 OpenACC Versus CUDA C;338
22.2;15.2 Execution Model;341
22.3;15.3 Memory Model;342
22.4;15.4 Basic OpenACC Programs;343
22.4.1;Parallel Construct;343
22.4.1.1;Parallel Region, Gangs, and Workers;343
22.4.2;Loop Construct;345
22.4.2.1;Gang Loop;345
22.4.2.2;Worker Loop;346
22.4.2.3;OpenACC Versus CUDA;346
22.4.2.4;Vector Loop;349
22.4.3;Kernels Construct;350
22.4.3.1;Prescriptive Versus Descriptive;350
22.4.3.2;Ways to Help an OpenACC Compiler;352
22.4.4;Data Management;354
22.4.4.1;Data Clauses;354
22.4.4.2;Data Construct;355
22.4.5;Asynchronous Computation and Data Transfer;358
22.5;15.5 Future Directions of OpenACC;359
22.6;15.6 Exercises;360
23;16 Thrust: A Productivity-Oriented Library for CUDA;362
23.1;16.1 Background;362
23.2;16.2 Motivation;365
23.3;16.3 Basic Thrust Features;366
23.3.1;Iterators and Memory Space;367
23.3.2;Interoperability;368
23.4;16.4 Generic Programming;370
23.5;16.5 Benefits of Abstraction;372
23.6;16.6 Programmer Productivity;372
23.6.1;Robustness;373
23.6.2;Real-World Performance;373
23.7;16.7 Best Practices;375
23.7.1;Fusion;376
23.7.2;Structure of Arrays;377
23.7.3;Implicit Ranges;379
23.8;16.8 Exercises;380
23.9;References;381
24;17 CUDA FORTRAN;382
24.1;17.1 CUDA FORTRAN and CUDA C Differences;383
24.2;17.2 A First CUDA FORTRAN Program;384
24.3;17.3 Multidimensional Array in CUDA FORTRAN;386
24.4;17.4 Overloading Host/Device Routines With Generic Interfaces;387
24.5;17.5 Calling CUDA C via iso_c_binding;390
24.6;17.6 Kernel Loop Directives and Reduction Operations;392
24.7;17.7 Dynamic Shared Memory;393
24.8;17.8 Asynchronous Data Transfers;394
24.9;17.9 Compilation and Profiling;400
24.10;17.10 Calling Thrust from CUDA FORTRAN;401
24.11;17.11 Exercises;405
25;18 An Introduction to C++ AMP;406
25.1;18.1 Core C++ AMP Features;407
25.2;18.2 Details of the C++ AMP Execution Model;414
25.2.1;Explicit and Implicit Data Copies;414
25.2.2;Asynchronous Operation;416
25.2.3;Section Summary;418
25.3;18.3 Managing Accelerators;418
25.4;18.4 Tiled Execution;421
25.5;18.5 C++ AMP Graphics Features;424
25.6;18.6 Summary;428
25.7;18.7 Exercises;428
26;19 Programming a Heterogeneous Computing Cluster;430
26.1;19.1 Background;431
26.2;19.2 A Running Example;431
26.3;19.3 MPI Basics;433
26.4;19.4 MPI Point-to-Point Communication Types;437
26.5;19.5 Overlapping Computation and Communication;444
26.6;19.6 MPI Collective Communication;454
26.7;19.7 Summary;454
26.8;19.8 Exercises;455
26.9;Reference;456
27;20 CUDA Dynamic Parallelism;458
27.1;20.1 Background;459
27.2;20.2 Dynamic Parallelism Overview;461
27.3;20.3 Important Details;462
27.3.1;Launch Environment Configuration;462
27.3.2;API Errors and Launch Failures;462
27.3.3;Events;462
27.3.4;Streams;463
27.3.5;Synchronization Scope;464
27.4;20.4 Memory Visibility;465
27.4.1;Global Memory;465
27.4.2;Zero-Copy Memory;465
27.4.3;Constant Memory;465
27.4.3.1;Local Memory;465
27.4.3.2;Shared Memory;466
27.4.4;Texture Memory;466
27.5;20.5 A Simple Example;467
27.6;20.6 Runtime Limitations;469
27.6.1;Memory Footprint;469
27.6.2;Nesting Depth;471
27.6.3;Memory Allocation and Lifetime;471
27.6.4;ECC Errors;472
27.6.5;Streams;472
27.6.6;Events;472
27.6.7;Launch Pool;472
27.7;20.7 A More Complex Example;472
27.7.1;Linear Bezier Curves;473
27.7.2;Quadratic Bezier Curves;473
27.7.3;Bezier Curve Calculation (Predynamic Parallelism);473
27.7.4;Bezier Curve Calculation (with Dynamic Parallelism);476
27.8;20.8 Summary;479
27.9;Reference;480
28;21 Conclusion and Future Outlook;482
28.1;21.1 Goals Revisited;482
28.2;21.2 Memory Model Evolution;484
28.3;21.3 Kernel Execution Control Evolution;487
28.4;21.4 Core Performance;490
28.5;21.5 Programming Environment;490
28.6;21.6 Future Outlook;491
28.7;References;492
29;Appendix A: Matrix Multiplication Host-Only Version Source Code;494
29.1;A.1 matrixmul.cu;494
29.2;A.2 matrixmul_gold.cpp;497
29.3;A.3 matrixmul.h;498
29.4;A.4 assist.h;499
29.5;A.5 Expected Output;503
30;Appendix B: GPU Compute Capabilities;504
30.1;B.1 GPU Compute Capability Tables;504
30.2;B.2 Memory Coalescing Variations;505
31;Index;510
Chapter 1
Introduction
Chapter Outline
1.1 Heterogeneous Parallel Computing
1.2 Architecture of a Modern GPU
1.3 Why More Speed or Parallelism?
1.4 Speeding Up Real Applications
1.5 Parallel Programming Languages and Models
1.6 Overarching Goals
1.7 Organization of the Book
Microprocessors based on a single central processing unit (CPU), such as those in the Intel Pentium family and the AMD Opteron family, drove rapid performance increases and cost reductions in computer applications for more than two decades. These microprocessors brought GFLOPS, or giga (10^9) floating-point operations per second, to the desktop and TFLOPS, or tera (10^12) floating-point operations per second, to cluster servers. This relentless drive for performance improvement has allowed application software to provide more functionality, have better user interfaces, and generate more useful results. The users, in turn, demand even more improvements once they become accustomed to these improvements, creating a positive (virtuous) cycle for the computer industry.
This drive, however, has slowed since 2003 due to energy consumption and heat dissipation issues that limited the increase of the clock frequency and the level of productive activities that can be performed in each clock period within a single CPU. Since then, virtually all microprocessor vendors have switched to models where multiple processing units, referred to as processor cores, are used in each chip to increase the processing power. This switch has exerted a tremendous impact on the software developer community [Sutter2005].
Traditionally, the vast majority of software applications have been written as sequential programs, following the model described by von Neumann in his seminal report in 1945 [vonNeumann1945]. The execution of these programs can be understood by a human sequentially stepping through the code. Historically, most software developers have relied on the advances in hardware to increase the speed of their sequential applications under the hood; the same software simply runs faster as each new generation of processors is introduced. Computer users have likewise become accustomed to the expectation that these programs run faster with each new generation of microprocessors. That expectation is no longer valid. A sequential program will only run on one of the processor cores, which will not become significantly faster than those in use today. Without performance improvement, application developers will no longer be able to introduce new features and capabilities into their software as new microprocessors are introduced, reducing the growth opportunities of the entire computer industry.
Rather, the applications software that will continue to enjoy performance improvement with each new generation of microprocessors will be parallel programs, in which multiple threads of execution cooperate to complete the work faster. This new, dramatically escalated incentive for parallel program development has been referred to as the concurrency revolution [Sutter2005]. The practice of parallel programming is by no means new. The high-performance computing community has been developing parallel programs for decades. These programs run on large-scale, expensive computers. Only a few elite applications can justify the use of these expensive computers, thus limiting the practice of parallel programming to a small number of application developers. Now that all new microprocessors are parallel computers, the number of applications that need to be developed as parallel programs has increased dramatically. There is now a great need for software developers to learn about parallel programming, which is the focus of this book.
1.1 Heterogeneous Parallel Computing
Since 2003, the semiconductor industry has settled on two main trajectories for designing microprocessors [Hwu2008]. The multicore trajectory seeks to maintain the execution speed of sequential programs while moving into multiple cores. The multicores began with two-core processors, with the number of cores increasing with each semiconductor process generation. A current exemplar is the recent Intel Core i7 microprocessor with four processor cores, each of which is an out-of-order, multiple-instruction-issue processor implementing the full x86 instruction set and supporting hyperthreading with two hardware threads, designed to maximize the execution speed of sequential programs. In contrast, the many-thread trajectory focuses more on the execution throughput of parallel applications. The many-thread processors began with a large number of threads, and once again, the number of threads increases with each generation. A current exemplar is the NVIDIA GTX680 graphics processing unit (GPU) with 16,384 threads, executing in a large number of simple, in-order pipelines.
Many-thread processors, especially GPUs, have led the race in floating-point performance since 2003. As of 2012, the ratio of peak floating-point calculation throughput between many-thread GPUs and multicore CPUs is about 10. These are not necessarily application speeds; they are merely the raw speeds that the execution resources can potentially support in these chips: 1.5 teraflops versus 150 gigaflops double precision in 2012.
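The factor of 10 is simply the quotient of the two peak figures; a quick sanity check (the two peak values are copied from the text, not measured):

```python
# Back-of-the-envelope check of the peak-throughput ratio cited for
# 2012-era chips (values taken from the text above).
gpu_peak_flops = 1.5e12   # ~1.5 TFLOPS double precision, many-thread GPU
cpu_peak_flops = 150e9    # ~150 GFLOPS double precision, multicore CPU

ratio = gpu_peak_flops / cpu_peak_flops
print(f"peak throughput ratio: {ratio:.0f}x")
```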
Such a large performance gap between parallel and sequential execution has amounted to a significant “electrical potential” build-up, and at some point, something will have to give. We have reached that point now. To date, this large performance gap has already motivated many application developers to move the computationally intensive parts of their software to GPUs for execution. Not surprisingly, these computationally intensive parts are also the prime target of parallel programming—when there is more work to do, there is more opportunity to divide the work among cooperating parallel workers.
One might ask why there is such a large peak-performance gap between many-thread GPUs and general-purpose multicore CPUs. The answer lies in the differences in the fundamental design philosophies between the two types of processors, as illustrated in Figure 1.1. The design of a CPU is optimized for sequential code performance. It makes use of sophisticated control logic to allow instructions from a single thread to execute in parallel, or even out of their sequential order, while maintaining the appearance of sequential execution. More importantly, large cache memories are provided to reduce the instruction and data access latencies of large, complex applications. Neither control logic nor cache memories contribute to the peak calculation speed. As of 2012, the high-end general-purpose multicore microprocessors typically have six to eight large processor cores and multiple megabytes of on-chip cache memories designed to deliver strong sequential code performance.
Figure 1.1 CPUs and GPUs have fundamentally different design philosophies.
Memory bandwidth is another important issue. The speed of many applications is limited by the rate at which data can be delivered from the memory system into the processors. Graphics chips have been operating at approximately six times the memory bandwidth of contemporaneously available CPU chips. In late 2006, the GeForce 8800 GTX, or simply G80, was capable of moving data at about 85 gigabytes per second (GB/s) in and out of its main dynamic random-access memory (DRAM) because of graphics frame buffer requirements and its relaxed memory model (the way various system software, applications, and input/output (I/O) devices expect their memory accesses to work). The more recent GTX680 chip supports about 200 GB/s. In contrast, general-purpose processors have to satisfy requirements from legacy operating systems, applications, and I/O devices that make memory bandwidth more difficult to increase. As a result, CPUs will continue to be at a disadvantage in terms of memory bandwidth for some time.
The design philosophy of GPUs is shaped by the fast-growing video game industry that exerts tremendous economic pressure for the ability to perform a massive number of floating-point calculations per video frame in advanced games. This demand motivates GPU vendors to look for ways to maximize the chip area and power budget dedicated to floating-point calculations. The prevailing solution is to optimize for the execution throughput of massive numbers of threads. The design saves chip area and power by allowing pipelined memory channels and arithmetic operations to have long latency. The reduced area and power of the memory access hardware and arithmetic units allows the designers to have more of them on a chip and thus increase the total execution throughput.
The application software is expected to be written with a large number of parallel threads. The hardware takes advantage of the large number of threads to find work to do when some of them are waiting for long-latency memory accesses or arithmetic operations. Small cache memories are provided to help control the bandwidth requirements of these applications so that multiple threads that access the same memory data do not need to all go to the DRAM. This design style is commonly referred to as throughput-oriented design since it strives to maximize the total execution throughput of a large number of threads while allowing individual threads to take a potentially much longer time to execute.
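The reason the hardware needs "a large number of threads to find work to do" follows from a simple Little's-law estimate; the latency and issue-rate figures below are illustrative assumptions of ours, not numbers from the text:

```python
# Back-of-the-envelope Little's-law sketch of latency hiding in a
# throughput-oriented design. Both figures are assumed, illustrative values.
memory_latency_cycles = 400   # assumed DRAM access latency, in cycles
requests_per_cycle = 1        # assumed: one memory request issued per cycle

# Little's law: concurrency needed = throughput x latency. With each thread
# stalled for the full latency of its request, this many threads must be in
# flight to keep the memory pipeline fully utilized.
threads_needed = requests_per_cycle * memory_latency_cycles
print(f"~{threads_needed} threads in flight to keep the pipeline busy")
```

Under these assumed numbers, hundreds of ready threads are needed per memory channel, which is why the hardware is built to switch among thousands of lightweight threads rather than to make any single thread fast.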
The CPUs, on the other hand, are designed to minimize the execution latency of a single thread....




