Farber | CUDA Application Design and Development | E-Book | www.sack.de
E-book, English, 336 pages

Farber CUDA Application Design and Development


1st edition, 2011
ISBN: 978-0-12-388432-9
Publisher: Elsevier Science & Techn.
Format: EPUB
Copy protection: Adobe DRM




As the computer industry retools to leverage massively parallel graphics processing units (GPUs), this book is designed to meet the needs of working software developers who need to understand GPU programming with CUDA and increase efficiency in their projects. CUDA Application Design and Development starts with an introduction to parallel computing concepts for readers with no previous parallel experience, and focuses on issues of immediate importance to working software developers: achieving high performance, maintaining competitiveness, analyzing CUDA benefits versus costs, and determining application lifespan. The book then details the thought behind CUDA and teaches how to create, analyze, and debug CUDA applications. Throughout, the focus is on software engineering issues: how to use CUDA in the context of existing application code, with existing compilers, languages, software tools, and industry-standard API libraries. Using an approach refined in a series of well-received articles in Dr. Dobb's Journal, author Rob Farber takes the reader step by step from fundamentals to implementation, moving from language theory to practical coding.

- Includes multiple examples building from simple to more complex applications in four key areas: machine learning, visualization, vision recognition, and mobile computing
- Addresses the foundational issues for CUDA development: multithreaded programming and the GPU memory hierarchy
- Includes teaching chapters designed to give a full understanding of CUDA tools, techniques, and structure
- Presents CUDA techniques in the context of the hardware they are implemented on, as well as other styles of programming that will help readers bridge into the new material
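One recurring theme in the description above — hybrid CPU/GPU execution and analyzing CUDA benefits versus costs — rests on Amdahl's Law, which the book covers in Chapter 1. As a minimal illustrative sketch (not taken from the book; the function name and the 448-core figure are merely examples), the speedup bound it imposes can be computed as:

```python
def amdahl_speedup(parallel_fraction: float, n: float) -> float:
    """Upper bound on overall speedup (Amdahl's Law) when a fraction
    of the work is accelerated by a factor of n (e.g. GPU parallelism)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n)

# Even a 448-core accelerator cannot rescue code that is 10% serial:
# the overall speedup stays capped near 1 / 0.1 = 10x.
print(round(amdahl_speedup(0.90, 448), 2))
```

This is why the book's "Three Rules of GPGPU Programming" stress keeping data on the GPU and giving it enough work: the serial (and data-transfer) fraction, not raw core count, dominates achievable speedup.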

Rob Farber has served as a scientist in Europe at the Irish Centre for High-End Computing as well as at U.S. national laboratories in Los Alamos, Berkeley, and the Pacific Northwest. He has also been on the external faculty of the Santa Fe Institute, a consultant to Fortune 100 companies, and co-founder of two computational startups that achieved liquidity events. He is the author of 'CUDA Application Design and Development' as well as numerous articles and tutorials that have appeared in Dr. Dobb's Journal, Scientific Computing, The Code Project, and elsewhere.

Further Information & Material


Front Cover (p. 1)
CUDA Application Design and Development (p. 4)
Copyright (p. 5)
Dedication (p. 6)
Table of Contents (p. 8)
Foreword (p. 12)
Preface (p. 14)
1. First Programs and How to Think in CUDA (p. 20)
  Source Code and Wiki (p. 21)
  Distinguishing CUDA from Conventional Programming with a Simple Example (p. 21)
  Choosing a CUDA API (p. 24)
  Some Basic CUDA Concepts (p. 27)
  Understanding Our First Runtime Kernel (p. 30)
  Three Rules of GPGPU Programming (p. 32)
    Rule 1: Get the Data on the GPU and Keep It There (p. 32)
    Rule 2: Give the GPGPU Enough Work to Do (p. 33)
    Rule 3: Focus on Data Reuse within the GPGPU to Avoid Memory Bandwidth Limitations (p. 33)
  Big-O Considerations and Data Transfers (p. 34)
  CUDA and Amdahl's Law (p. 36)
  Data and Task Parallelism (p. 37)
  Hybrid Execution: Using Both CPU and GPU Resources (p. 38)
  Regression Testing and Accuracy (p. 40)
  Silent Errors (p. 41)
  Introduction to Debugging (p. 42)
  UNIX Debugging (p. 43)
    NVIDIA's cuda-gdb Debugger (p. 43)
    The CUDA Memory Checker (p. 45)
    Use cuda-gdb with the UNIX ddd Interface (p. 46)
  Windows Debugging with Parallel Nsight (p. 48)
  Summary (p. 49)
2. CUDA for Machine Learning and Optimization (p. 52)
  Modeling and Simulation (p. 53)
    Fitting Parameterized Models (p. 54)
    Nelder-Mead Method (p. 55)
    Levenberg-Marquardt Method (p. 55)
    Algorithmic Speedups (p. 56)
  Machine Learning and Neural Networks (p. 57)
  XOR: An Important Nonlinear Machine-Learning Problem (p. 58)
    An Example Objective Function (p. 60)
    A Complete Functor for Multiple GPU Devices and the Host Processors (p. 61)
    Brief Discussion of a Complete Nelder-Mead Optimization Code (p. 63)
  Performance Results on XOR (p. 72)
  Performance Discussion (p. 72)
  Summary (p. 75)
  The C++ Nelder-Mead Template (p. 76)
3. The CUDA Tool Suite: Profiling a PCA/NLPCA Functor (p. 82)
  PCA and NLPCA (p. 83)
    Autoencoders (p. 84)
      An Example Functor for PCA Analysis (p. 85)
      An Example Functor for NLPCA Analysis (p. 87)
  Obtaining Basic Profile Information (p. 90)
  Gprof: A Common UNIX Profiler (p. 92)
  The NVIDIA Visual Profiler: Computeprof (p. 93)
  Parallel Nsight for Microsoft Visual Studio (p. 96)
    The Nsight Timeline Analysis (p. 96)
    The NVTX Tracing Library (p. 98)
    Scaling Behavior of the CUDA API (p. 99)
  Tuning and Analysis Utilities (TAU) (p. 101)
  Summary (p. 102)
4. The CUDA Execution Model (p. 104)
  GPU Architecture Overview (p. 105)
    Thread Scheduling: Orchestrating Performance and Parallelism via the Execution Configuration (p. 106)
    Relevant computeprof Values for a Warp (p. 109)
    Warp Divergence (p. 109)
    Guidelines for Warp Divergence (p. 110)
    Relevant computeprof Values for Warp Divergence (p. 111)
  Warp Scheduling and TLP (p. 111)
    Relevant computeprof Values for Occupancy (p. 113)
  ILP: Higher Performance at Lower Occupancy (p. 113)
    ILP Hides Arithmetic Latency (p. 114)
    ILP Hides Data Latency (p. 117)
    ILP in the Future (p. 117)
    Relevant computeprof Values for Instruction Rates (p. 119)
  Little's Law (p. 119)
  CUDA Tools to Identify Limiting Factors (p. 121)
    The nvcc Compiler (p. 122)
    Launch Bounds (p. 123)
    The Disassembler (p. 124)
    PTX Kernels (p. 125)
    GPU Emulators (p. 126)
  Summary (p. 127)
5. CUDA Memory (p. 128)
  The CUDA Memory Hierarchy (p. 128)
  GPU Memory (p. 130)
  L2 Cache (p. 131)
    Relevant computeprof Values for the L2 Cache (p. 132)
  L1 Cache (p. 133)
    Relevant computeprof Values for the L1 Cache (p. 134)
  CUDA Memory Types (p. 135)
    Registers (p. 135)
    Local Memory (p. 135)
    Relevant computeprof Values for Local Memory Cache (p. 136)
    Shared Memory (p. 136)
    Relevant computeprof Values for Shared Memory (p. 139)
    Constant Memory (p. 139)
    Texture Memory (p. 140)
    Relevant computeprof Values for Texture Memory (p. 143)
  Global Memory (p. 143)
    Common Coalescing Use Cases (p. 145)
    Allocation of Global Memory (p. 146)
    Limiting Factors in the Design of Global Memory (p. 147)
    Relevant computeprof Values for Global Memory (p. 149)
  Summary (p. 150)
6. Efficiently Using GPU Memory (p. 152)
  Reduction (p. 153)
    The Reduction Template (p. 153)
    A Test Program for functionReduce.h (p. 159)
    Results (p. 163)
  Utilizing Irregular Data Structures (p. 165)
  Sparse Matrices and the CUSP Library (p. 168)
  Graph Algorithms (p. 170)
  SoA, AoS, and Other Structures (p. 173)
  Tiles and Stencils (p. 173)
  Summary (p. 174)
7. Techniques to Increase Parallelism (p. 176)
  CUDA Contexts Extend Parallelism (p. 177)
  Streams and Contexts (p. 178)
    Multiple GPUs (p. 178)
    Explicit Synchronization (p. 179)
    Implicit Synchronization (p. 180)
    The Unified Virtual Address Space (p. 181)
    A Simple Example (p. 181)
    Profiling Results (p. 184)
  Out-of-Order Execution with Multiple Streams (p. 185)
    Tip for Concurrent Kernel Execution on the Same GPU (p. 188)
    Atomic Operations for Implicitly Concurrent Kernels (p. 188)
  Tying Data to Computation (p. 191)
    Manually Partitioning Data (p. 191)
    Mapped Memory (p. 192)
    How Mapped Memory Works (p. 194)
  Summary (p. 195)
8. CUDA for All GPU and CPU Applications (p. 198)
  Pathways from CUDA to Multiple Hardware Backends (p. 199)
    The PGI CUDA x86 Compiler (p. 200)
    The PGI CUDA x86 Compiler (p. 202)
      An x86 Core as an SM (p. 204)
    The NVIDIA NVCC Compiler (p. 205)
    Ocelot (p. 206)
    Swan (p. 207)
    MCUDA (p. 207)
  Accessing CUDA from Other Languages (p. 207)
    SWIG (p. 208)
    Copperhead (p. 208)
    EXCEL (p. 209)
    MATLAB (p. 209)
  Libraries (p. 210)
    CUBLAS (p. 210)
    CUFFT (p. 210)
    MAGMA (p. 221)
    phiGEMM Library (p. 222)
    CURAND (p. 222)
  Summary (p. 224)
9. Mixing CUDA and Rendering (p. 226)
  OpenGL (p. 227)
    GLUT (p. 227)
    Mapping GPU Memory with OpenGL (p. 228)
    Using Primitive Restart for 3D Performance (p. 229)
  Introduction to the Files in the Framework (p. 232)
    The Demo and Perlin Example Kernels (p. 232)
      The Demo Kernel (p. 233)
      The Demo Kernel to Generate a Colored Sinusoidal Surface (p. 233)
      Perlin Noise (p. 236)
      Using the Perlin Noise Kernel to Generate Artificial Terrain (p. 238)
    The simpleGLmain.cpp File (p. 243)
    The simpleVBO.cpp File (p. 247)
    The callbacksVBO.cpp File (p. 252)
  Summary (p. 257)
10. CUDA in a Cloud and Cluster Environments (p. 260)
  The Message Passing Interface (MPI) (p. 261)
    The MPI Programming Model (p. 261)
    The MPI Communicator (p. 262)
    MPI Rank (p. 262)
    Master-Slave (p. 264)
    Point-to-Point Basics (p. 264)
  How MPI Communicates (p. 265)
  Bandwidth (p. 267)
  Balance Ratios (p. 268)
  Considerations for Large MPI Runs (p. 271)
    Scalability of the Initial Data Load (p. 271)
    Using MPI to Perform a Calculation (p. 272)
    Check Scalability (p. 273)
  Cloud Computing (p. 274)
  A Code Example (p. 275)
    Data Generation (p. 275)
  Summary (p. 283)
11. CUDA for Real Problems (p. 284)
  Working with High-Dimensional Data (p. 285)
    PCA/NLPCA (p. 286)
    Multidimensional Scaling (p. 286)
    K-Means Clustering (p. 287)
    Expectation-Maximization (p. 287)
    Support Vector Machines (p. 288)
    Bayesian Networks (p. 288)
    Mutual Information (p. 289)
  Force-Directed Graphs (p. 290)
  Monte Carlo Methods (p. 291)
  Molecular Modeling (p. 292)
  Quantum Chemistry (p. 292)
  Interactive Workflows (p. 293)
  A Plethora of Projects (p. 293)
  Summary (p. 294)
12. Application Focus on Live Streaming Video (p. 296)
  Topics in Machine Vision (p. 297)
    3D Effects (p. 298)
    Segmentation of Flesh-Colored Regions (p. 298)
    Edge Detection (p. 299)
  FFmpeg (p. 300)
  TCP Server (p. 302)
  Live Stream Application (p. 306)
    kernelWave(): An Animated Kernel (p. 306)
    kernelFlat(): Render the Image on a Flat Surface (p. 307)
    kernelSkin(): Keep Only Flesh-Colored Regions (p. 307)
    kernelSobel(): A Simple Sobel Edge Detection Filter (p. 308)
    The launch_kernel() Method (p. 309)
  The simpleVBO.cpp File (p. 310)
  The callbacksVBO.cpp File (p. 310)
  Building and Running the Code (p. 314)
  The Future (p. 314)
    Machine Learning (p. 314)
    The Connectome (p. 315)
  Summary (p. 316)
  Listing for simpleVBO.cpp (p. 316)
Works Cited (p. 322)
Index (p. 330)


