E-Book, Englisch, 377 Seiten
Naono / Teranishi / Cavazos Software Automatic Tuning
1. Auflage 2010
ISBN: 978-1-4419-6935-4
Verlag: Springer-Verlag
Format: PDF
Kopierschutz: Wasserzeichen (»Systemvoraussetzungen)
From Concepts to State-of-the-Art Results
E-Book, Englisch, 377 Seiten
ISBN: 978-1-4419-6935-4
Verlag: Springer-Verlag
Format: PDF
Kopierschutz: Wasserzeichen (»Systemvoraussetzungen)
Automatic Performance Tuning is a new software paradigm which enables software to be high performance in any computing environment. Its methodologies have been developed over the past decade, and it is now rapidly growing in terms of its scope and applicability, as well as in its scientific knowledge and technological methods. Software developers and researchers in the area of scientific and technical computing, high performance database systems, optimized compilers, high performance systems software, and low-power computing will find this book to be an invaluable reference to this powerful new paradigm.
Autoren/Hrsg.
Weitere Infos & Material
1;Software Automatic Tuning;3
1.1;Preface;5
1.2;Contents;7
1.3;Contributors;11
1.4;Part I Introduction;15
1.4.1;Chapter1 Software Automatic Tuning: Concepts and State-of-the-Art Results;16
1.4.1.1;1.1 Software Automatic Tuning: General Concepts;17
1.4.1.2;1.2 Software Automatic Tuning: State-of-the-Art;22
1.5;Part II Achievements in Scientific Computing;29
1.5.1;Chapter2 ATLAS Version 3.9: Overview and Status* ;30
1.5.1.1;2.1 Introduction;30
1.5.1.1.1;2.1.1 Basic AEOS Requirements;32
1.5.1.1.2;2.1.2 Methods of Software Adaptation;33
1.5.1.2;2.2 ATLAS Overview;34
1.5.1.2.1;2.2.1 Explicit LAPACK Support In ATLAS;35
1.5.1.2.2;2.2.2 Dense Level 3 BLAS Support in ATLAS;36
1.5.1.2.2.1;2.2.2.1 Using Assembly in ATLAS;38
1.5.1.2.3;2.2.3 Dense Level 2 BLAS Support in ATLAS;39
1.5.1.2.4;2.2.4 Dense Level 1 BLAS Support in ATLAS;39
1.5.1.3;2.3 Status;40
1.5.1.4;References;40
1.5.2;Chapter3 Autotuning Method for Deciding Block Size Parameters in Dynamically Load-Balanced BLAS;44
1.5.2.1;3.1 Introduction;44
1.5.2.2;3.2 Background;45
1.5.2.2.1;3.2.1 BLAS;46
1.5.2.2.2;3.2.2 DL-BLAS;46
1.5.2.2.3;3.2.3 Motivation of the present Research;48
1.5.2.3;3.3 Proposed Methods;49
1.5.2.3.1;3.3.1 Parallel Efficiency Analysis;49
1.5.2.3.2;3.3.2 Diagonal Search;51
1.5.2.3.3;3.3.3 Reductive Search;51
1.5.2.3.4;3.3.4 Parameter Selection;53
1.5.2.4;3.4 Experiments;53
1.5.2.4.1;3.4.1 Environments;53
1.5.2.4.2;3.4.2 Performance of Diagonal Search;54
1.5.2.4.3;3.4.3 Analysis and Discussion of Diagonal Search;55
1.5.2.4.4;3.4.4 Performance Evaluation Tests;56
1.5.2.5;3.5 Conclusion;57
1.5.2.6;References;59
1.5.3;Chapter4 Automatic Tuning for Parallel FFTs;60
1.5.3.1;4.1 Introduction;60
1.5.3.2;4.2 Parallel One-Dimensional FFT;61
1.5.3.2.1;4.2.1 A Block Nine-Step FFT Algorithm;61
1.5.3.2.2;4.2.2 A Parallel One-Dimensional FFT Algorithm Based on the Block Nine-Step FFT;63
1.5.3.3;4.3 Parallel Multi-Dimensional FFT;65
1.5.3.3.1;4.3.1 Parallel Two-Dimensional FFT;65
1.5.3.3.2;4.3.2 Parallel Three-Dimensional FFT;66
1.5.3.4;4.4 Automatic Tuning of Parallel FFTs;68
1.5.3.4.1;4.4.1 Automatic Tuning of All-to-All Communication;68
1.5.3.4.2;4.4.2 Selection of the Radices;69
1.5.3.4.3;4.4.3 Selection of the Block Size;69
1.5.3.5;4.5 Performance Results;70
1.5.3.5.1;4.5.1 Performance of All-to-All Communication;72
1.5.3.5.2;4.5.2 Performance of One-Dimensional FFT;72
1.5.3.5.3;4.5.3 Performance of Multi-Dimensional FFTs;75
1.5.3.6;4.6 Conclusion;76
1.5.3.7;References;78
1.5.4;Chapter5 Dynamic Programming Approaches to Optimizing the Blocking Strategy for Basic Matrix Decompositions;79
1.5.4.1;5.1 Introduction;79
1.5.4.2;5.2 Basics of the Householder QR Algorithm;81
1.5.4.2.1;5.2.1 The Householder QR Algorithm;81
1.5.4.2.2;5.2.2 Blocking Strategies for the Householder QR Algorithm;81
1.5.4.2.2.1;5.2.2.1 Fixed-size Blocking;82
1.5.4.2.2.2;5.2.2.2 Recursive Blocking;82
1.5.4.2.3;5.2.3 Another Blocking Strategy: The TSQR Algorithm;83
1.5.4.3;5.3 Dynamic Programming Approaches to Optimizing the Blocking Strategy;84
1.5.4.3.1;5.3.1 Limitations of Conventional Blocking Strategies;84
1.5.4.3.2;5.3.2 Approaches for the Sequential Case;85
1.5.4.3.2.1;5.3.2.1 Variable-size Blocking;85
1.5.4.3.2.2;5.3.2.2 Generalized Recursive Blocking;87
1.5.4.3.3;5.3.3 Approaches for the Parallel Case;89
1.5.4.3.3.1;5.3.3.1 Variable-size Blocking for Distributed Memory Machines;89
1.5.4.3.3.2;5.3.3.2 Combination of Variable-size Blocking and TSQR;89
1.5.4.4;5.4 Future Directions;92
1.5.4.4.1;5.4.1 Use of Task Level Parallelism;92
1.5.4.4.2;5.4.2 Extension to Heterogeneous Architectures;92
1.5.4.4.3;5.4.3 Online Refinement of the Performance Models and the Blocking Strategy;92
1.5.4.4.4;5.4.4 Application to Other Linear Algebra Algorithms;93
1.5.4.5;5.5 Conclusion;93
1.5.4.6;References;93
1.5.5;Chapter6 Automatic Tuning of the Division Number in the Multiple Division Divide-and-Conquer for Real Symmetric Eigenproblem;96
1.5.5.1;6.1 Introduction;96
1.5.5.2;6.2 DCk Algorithm and Deflation;97
1.5.5.3;6.3 Deflation Occurrence Rate (DOR);99
1.5.5.4;6.4 Proposal of Algorithm to Estimate the Optimal Division Number;102
1.5.5.5;6.5 Numerical Experiment;103
1.5.5.6;6.6 Summary;109
1.5.5.7;References;109
1.5.6;Chapter7 Automatically Tuned Mixed-Precision Conjugate Gradient Solver;111
1.5.6.1;7.1 Introduction;111
1.5.6.2;7.2 Iterative Refinement ;113
1.5.6.3;7.3 Choosing the Target Residual Reduction for the Inner Solver ;115
1.5.6.4;7.4 Stopping the Iterative Refinement ;119
1.5.6.4.1;7.4.1 Stopping Based on Threshold;119
1.5.6.4.2;7.4.2 Stopping Based on Condition Number Estimation;121
1.5.6.4.3;7.4.3 Combined Stopping Method;122
1.5.6.5;7.5 Condition Number Estimation ;123
1.5.6.6;7.6 Conclusions ;126
1.5.6.7;References;127
1.5.7;Chapter8 Automatically Tuned Sparse Eigensolvers;128
1.5.7.1;8.1 Introduction;128
1.5.7.2;8.2 Background: AT Research Trends;129
1.5.7.2.1;8.2.1 Definition of AT;129
1.5.7.2.2;8.2.2 Six Studies on AT;130
1.5.7.2.3;8.2.3 AT Research Trends;131
1.5.7.3;8.3 Proposal of AT-Restarted-Lanczos;133
1.5.7.3.1;8.3.1 Restarted-Lanczos and Its Problem;133
1.5.7.3.2;8.3.2 Preliminary Experiment;133
1.5.7.3.3;8.3.3 Proposal of the Run-Time Automatic Tuning Algorithm;135
1.5.7.3.4;8.3.4 Evaluations;137
1.5.7.4;8.4 Conclusions;139
1.5.7.5;References;139
1.5.8;Chapter9 Systematic Performance Evaluation of Linear Solvers Using Quality Control Techniques;141
1.5.8.1;9.1 Introduction;141
1.5.8.2;9.2 Systematic Performance Evaluation and Systematic Analyses for Numerical Computation Algorithms;142
1.5.8.2.1;9.2.1 Systematic Performance Evaluation of Solution Algorithms;142
1.5.8.2.2;9.2.2 Construction of the Performance Information DB on the Solution Algorithms;143
1.5.8.3;9.3 Quality Control in the Solution Process;145
1.5.8.3.1;9.3.1 Quality Control Techniques;145
1.5.8.3.2;9.3.2 Systematic Performance Evaluation Using Survey Charts;148
1.5.8.4;9.4 Analysis of Causes of Poor Solution Quality;152
1.5.8.4.1;9.4.1 Preconditioning of Solution Algorithm for Linear Equations;154
1.5.8.4.2;9.4.2 Residual Vector and Convergence Criterion for Each Preconditioning System;155
1.5.8.5;9.5 Concluding Remarks;156
1.5.8.6;References;158
1.5.9;Chapter 10 Application of Alternating Decision Trees in Selecting Sparse Linear Solvers*;159
1.5.9.1;10.1 Introduction;159
1.5.9.2;10.2 Machine Learning Methods;161
1.5.9.2.1;10.2.1 Adaboost;162
1.5.9.2.2;10.2.2 Alternating Decision Trees;163
1.5.9.3;10.3 Solver Selection as a Classification Problem;164
1.5.9.3.1;10.3.1 Feature Selection and Computation;166
1.5.9.4;10.4 Applying Machine Learning;167
1.5.9.5;10.5 Experimental Results;168
1.5.9.5.1;10.5.1 Experiment 1: Classification on Linear Systems Generated from PDE-based Simulation Code;169
1.5.9.5.2;10.5.2 Experiment 2: Classification on Only Coefficient Matrices;171
1.5.9.6;10.6 Conclusions and Future Plans;175
1.5.9.7;References;177
1.5.10;Chapter11 Toward Automatic Performance Tuning for Numerical Simulations in the SILC Matrix Computation Framework;180
1.5.10.1;11.1 Introduction;180
1.5.10.2;11.2 The SILC Matrix Computation Framework;182
1.5.10.3;11.3 Performance Modeling;184
1.5.10.4;11.4 Experiments;185
1.5.10.4.1;11.4.1 Cloth Simulation;186
1.5.10.4.2;11.4.2 Fluid Simulation;189
1.5.10.4.3;11.4.3 Simple Initial Value Problem;190
1.5.10.5;11.5 Automatic Performance Tuning;193
1.5.10.5.1;11.5.1 Autotuning Mechanism for Numerical Simulations in SILC;193
1.5.10.5.2;11.5.2 Related Work and Discussions;195
1.5.10.6;11.6 Concluding Remarks;195
1.5.10.7;References;196
1.5.11;Chapter12 Exploring Tuning Strategies for Quantum Chemistry Computations;198
1.5.11.1;12.1 Introduction;198
1.5.11.2;12.2 Related Works;200
1.5.11.3;12.3 Methodology;200
1.5.11.4;12.4 Performance Results and Observations;203
1.5.11.5;12.5 Tuning Strategy;208
1.5.11.6;12.6 Conclusions and Future Work;211
1.5.11.7;References;211
1.5.12;Chapter13 Automatic Tuning of CUDA Execution Parameters for Stencil Processing;214
1.5.12.1;13.1 Introduction;214
1.5.12.2;13.2 Related Work;216
1.5.12.3;13.3 Performance Tuning for CUDA;217
1.5.12.4;13.4 Automatic Performance Tuning of CUDA Execution Parameters;219
1.5.12.4.1;13.4.1 Parameter Exploration Space;220
1.5.12.4.2;13.4.2 Auto-Tuning Mechanism;223
1.5.12.5;13.5 Performance Evaluation;225
1.5.12.5.1;13.5.1 Automatic Tuning on Himeno Benchmark;225
1.5.12.5.2;13.5.2 Automatic Tuning on LU Decomposition;229
1.5.12.6;13.6 Conclusions;232
1.5.12.7;References;232
1.5.13;Chapter14 Static Task Cluster Size Determination in Homogeneous Distributed Systems;234
1.5.13.1;14.1 Introduction;234
1.5.13.2;14.2 Problem Definition and Assumptions;236
1.5.13.2.1;14.2.1 Assumed Model;236
1.5.13.2.2;14.2.2 Problems in Conventional Cluster Merging;237
1.5.13.2.3;14.2.3 Requirements for Reducing Both the Schedule Length and the Number of Processors;238
1.5.13.3;14.3 Definition of an Indicative Value for Schedule Length;239
1.5.13.3.1;14.3.1 Abstract of Worst Schedule Length;239
1.5.13.3.2;14.3.2 WSL Definition;239
1.5.13.4;14.4 Derivation of Cluster Size for Minimizing WSL;240
1.5.13.4.1;14.4.1 An Upper Bound of WSL;240
1.5.13.4.2;14.4.2 Derivation of dopt;246
1.5.13.4.3;14.4.3 Relationship Between WSLs and Schedule Length;247
1.5.13.4.4;14.4.4 Requirements for Minimizing WSL in a Task Clustering;248
1.5.13.5;14.5 Task Clustering;248
1.5.13.5.1;14.5.1 Overview of the Algorithm;248
1.5.13.5.2;14.5.2 Cluster Selection Policies;249
1.5.13.6;14.6 Experimental Evaluation;251
1.5.13.6.1;14.6.1 Comparison Targets;251
1.5.13.6.2;14.6.2 Simulation Setup;251
1.5.13.6.3;14.6.3 Comparison of WSLS and Schedule Length in the Fixed Number of Processors;252
1.5.13.6.4;14.6.4 Processor Utilization;253
1.5.13.6.5;14.6.5 Comparison of WSLS and Schedule Length of Real Application DAGs;254
1.5.13.6.6;14.6.6 Discussion;255
1.5.13.7;14.7 Conclusion and Future Works;256
1.5.13.8;References;256
1.6;Part III Evolution to a General Paradigm;258
1.6.1;Chapter15 Algorithmic Parameter Optimization of the DFO Method with the OPAL Framework;259
1.6.1.1;15.1 Introduction;259
1.6.1.2;15.2 Optimization of Algorithmic Parameters;262
1.6.1.2.1;15.2.1 Black Box Construction;262
1.6.1.2.2;15.2.2 Direct Search Algorithms;264
1.6.1.3;15.3 The OPAL Package;265
1.6.1.3.1;15.3.1 The OPAL Structure;266
1.6.1.3.2;15.3.2 Usage of OPAL;266
1.6.1.4;15.4 Application to Derivative-Free Optimization;268
1.6.1.4.1;15.4.1 General Description of DFO;268
1.6.1.4.2;15.4.2 Two DFO Parameter Optimization Problems;268
1.6.1.4.3;15.4.3 Numerical Results;272
1.6.1.5;15.5 Discussion;276
1.6.1.6;References;277
1.6.2;Chapter16 A Bayesian Method of Online Automatic Tuning;279
1.6.2.1;16.1 Introduction;279
1.6.2.2;16.2 Automatic Tuning Concepts;280
1.6.2.3;16.3 Review of the Proposed Methods;281
1.6.2.3.1;16.3.1 Online Automatic Tuning;282
1.6.2.3.2;16.3.2 Tuning Based on Cost Function Modeling;283
1.6.2.3.3;16.3.3 Bayesian Analysis to Treat Uncertainties;284
1.6.2.3.4;16.3.4 Bayesian Suboptimal Sequential Experimental Design;286
1.6.2.3.5;16.3.5 Desirable Properties of Mathematical Methods for Automatic Tuning;288
1.6.2.3.6;16.3.6 Infinite Dilution;291
1.6.2.3.7;16.3.7 Pseudocode of the Proposed Method;292
1.6.2.4;16.4 Evaluation;293
1.6.2.4.1;16.4.1 Infinite Dilution with Random Sampling;293
1.6.2.4.2;16.4.2 Performance Evaluation;293
1.6.2.5;16.5 Conclusion;296
1.6.2.6;References;297
1.6.3;Chapter17 ABCLibScript: A Computer Language for Automatic Performance Tuning;298
1.6.3.1;17.1 Introduction;298
1.6.3.2;17.2 The FIBER Framework;299
1.6.3.2.1;17.2.1 FIBER Framework Basics;299
1.6.3.2.2;17.2.2 The Two Users and Auto-tuning Software Construction Assumption;299
1.6.3.2.2.1;17.2.2.1 Software Developer Phase;300
1.6.3.2.2.2;17.2.2.2 Software User (End-User) Phase;301
1.6.3.3;17.3 The ABCLibScript;301
1.6.3.4;17.4 Examples of ABCLibScript Descriptions;304
1.6.3.4.1;17.4.1 Software Developer API;304
1.6.3.4.2;17.4.2 End-User API: Before Execute-Time Auto-tuning;305
1.6.3.4.3;17.4.3 Other Programming Examples for Software Developers Using ABCLibScript;306
1.6.3.4.3.1;17.4.3.1 Install-Time Auto-tuning;306
1.6.3.4.3.2;17.4.3.2 Before Execute-Time Auto-tuning;308
1.6.3.4.3.3;17.4.3.3 Run-Time Auto-tuning;309
1.6.3.5;17.5 Future Direction of ABCLibScript;310
1.6.3.5.1;17.5.1 Evaluation of Compiler Optimization and Hardware Performance;310
1.6.3.5.2;17.5.2 Adapting ABCLibScript to Embedded Systems;312
1.6.3.5.3;17.5.3 Adapting ABCLibScript to Low-Power Optimization;314
1.6.3.6;17.6 Concluding Remarks;315
1.6.3.7;References;316
1.6.4;Chapter18 Automatically Tuning Task-Based Programs for Multicore Processors;317
1.6.4.1;18.1 Introduction;317
1.6.4.2;18.2 Example;318
1.6.4.2.1;18.2.1 Task Extensions;319
1.6.4.2.2;18.2.2 Execution;319
1.6.4.2.3;18.2.3 Scheduling for Multicore Processor;320
1.6.4.3;18.3 Implementation Synthesis;322
1.6.4.3.1;18.3.1 Dependence Analysis;323
1.6.4.3.2;18.3.2 Profiling;323
1.6.4.3.3;18.3.3 Candidate Implementation Synthesis;323
1.6.4.3.4;18.3.4 Simulation Stage;325
1.6.4.3.5;18.3.5 Directed-Simulated Annealing;326
1.6.4.3.5.1;18.3.5.1 Generation of Optimized Implementation;327
1.6.4.3.6;18.3.6 Dynamic Scheduling;328
1.6.4.4;18.4 Evaluation;328
1.6.4.4.1;18.4.1 Benchmarks;329
1.6.4.4.2;18.4.2 Accuracy of Scheduling Simulator;330
1.6.4.4.3;18.4.3 Optimality of Directed Simulated Annealing;331
1.6.4.4.4;18.4.4 Generality of Synthesized Implementation;332
1.6.4.4.5;18.4.5 Overhead of Bamboo;333
1.6.4.5;18.5 Related Work;334
1.6.4.6;18.6 Conclusion;335
1.6.4.7;References;335
1.6.5;Chapter19 Efficient Program Compilation Through Machine Learning Techniques;337
1.6.5.1;19.1 Introduction;337
1.6.5.1.1;19.1.1 Motivation;338
1.6.5.2;19.2 Design and Implementation;339
1.6.5.2.1;19.2.1 Preparing for Data Collection;339
1.6.5.2.2;19.2.2 Gathering Training Data;342
1.6.5.2.3;19.2.3 Learning Process;344
1.6.5.2.4;19.2.4 Deployment Process;345
1.6.5.3;19.3 Experimental Setup;346
1.6.5.4;19.4 Results and Discussion;346
1.6.5.4.1;19.4.1 Quality of Training Data;347
1.6.5.4.2;19.4.2 Classifier Evaluation;348
1.6.5.4.3;19.4.3 Other Benchmarks;349
1.6.5.5;19.5 Related Work;350
1.6.5.6;19.6 Conclusions and Future Work;352
1.6.5.7;19.7 Acknowledgments and Trademarks;352
1.6.5.8;References;352
1.6.6;Chapter20 Autotuning and Specialization: Speeding up Matrix Multiply for Small Matrices with Compiler Technology;354
1.6.6.1;20.1 Introduction;354
1.6.6.2;20.2 Compiler Technology: Autotuning and Specialization;356
1.6.6.2.1;20.2.1 Optimization Strategy;358
1.6.6.2.2;20.2.2 Specialization and Transformation Using CHiLL;359
1.6.6.2.3;20.2.3 Autotuning: Heuristics for Pruning the Search Space;361
1.6.6.2.4;20.2.4 Building the Library;362
1.6.6.3;20.3 Experiments;363
1.6.6.3.1;20.3.1 Experimental Environment;364
1.6.6.3.2;20.3.2 Generating the Library;364
1.6.6.3.3;20.3.3 Evaluating the Pruning Heuristics;365
1.6.6.3.4;20.3.4 Performance Results;367
1.6.6.4;20.4 Related Work;369
1.6.6.5;20.5 Conclusion;370
1.6.6.6;References;370
1.7;Index;372




