| Introduction | p. 1 |
| Basic goals of the book | p. 1 |
| What do I get for one Watt today? | p. 1 |
| Main memory bottleneck | p. 3 |
| Optimize resource usage | p. 3 |
| Application design | p. 4 |
| Organization of the book | p. 4 |
| Historical aspects | p. 4 |
| Parameterization | p. 5 |
| Models | p. 5 |
| Core optimization | p. 6 |
| Node optimization | p. 6 |
| Cluster optimization | p. 6 |
| Grid-brokering to save energy | p. 7 |
| Historical highlights | p. 9 |
| Evolution of computing | p. 9 |
| The first computer companies | p. 14 |
| ERA, EMCC and Univac | p. 14 |
| Control Data Corporation, CDC | p. 14 |
| Cray Research | p. 15 |
| Thinking Machines Corporation | p. 16 |
| International Business Machines (IBM) | p. 17 |
| The ASCI effort | p. 18 |
| The Japanese efforts | p. 19 |
| The computer generations | p. 20 |
| The evolution in computing performance | p. 20 |
| Performance/price evolution | p. 22 |
| Evolution of basic software | p. 22 |
| Evolution of algorithmic complexity | p. 23 |
| The TOP500 list | p. 25 |
| Outlook with the TOP500 curves | p. 27 |
| The GREEN500 List | p. 28 |
| Proposal for a REAL500 list | p. 30 |
| Parameterization | p. 31 |
| Definitions | p. 31 |
| Parameterization of applications | p. 35 |
| Application parameter set | p. 35 |
| Parameterization of BLAS library routines | p. 36 |
| SMXV: Parameterization of sparse matrix*vector operation | p. 38 |
| Parameterization of a computational node Pᵢ ∈ rᵢ | p. 39 |
| Parameterization of the interconnection networks | p. 41 |
| Types of networks | p. 41 |
| Parameterization of clusters and networks | p. 42 |
| Parameters related to running applications | p. 44 |
| Conclusion | p. 47 |
| Models | p. 49 |
| The performance prediction model | p. 49 |
| The execution time evaluation model (ETEM) | p. 53 |
| A network performance model | p. 53 |
| The extended Γ - κ model | p. 55 |
| Validation of the models | p. 56 |
| Methodology | p. 56 |
| Example: The full matrix*matrix multiplication DGEMM | p. 57 |
| Example: Sparse matrix*vector multiplication SMXV | p. 59 |
| Core optimization | p. 63 |
| Some useful notions | p. 63 |
| Data hierarchy | p. 63 |
| Data representation | p. 64 |
| Floating point operations | p. 67 |
| Pipelining | p. 68 |
| Single core optimization | p. 70 |
| Single core architectures | p. 70 |
| Memory conflicts | p. 70 |
| Indirect addressing | p. 74 |
| Unrolling | p. 75 |
| Dependency | p. 76 |
| Inlining | p. 78 |
| If statement in a loop | p. 78 |
| Code porting aspects | p. 79 |
| How to develop application software | p. 83 |
| Application to plasma physics codes | p. 84 |
| Tokamaks and Stellarators | p. 84 |
| Optimization of VMEC | p. 88 |
| Optimization of TERPSICHORE | p. 91 |
| Conclusions for single core optimization | p. 94 |
| Node optimization | p. 95 |
| Shared memory computer architectures | p. 95 |
| SMP/NUMA architectures | p. 95 |
| The Cell | p. 99 |
| GPGPU for HPC | p. 100 |
| Node comparison and OpenMP | p. 105 |
| Race condition with OpenMP | p. 109 |
| Application optimization with OpenMP: the 3D Helmholtz solver | p. 110 |
| Fast Helmholtz solver for parallelepipedic geometries | p. 111 |
| NEC SX-5 reference benchmark | p. 113 |
| Single processor benchmarks | p. 114 |
| Parallelization with OpenMP | p. 115 |
| Parallelization with MPI | p. 115 |
| Conclusion | p. 119 |
| Application optimization with OpenMP: TERPSICHORE | p. 119 |
| Cluster optimization | p. 121 |
| Introduction on parallelization | p. 121 |
| Internode communication networks | p. 121 |
| Network architectures | p. 121 |
| Comparison between network architectures | p. 129 |
| Distributed memory parallel computer architectures | p. 131 |
| Integrated parallel computer architectures | p. 131 |
| Commodity cluster architectures | p. 134 |
| Energy consumption issues | p. 136 |
| The issue of resilience | p. 137 |
| Type of parallel applications | p. 138 |
| Embarrassingly parallel applications | p. 138 |
| Applications with point-to-point communications | p. 138 |
| Applications with multicast communication needs | p. 139 |
| Shared memory applications (OpenMP) | p. 139 |
| Component-based applications | p. 139 |
| Domain decomposition techniques | p. 139 |
| Test example: The Gyrotron | p. 140 |
| The geometry and the mesh | p. 142 |
| Connectivity conditions | p. 142 |
| Parallel matrix solver | p. 143 |
| The electrostatic precipitator | p. 145 |
| Scheduling of parallel applications | p. 146 |
| Static scheduling | p. 146 |
| Dynamic scheduling | p. 146 |
| SpecuLOOS | p. 147 |
| Introduction | p. 147 |
| Test case description | p. 147 |
| Complexity on one node | p. 149 |
| Wrong complexity on the Blue Gene/L | p. 150 |
| Fine results on the Blue Gene/L | p. 151 |
| Conclusions | p. 151 |
| TERPSICHORE | p. 153 |
| Parallelization of the LEMan code with MPI and OpenMP | p. 154 |
| Introduction | p. 154 |
| Parallelization | p. 154 |
| CPU time results | p. 156 |
| Conclusions | p. 159 |
| Grid-level Brokering to save energy | p. 161 |
| About Grid resource brokering | p. 161 |
| An Introduction to ïanos | p. 162 |
| Job Submission Scenario | p. 164 |
| The cost model | p. 165 |
| Mathematical formulation | p. 165 |
| CPU costs Ke | p. 167 |
| License fees Kl | p. 169 |
| Costs due to waiting time Kw | p. 169 |
| Energy costs Keco | p. 169 |
| Data transfer costs Kd | p. 171 |
| Example: The Pleiades clusters' CPU cost per hour | p. 171 |
| Different currencies in a Grid environment | p. 173 |
| The implementation | p. 173 |
| Architecture & Design | p. 174 |
| The Grid Adapter | p. 174 |
| The Meta Scheduling Service (MSS) | p. 175 |
| The Resource Broker | p. 176 |
| The System Information | p. 177 |
| The Data Warehouse | p. 177 |
| The Monitoring Service | p. 177 |
| The Monitoring Module VAMOS | p. 178 |
| Integration with UNICORE Grid System | p. 179 |
| Scheduling algorithm | p. 179 |
| User Interfaces to the ïanos framework | p. 181 |
| DVS-able processors | p. 182 |
| Power consumption of a CPU | p. 183 |
| An algorithm to save energy | p. 184 |
| First results with SMXV | p. 185 |
| A first implementation | p. 186 |
| Conclusions | p. 188 |
| Recommendations | p. 189 |
| Application oriented recommendations | p. 189 |
| Code development | p. 189 |
| Code validation | p. 189 |
| Porting codes | p. 190 |
| Optimizing parallelized applications | p. 190 |
| Race condition | p. 190 |
| Hardware and basic software aspects | p. 191 |
| Basic software | p. 191 |
| Choice of system software | p. 192 |
| Energy reduction | p. 192 |
| Processor frequency adaptation | p. 192 |
| Improved cooling | p. 193 |
| Choice of optimal resources | p. 193 |
| Best choice of new computer | p. 193 |
| Last but not least | p. 194 |
| Miscellaneous | p. 194 |
| Course material | p. 194 |
| A new REAL500 List | p. 194 |
| Glossary | p. 197 |
| References | p. 205 |
| About the authors | p. 213 |
| Index | p. 215 |