| Foreword | p. v |
| Preface | p. xv |
| Acknowledgments | p. xix |
| Introduction | p. 1 |
| Technology Trends | p. 2 |
| Sub-Micron Technology | p. 2 |
| Implications of Sub-Micron Technology | p. 2 |
| Instruction-Level Parallelism (ILP) | p. 3 |
| Extracting ILP by Software | p. 5 |
| Extracting ILP by Hardware | p. 9 |
| Thread-Level Parallelism (TLP) | p. 12 |
| Speculative TLP | p. 13 |
| Challenges for TLP Processing | p. 14 |
| The Multiscalar Paradigm | p. 15 |
| The Multiscalar Story | p. 16 |
| Developing the Idea | p. 16 |
| Multi-block based Threads and the ARB | p. 17 |
| Maturing of the Ideas | p. 18 |
| Other Speculative Multithreading Models | p. 19 |
| The Rest of the Story | p. 20 |
| The Multiscalar Paradigm | p. 25 |
| Ideal TLP Processing Paradigm--The Goal | p. 26 |
| Multiscalar Paradigm--The Basic Idea | p. 27 |
| Multiscalar Execution Example | p. 29 |
| Control Dependences | p. 30 |
| Register Data Dependences | p. 31 |
| Memory Data Dependences | p. 32 |
| Interesting Aspects of the Multiscalar Paradigm | p. 32 |
| Comparison with Other Processing Paradigms | p. 35 |
| Multiprocessing Paradigm | p. 35 |
| Superscalar Paradigm | p. 36 |
| VLIW Paradigm | p. 38 |
| The Multiscalar Processor | p. 38 |
| Summary | p. 40 |
| Multiscalar Threads--Static Aspects | p. 43 |
| Structural Aspects of Multiscalar Threads | p. 43 |
| Definition | p. 43 |
| Thread Spawning Model | p. 44 |
| Thread Flow Graph | p. 46 |
| Thread Granularity | p. 48 |
| Thread Size Variance | p. 49 |
| Thread Shape | p. 50 |
| Thread Entry Points | p. 52 |
| Thread Exit Points | p. 54 |
| Data Flow Aspects of Multiscalar Threads | p. 55 |
| Shared Name Spaces | p. 55 |
| Inter-Thread Data Dependence | p. 55 |
| Program Partitioning | p. 57 |
| Compiler-based Partitioning | p. 58 |
| Hardware-based Partitioning | p. 59 |
| Static Thread Descriptor | p. 59 |
| Nature of Information | p. 59 |
| Compatibility Issues and Binary Representation | p. 61 |
| Concluding Remarks | p. 62 |
| Multiscalar Threads--Dynamic Aspects | p. 65 |
| Multiscalar Microarchitecture | p. 65 |
| Circular Queue Organization of Processing Units | p. 66 |
| PU Interconnect | p. 68 |
| Thread Processing Phases | p. 69 |
| Spawn: Inter-Thread Control Prediction | p. 69 |
| Activate | p. 69 |
| Execute | p. 70 |
| Resolve | p. 70 |
| Commit | p. 70 |
| Squash | p. 71 |
| Thread Assignment Policies | p. 71 |
| Number of Threads in a PU | p. 71 |
| Thread-PU Mapping Policy | p. 72 |
| Thread Execution Policies | p. 74 |
| Intra-PU Thread Concurrency Policy: TLP | p. 74 |
| Intra-Thread Instruction Concurrency Policy: ILP | p. 75 |
| Recovery Policies | p. 76 |
| Thread Squashing | p. 77 |
| Basic Block Squashing | p. 77 |
| Instruction Re-execution | p. 78 |
| Exception Handling | p. 78 |
| Exceptions | p. 78 |
| Interrupt Handling | p. 79 |
| Concluding Remarks | p. 80 |
| Multiscalar Processor--Control Flow | p. 81 |
| Inter-Thread Control Flow Predictor | p. 81 |
| Dynamic Inter-Thread Control Prediction | p. 82 |
| Control Flow Outcome | p. 83 |
| Thread History | p. 84 |
| Prediction Automata | p. 85 |
| History Updates | p. 86 |
| Return Address Prediction | p. 87 |
| Intra-Thread Branch Prediction | p. 92 |
| Problems with Conventional Branch Predictors | p. 93 |
| Bimodal Predictor | p. 96 |
| Extrapolation with Shared Predictor | p. 96 |
| Correlation with Thread-Level Information to Obtain Accurate History | p. 97 |
| Hybrid of Extrapolation and Correlation | p. 99 |
| Intra-Thread Return Address Prediction | p. 99 |
| Private RASes with Support from Inter-Thread RAS | p. 100 |
| Detailed Example | p. 100 |
| Instruction Supply | p. 101 |
| Instruction Cache Options | p. 101 |
| A Hybrid Instruction Cache Organization for Multiscalar Processor | p. 104 |
| Static Thread Descriptor Cache (STDC) | p. 105 |
| Concluding Remarks | p. 106 |
| Multiscalar Processor--Register Data Flow | p. 109 |
| Nature of Register Data Flow in a Multiscalar Processor | p. 110 |
| Correctness Issues: Synchronization | p. 111 |
| Register Data Flow in Example Code | p. 112 |
| Performance Issues | p. 113 |
| Decentralized Register File | p. 114 |
| Multi-Version Register File--Basic Idea | p. 115 |
| Local Register File | p. 116 |
| Performing Intra-Thread Register Data Flow | p. 116 |
| Performing Inter-Thread Register Data Flow | p. 117 |
| Inter-Thread Synchronization: Busy Bits | p. 119 |
| How are Busy Bits Set? Forwarding of Create Mask | p. 119 |
| How are Busy Bits Reset? Forwarding of Register Values | p. 121 |
| Strategies for Inter-Thread Forwarding | p. 123 |
| Multi-Version Register File--Detailed Operation | p. 126 |
| Algorithms for Register Write and Register Read | p. 127 |
| Committing a Thread | p. 128 |
| Squashing a Thread | p. 130 |
| Example | p. 131 |
| Data Speculation: Relaxing Inter-Thread Synchronization | p. 133 |
| Producer Identity Speculation | p. 134 |
| Producer Result Speculation | p. 138 |
| Consumer Source Speculation | p. 143 |
| Compiler and ISA Support | p. 144 |
| Inter-Thread Data Flow Information | p. 145 |
| Utilizing Dead Register Information | p. 146 |
| Effect of Anti-Dependences | p. 147 |
| Concluding Remarks | p. 148 |
| Multiscalar Processor--Memory Data Flow | p. 151 |
| Nature of Memory Data Flow in a Multiscalar Processor | p. 152 |
| Example | p. 152 |
| Performance Issues | p. 154 |
| Address Resolution Buffer (ARB) | p. 156 |
| Basic Idea | p. 156 |
| Hardware Structure | p. 157 |
| Handling of Loads and Stores | p. 158 |
| Committing or Squashing a Thread | p. 160 |
| Reclaiming the ARB Entries | p. 161 |
| Example | p. 162 |
| Two-Level Hierarchical ARB | p. 164 |
| Novel Features of ARB | p. 164 |
| ARB Extensions | p. 166 |
| Memory Dependence Table: Controlled Dependence Speculation | p. 167 |
| Multi-Version Cache | p. 168 |
| Local Data Cache | p. 168 |
| Performing Intra-Thread Memory Data Flow | p. 170 |
| Performing Inter-Thread Memory Data Flow | p. 171 |
| Detailed Working | p. 172 |
| Comparison with Multiprocessor Caches | p. 175 |
| Speculative Version Cache | p. 175 |
| Concluding Remarks | p. 177 |
| Multiscalar Compilation | p. 179 |
| Role of the Compiler | p. 179 |
| Correctness Issues | p. 181 |
| Performance Issues | p. 181 |
| Compiler Organization | p. 181 |
| Program Partitioning Criteria | p. 183 |
| Thread Size Criteria | p. 183 |
| Control Flow Criteria | p. 184 |
| Data Dependence Criteria | p. 185 |
| Interaction Among the Criteria | p. 188 |
| Program Partitioning Heuristics | p. 188 |
| Basic Thread Formation Process | p. 189 |
| Control Flow Heuristic | p. 190 |
| Data Dependence Heuristics | p. 190 |
| Loop Recurrence Heuristics | p. 194 |
| Implementation of Program Partitioning | p. 194 |
| Program Profiling | p. 194 |
| Optimizations | p. 195 |
| Code Replication | p. 195 |
| Code Layout | p. 195 |
| Intra-Thread Static Scheduling | p. 196 |
| Identifying the Instructions for Motion | p. 197 |
| Cost Model | p. 198 |
| Code Transformations | p. 199 |
| Scheduling Loop Induction Variables | p. 199 |
| Controlling Code Explosion | p. 200 |
| Crosscutting Issues | p. 202 |
| Concluding Remarks | p. 204 |
| Recent Developments | p. 207 |
| Incorporating Fault Tolerance | p. 207 |
| Where to Execute the Duplicate Thread? | p. 208 |
| When to Execute the Duplicate Thread? | p. 209 |
| Partitioning of PUs | p. 210 |
| Multiscalar Processor with Trace-based Threads | p. 211 |
| Implementation Hurdles of Complex Threads | p. 212 |
| Tree-Like Threads | p. 213 |
| Instruction Cache Organization | p. 215 |
| Advantages | p. 216 |
| Trace Processors | p. 216 |
| Hierarchical Multiscalar Processor | p. 217 |
| Microarchitecture | p. 219 |
| Inter-Superthread Register Data Flow | p. 219 |
| Inter-Superthread Memory Data Flow | p. 221 |
| Advantages of Hierarchical Multiscalar Processing | p. 221 |
| Compiler-Directed Thread Execution | p. 221 |
| Non-speculative Inter-Thread Memory Data Flow | p. 221 |
| Thread-Level Pipelining | p. 222 |
| Increased Role of Compiler | p. 222 |
| A Commercial Implementation: NEC Merlot | p. 223 |
| Index | p. 235 |