Microprocessor Architecture: From Simple Pipelines to Chip Multiprocessors

Jean-Loup Baer

doi:10.1017/CBO9780511811258

Bibliography

Abel, N., Budnick, D., Kuck, D., Muraoka, Y., Northcote, R., and Wilhelmson, R., “TRANQUIL: A Language for an Array Processing Computer,” Proc. AFIPS SJCC, 1969, 57–73

Adiletta, M., Rosenbluth, M., Bernstein, D., Wolrich, G., and Wilkinson, H., “The Next Generation of Intel IXP Network Processors,” Intel Tech. Journal, 6, 3, Aug. 2002, 6–18

Adve, S. and Gharachorloo, K., “Shared Memory Consistency Models: A Tutorial,” IEEE Computer, 29, 12, Dec. 1996, 66–76

Agarwal, A., Bianchini, R., Chaiken, D., Johnson, K., Kranz, D., Kubiatowicz, J., Lim, B.-H., Mackenzie, K., and Yeung, D., “The MIT Alewife Machine: Architecture and Performance,” Proc. 22nd Int. Symp. on Computer Architecture, 1995, 2–13

Agarwal, A., Lim, B.-H., Kranz, D., and Kubiatowicz, J., “APRIL: A Processor Architecture for Multiprocessing,” Proc. 17th Int. Symp. on Computer Architecture, 1990, 104–114

Agarwal, A. and Pudar, S., “Column-Associative Caches: A Technique for Reducing the Miss Rate of Direct-Mapped Caches,” Proc. 20th Int. Symp. on Computer Architecture, 1993, 179–190

Agarwal, A., Simoni, R., Hennessy, J., and Horowitz, M., “An Evaluation of Directory Schemes for Cache Coherence,” Proc. 15th Int. Symp. on Computer Architecture, 1988, 280–289

Aggarwal, A. and Franklin, M., “Scalability Aspects of Instruction Distribution Algorithms for Clustered Processors,” IEEE Trans. on Parallel and Distributed Systems, 16, 10, Oct. 2005, 944–955

Akkary, H. and Driscoll, M., “A Dynamic Multithreading Processor,” Proc. 31st Int. Symp. on Microarchitecture, 1998, 226–236

Albonesi, D., Balasubramonian, R., Dropsho, S., Dwarkadas, S., Friedman, E., Huang, M., Kursun, V., Magklis, G., Scott, M., Semeraro, G., Bose, P., Buyuktosunoglu, A., Cook, P., and Schuster, S., “Dynamically Tuning Processor Resources with Adaptive Processing,” IEEE Computer, 36, 12, Dec. 2003, 49–58

Alverson, R., Callahan, D., Cummings, D., Koblenz, B., Porterfield, A., and Smith, B., “The Tera Computer System,” Proc. Int. Conf. on Supercomputing, 1990, 1–6

Amdahl, G., “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities,” Proc. AFIPS SJCC, 30, Apr. 1967, 483–485

Anderson, D., Sparacio, F., and Tomasulo, R., “Machine Philosophy and Instruction Handling,” IBM Journal of Research and Development, 11, 1, Jan. 1967, 8–24

Anderson, S., Earle, J., Goldschmidt, R., and Powers, D., “The IBM System/360 Model 91: Floating-point Execution Unit,” IBM Journal of Research and Development, 11, Jan. 1967, 34–53

Anderson, T., “The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors,” IEEE Trans. on Parallel and Distributed Systems, 1, 1, Jan. 1990, 6–16

Archibald, J. and Baer, J.-L., “An Economical Solution to the Cache Coherence Problem,” Proc. 12th Int. Symp. on Computer Architecture, 1985, 355–362

Archibald, J. and Baer, J.-L., “Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model,” ACM Trans. on Computer Systems, 4, 4, Nov. 1986, 273–298

August, D., Connors, D., Mahlke, S., Sias, J., Crozier, K., Cheng, B., Eaton, P., Olaniran, Q., and Hwu, W.-m., “Integrated Predicated and Speculative Execution in the IMPACT EPIC Architecture,” Proc. 25th Int. Symp. on Computer Architecture, 1998, 227–237

Austin, T., Larson, D., and Ernst, D., “SimpleScalar: An Infrastructure for Computer System Modeling,” IEEE Computer, 35, 2, Feb. 2002, 59–67

Baer, J.-L. and Wang, W.-H., “On the Inclusion Properties for Multi-Level Cache Hierarchies,” Proc. 15th Int. Symp. on Computer Architecture, 1988, 73–80

Baetke, F., “The CONVEX Exemplar SPP1000 and SPP1200 – New Scalable Parallel Systems with a Virtual Shared Memory Architecture,” in Dongarra, J., Grandinetti, L., Joubert, G., and Kowalik, J., Eds., High Performance Computing: Technology, Methods and Applications, Elsevier Press, 1995, 81–102

Balasubramonian, R., Albonesi, D., Buyuktosunoglu, A., and Dwarkadas, S., “Memory Hierarchy Reconfiguration for Energy and Performance in General-purpose Processor Architectures,” Proc. 33rd Int. Symp. on Microarchitecture, 2000, 245–257

Belady, L., “A Study of Replacement Algorithms for a Virtual Storage Computer,” IBM Systems Journal, 5, 1966, 78–101

Bernstein, A., “Analysis of Programs for Parallel Processing,” IEEE Trans. on Electronic Computers, EC-15, Oct. 1966, 746–757

Bhandarkar, D., Alpha Implementations and Architecture. Complete Reference and Guide, Digital Press, Boston, 1995

Boggs, D., Baktha, A., Hawkins, J., Marr, D., Miller, J., Roussel, P., Singhal, R., Toll, B., and Venkatraman, K., “The Microarchitecture of the Pentium 4 Processor on 90nm Technology,” Intel Tech. Journal, 8, 1, Feb. 2004, 1–17

Borkenhagen, J., Eickemeyer, R., Kalla, R., and Kunkel, S., “A Multithreaded PowerPC Processor for Commercial Servers,” IBM Journal of Research and Development, 44, 6, 2000, 885–899

Brooks, D. and Martonosi, M., “Dynamic Thermal Management in High-Performance Microprocessors,” Proc. 7th Int. Symp. on High-Performance Computer Architecture, 2001, 171–182

Buchholz, W., Ed., Planning a Computer System: Project Stretch, McGraw-Hill, New York, 1962

Calder, B. and Grunwald, D., “Fast & Accurate Instruction Fetch and Branch Prediction,” Proc. 21st Int. Symp. on Computer Architecture, 1994, 2–11

Calder, B. and Grunwald, D., “Next Cache Line and Set Prediction,” Proc. 22nd Int. Symp. on Computer Architecture, 1995, 287–296

Calder, B., Grunwald, D., and Emer, J., “Predictive Sequential Associative Cache,” Proc. 2nd Int. Symp. on High-Performance Computer Architecture, 1996, 244–253

Calder, B. and Reinmann, G., “A Comparative Survey of Load Speculation Architectures,” Journal of Instruction-Level Parallelism, 1, 2000, 1–39

Canal, R., Parcerisa, J.M., and González, A., “Dynamic Cluster Assignment Mechanisms,” Proc. 6th Int. Symp. on High-Performance Computer Architecture, 2000, 133–141

Cantin, J. and Hill, M., Cache Performance for SPEC CPU2000 Benchmarks, Version 3.0, May 2003, http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data/

Case, R. and Padegs, A., “The Architecture of the IBM System/370,” Communications of the ACM, 21, 1, Jan. 1978, 73–96

Censier, L. and Feautrier, P., “A New Solution to Coherence Problems in Multicache Systems,” IEEE Trans. on Computers, 27, 12, Dec. 1978, 1112–1118

Chan, K., Hay, C., Keller, J., Kurpanek, G., Shumaker, F., and Zheng, J., “Design of the HP PA 7200 CPU,” Hewlett Packard Journal, 47, 1, Jan. 1996, 25–33

Chaudhry, S., Caprioli, P., Yip, S., and Tremblay, M., “High-Performance Throughput Computing,” IEEE Micro, 25, 3, May 2005, 32–45

Chen, T.-F. and Baer, J.-L., “Effective Hardware-based Data Prefetching for High-Performance Processors,” IEEE Trans. on Computers, 44, 5, May 1995, 609–623

Cheng, I-C., Coffey, J., and Mudge, T., “Analysis of Branch Prediction via Data Compression,” Proc. 7th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 1996, 128–137

Christie, D., “Developing the AMD-K5 Architecture,” IEEE Micro, 16, 2, Mar. 1996, 16–27

Chrysos, G. and Emer, J., “Memory Dependence Prediction Using Store Sets,” Proc. 25th Int. Symp. on Computer Architecture, 1998, 142–153

Citron, D., Hurani, A., and Gnadrey, A., “The Harmonic or Geometric Mean: Does it Really Matter,” Computer Architecture News, 34, 6, Sep. 2006, 19–26

Colwell, R., Papworth, D., Hinton, G., Fetterman, M., and Glew, A., “Intel's P6 Microarchitecture,” Chapter 7 in Shen, J. P. and Lipasti, M., Eds., Modern Processor Design, 2005, 329–367

Conte, T., Menezes, K., Mills, P., and Patel, B., “Optimization of Instruction Fetch Mechanisms for High Issue Rates,” Proc. 22nd Int. Symp. on Computer Architecture, 1995, 333–344

Conti, C., Gibson, D., and Pitkowsky, S., “Structural Aspects of the IBM System 360/85; General Organization,” IBM Systems Journal, 7, 1968, 2–14

Cooksey, R., Jourdan, S., and Grunwald, D., “A Stateless, Content-Directed Data Prefetching Mechanism,” Proc. 10th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 2002, 279–290

Crisp, R., “Direct Rambus Technology: The New Main Memory Standard,” IEEE Micro, 17, 6, Nov.–Dec. 1997, 18–28

Culler, D. and Singh, J.P. with Gupta, A., Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers, San Francisco, 1999

Cuppu, V., Jacob, B., Davis, B., and Mudge, T., “High-Performance DRAMs in Workstation Environments,” IEEE Trans. on Computers, 50, 11, Nov. 2001, 1133–1153

Curnow, H. and Wichmann, B., “Synthetic Benchmark,” Computer Journal, 19, 1, Feb. 1976

Cvetanovic, Z. and Bhandarkar, D., “Performance Characterization of the Alpha 21164 Microprocessor Using TP and SPEC Workloads,” Proc. 2nd Int. Symp. on High-Performance Computer Architecture, 1996, 270–280

Cvetanovic, Z. and Kessler, R., “Performance Analysis of the Alpha 21264-based Compaq ES40 System,” Proc. 27th Int. Symp. on Computer Architecture, 2000, 192–202

Dally, W., “Virtual-Channel Flow Control,” Proc. 17th Int. Symp. on Computer Architecture, 1990, 60–68

Denning, P., “Virtual Memory,” ACM Computing Surveys, 2, Sep. 1970, 153–189

Dennis, J. and Misunas, D., “A Preliminary Data Flow Architecture for a Basic Data Flow Processor,” Proc. 2nd Int. Symp. on Computer Architecture, 1974, 126–132

Dongarra, J., Bunch, J., Moler, C., and Stewart, G., LINPACK User's Guide, SIAM, Philadelphia, 1979

Dongarra, J., Luszczek, P., and Petitet, A., “The LINPACK Benchmark: Past, Present, and Future,” Concurrency and Computation: Practice and Experience, 15, 2003, 1–18

Dubois, M., Scheurich, C., and Briggs, F., “Memory Access Buffering in Multiprocessors,” Proc. 13th Int. Symp. on Computer Architecture, 1986, 434–442

Eden, A. and Mudge, T., “The YAGS Branch Prediction Scheme,” Proc. 31st Int. Symp. on Microarchitecture, 1998, 69–77

Edmondson, J., Rubinfeld, P., Preston, R., and Rajagopalan, V., “Superscalar Instruction Execution in the 21164 Alpha Microprocessor,” IEEE Micro, 15, 2, Apr. 1995, 33–43

Eggers, S., Emer, J., Levy, H., Lo, J., Stamm, R., and Tullsen, D., “Simultaneous Multithreading: A Platform for Next-Generation Processors,” IEEE Micro, 17, 5, Sep. 1997, 12–19

Fagin, B. and Russell, K., “Partial Resolution in Branch Target Buffers,” Proc. 28th Int. Symp. on Microarchitecture, 1995, 193–198

Farkas, D. and Jouppi, N., “Complexity/Performance Trade-offs with Non-Blocking Loads,” Proc. 21st Int. Symp. on Computer Architecture, 1994, 211–222

Fields, B., Bodik, R., and Hill, M., “Slack: Maximizing Performance under Technological Constraints,” Proc. 29th Int. Symp. on Computer Architecture, 2002, 47–58

Flynn, M., “Very High Speed Computing Systems,” Proc. IEEE, 54, 12, Dec. 1966, 1901–1909

Folegnani, D. and González, A., “Energy-effective Issue Logic,” Proc. 28th Int. Symp. on Computer Architecture, 2001, 230–239

Franklin, M. and Sohi, G., “A Hardware Mechanism for Dynamic Reordering of Memory References,” IEEE Trans. on Computers, 45, 6, Jun. 1996, 552–571

Gharachorloo, K., Gupta, A., and Hennessy, J., “Two Techniques to Enhance the Performance of Memory Consistency Models,” Proc. Int. Conf. on Parallel Processing, 1991, I-355–364

Gochman, S., Ronen, R., Anati, I., Berkovits, R., Kurts, T., Naveh, A., Saeed, A., Sperber, Z., and Valentine, R., “The Intel Pentium M Processor: Microarchitecture and Performance,” Intel Tech. Journal, 7, 2, May 2003, 21–39

Golden, M. and Mudge, T., “A Comparison of Two Pipeline Organizations,” Proc. 27th Int. Symp. on Microarchitecture, 1994, 153–161

Goodman, J., Vernon, M., and Woest, P., “Efficient Synchronization Primitives for Large-Scale Cache Coherent Multiprocessors,” Proc. 3rd Int. Conf. on Architectural Support for Programming Languages and Operating Systems, Apr. 1989, 64–73

Graunke, G. and Thakkar, S., “Synchronization Algorithms for Shared-Memory Multiprocessors,” IEEE Computer, 23, 6, Jun. 1990, 60–70

Grunwald, D., Levis, P., Farkas, K., Morrey, C., and Neufeld, M., “Policies for Dynamic Clock Scheduling,” Proc. 4th USENIX Symp. on Operating Systems Design and Implementation, 2000, 73–86

Gschwind, M., Hofstee, H., Flachs, B., Hopkins, M., Watanabe, Y., and Yamazaki, T., “Synergistic Processing in Cell's Multicore Architecture,” IEEE Micro, 26, 2, Mar. 2006, 11–24

Gunther, S., Beans, F., Carmean, D., and Hall, J., “Managing the Impact of Increasing Power Consumption,” Intel Tech. Journal, 5, 1, Feb. 2001, 1–9

Gwennap, L., “Brainiacs, Speed Demons, and Farewell,” Microprocessor Report Newsletter, 13, 7, Dec. 1999

Hallnor, E. and Reinhardt, S., “A Fully Associative Software-Managed Cache Design,” Proc. 27th Int. Symp. on Computer Architecture, 2000, 107–116

Hao, E., Chang, P.-Y., and Patt, Y., “The Effect of Speculatively Updating Branch History on Branch Prediction Accuracy, Revisited,” Proc. 27th Int. Symp. on Microarchitecture, 1994, 228–232

Hartstein, A. and Puzak, T., “The Optimum Pipeline Depth for a Microprocessor,” Proc. 29th Int. Symp. on Computer Architecture, 2002, 7–13

Hennessy, J. and Patterson, D., Computer Architecture: A Quantitative Approach, Fourth Edition, Elsevier Inc., San Francisco, 2007

Henning, J., Ed., “SPEC CPU2006 Benchmark Descriptions,” Computer Architecture News, 36, 4, Sep. 2006, 1–17

Hill, M., Aspects of Cache Memory and Instruction Buffer Performance, Ph.D. Dissertation, Univ. of California, Berkeley, Nov. 1987

Hill, M., “Multiprocessors Should Support Simple Memory-Consistency Models,” IEEE Computer, 31, 8, Aug. 1998, 28–34

Hinton, G., Sager, D., Upton, M., Boggs, D., Carmean, D., Kyker, A., and Roussel, P., “The Microarchitecture of the Pentium 4 Processor,” Intel Tech. Journal, 1, Feb. 2001

Ho, R., Mai, K., and Horowitz, M., “The Future of Wires,” Proc. of the IEEE, 89, 4, Apr. 2001, 490–504

Hrishikesh, M., Jouppi, N., Farkas, K., Burger, D., Keckler, S., and Shivakumar, P., “The Optimal Logic Depth per Pipeline Stage is 6 to 8 FO4 Inverter Delays,” Proc. 29th Int. Symp. on Computer Architecture, 2002, 14–24

Huck, J., Morris, D., Ross, J., Knies, A., Mulder, H., and Zahir, R., “Introducing the IA-64 Architecture,” IEEE Micro, 20, 5, Sep. 2000, 12–23

Hwu, W.-m. and Patt, Y., “HPSm, A High-Performance Restricted Data Flow Architecture Having Minimal Functionality,” Proc. 13th Int. Symp. on Computer Architecture, 1986, 297–307

Intel Corp., A Tour of the P6 Microarchitecture, 1995, http://www.x86.org/ftp/manuals/686/p6tour.pdf

Jeremiassen, T. and Eggers, S., “Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations,” Proc. 5th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, 1995, 179–188

Jiménez, D., Keckler, S., and Lin, C., “The Impact of Delay on the Design of Branch Predictors,” Proc. 33rd Int. Symp. on Microarchitecture, 2000, 67–76

John, L., “More on Finding a Single Number to Indicate Overall Performance of a Benchmark Suite,” Computer Architecture News, 32, 1, Mar. 2004, 3–8

Joseph, D. and Grunwald, D., “Prefetching Using Markov Predictors,” Proc. 24th Int. Symp. on Computer Architecture, 1997, 252–263

Jouppi, N., “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers,” Proc. 17th Int. Symp. on Computer Architecture, 1990, 364–373

Jourdan, S., Stark, J., Hsing, T.-H., and Patt, Y., “Recovery Requirements of Branch Prediction Storage Structures in the Presence of Mispredicted-path Execution,” International Journal of Parallel Programming, 25, Oct. 1997, 363–383

Kaeli, D. and Emma, P., “Branch History Table Prediction of Moving Target Branches Due to Subroutine Returns,” Proc. 18th Int. Symp. on Computer Architecture, 1991, 34–42

Kagi, A., Burger, D., and Goodman, J., “Efficient Synchronization: Let them Eat QOLB,” Proc. 24th Int. Symp. on Computer Architecture, 1997, 170–180

Kahle, J., Day, M., Hofstee, H., Johns, C., Maeurer, T., and Shippy, D., “Introduction to the Cell Multiprocessor,” IBM Journal of Research and Development, 49, 4/5, Jul. 2005, 589–604

Kalamatianos, J., Khalafi, A., Kaeli, D., and Meleis, W., “Analysis of Temporal-based Program Behavior for Improved Instruction Cache Performance,” IEEE Trans. on Computers, 48, 2, Feb. 1999, 168–175

Kalla, R., Sinharoy, B., and Tendler, J., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro, 24, 2, Apr. 2004, 40–47

Kapasi, U., Rixner, S., Dally, W., Khailany, B., Ahn, J., Mattson, P., and Owens, J., “Programmable Stream Processors,” IEEE Computer, 36, 8, Aug. 2003, 54–62

Kaxiras, S., Hu, Z., and Martonosi, M., “Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power,” Proc. 28th Int. Symp. on Computer Architecture, 2001, 240–251

Keller, R., “Look-Ahead Processors,” ACM Computing Surveys, 7, 4, Dec. 1975, 177–195

Keltcher, C., McGrath, J., Ahmed, A., and Conway, P., “The AMD Opteron for Multiprocessor Servers,” IEEE Micro, 23, 2, 2003, 66–76

Kendall Square Research, KSR1 Technology Background, Waltham, MA, 1992

Kermani, P. and Kleinrock, L., “Virtual Cut-through: A New Computer Communication Switching Technique,” Computer Networks, 3, 4, Sep. 1979, 267–286

Kerns, D. and Eggers, S., “Balanced Scheduling: Instruction Scheduling when Memory Latency is Uncertain,” Proc. ACM SIGPLAN Conf. on Programming Language Design and Implementation, SIGPLAN Notices, 28, 6, Jun. 1993, 278–289

Keshava, J. and Pentkovski, V., “Pentium III Processor Implementation Tradeoffs,” Intel Tech. Journal, 2, May 1999

Kessler, R., “The Alpha 21264 Microprocessor,” IEEE Micro, 19, 2, Mar. 1999, 24–36

Kessler, R., Jooss, R., Lebeck, A., and Hill, M., “Inexpensive Implementations of Set-Associativity,” Proc. 16th Int. Symp. on Computer Architecture, 1989, 131–139

Kilburn, T., Edwards, D., Lanigan, M., and Sumner, F., “One-level Storage System,” IRE Trans. on Electronic Computers, EC-11, 2, Apr. 1962, 223–235

Kim, C., Burger, D., and Keckler, S., “An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches,” Proc. 10th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 2002, 211–222

Kim, N., Flautner, K., Blaauw, D., and Mudge, T., “Drowsy Instruction Caches – Leakage Power Reduction Using Dynamic Voltage Scaling and Cache Sub-bank Prediction,” Proc. 29th Int. Symp. on Computer Architecture, 2002, 219–230

KleinOsowski, A. and Lilja, D., “MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research,” Computer Architecture Letters, 1, Jun. 2002

Kogge, P., The Architecture of Pipelined Computers, McGraw-Hill, New York, 1981

Kongetira, P., Aingaran, K., and Olukotun, K., “Niagara: A 32-way Multithreaded Sparc Processor,” IEEE Micro, 24, 2, Apr. 2005, 21–29

Koufaty, D. and Marr, D., “Hyperthreading Technology in the Netburst Microarchitecture,” IEEE Micro, 23, 2, Mar. 2003, 56–65

Kroft, D., “Lockup-Free Instruction Fetch/Prefetch Cache Organization,” Proc. 8th Int. Symp. on Computer Architecture, 1981, 81–87

Lai, A., Fide, C., and Falsafi, B., “Dead-block Prediction & Dead-block Correlation Prefetchers,” Proc. 28th Int. Symp. on Computer Architecture, 2001, 144–154

Lam, M., “Software Pipelining: An Effective Scheduling Technique for VLIW Machines,” Proc. ACM SIGPLAN Conf. on Programming Language Design and Implementation, SIGPLAN Notices, 23, 7, Jul. 1988, 318–328

Lamport, L., “How to Make a Multiprocessor Computer that Correctly Executes Programs,” IEEE Trans. on Computers, 28, 9, Sep. 1979, 690–691

Larus, J. and Kozyrakis, C., “Transactional Memory,” Communications of the ACM, 51, 7, Jul. 2008, 80–88

Lee, D., Crowley, P., Baer, J.-L., Anderson, T., and Bershad, B., “Execution Characteristics of Desktop Applications on Windows NT,” Proc. 25th Int. Symp. on Computer Architecture, 1998, 27–38

Lee, J., “Study of ‘Look-Aside’ Memory,” IEEE Trans. on Computers, C-18, 11, Nov. 1969, 1062–1065

Lee, J. and Smith, A., “Branch Prediction Strategies and Branch Target Buffer Design,” IEEE Computer, 17, 1, Jan. 1984, 6–22

Lin, W.-F., Reinhardt, S., and Burger, D., “Designing a Modern Memory Hierarchy with Hardware Prefetching,” IEEE Trans. on Computers, 50, 11, Nov. 2001, 1202–1218

Lipasti, M. and Shen, J.P., “Exceeding the Dataflow Limit with Value Prediction,” Proc. 29th Int. Symp. on Microarchitecture, 1996, 226–237

Liptay, J., “Design of the IBM Enterprise System/9000 High-end Processor,” IBM Journal of Research and Development, 36, 4, Jul. 1992, 713–731

Lo, J., Barroso, L., Eggers, S., Gharachorloo, K., Levy, H., and Parekh, S., “An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors,” Proc. 25th Int. Symp. on Computer Architecture, 1998, 39–50

Loh, G., “Advanced Instruction Flow Techniques,” Chapter 9 in Shen, J. P. and Lipasti, M., Eds., Modern Processor Design, 2005, 453–518

Lovett, T. and Clapp, R., “STiNG: A CC-NUMA Computer System for the Commercial Marketplace,” Proc. 23rd Int. Symp. on Computer Architecture, 1996, 308–317

Lovett, T. and Thakkar, S., “The Symmetry Multiprocessor System,” Proc. Int. Conf. on Parallel Processing, Aug. 1988, 303–310

Mathis, H., Mericas, A., McCalpin, J., Eickemeyer, R., and Kunkel, S., “Characterization of Simultaneous Multithreading (SMT) Efficiency in Power5,” IBM Journal of Research and Development, 49, 4, Jul. 2005, 555–564

Mattson, R., Gecsei, J., Slutz, D., and Traiger, I., “Evaluation Techniques for Storage Hierarchies,” IBM Systems Journal, 9, 1970, 78–117

McFarling, S., “Combining Branch Predictors,” WRL Technical Note, TN-36, Jun. 1993

McMahon, H., “The Livermore Fortran Kernels Test of the Numerical Performance Range,” in Martin, J. L., Ed., Performance Evaluation of Supercomputers, Elsevier Science B.V., North-Holland, Amsterdam, 1988, 143–186

McNairy, C. and Soltis, D., “Itanium 2 Processor Microarchitecture,” IEEE Micro, 23, 2, Mar. 2003, 44–55

Mendelson, A., Mandelblat, J., Gochman, S., Shemer, A., Chabukswar, R., Niemeyer, E., and Kumar, A., “CMP Implementation in Systems Based on the Intel Core Duo Processor,” Intel Tech. Journal, 10, 2, May 2006, 99–107

Moore, G., “Cramming More Components onto Integrated Circuits,” Electronics, 38, 8, Apr. 1965

Moshovos, A., Breach, S., Vijaykumar, T., and Sohi, G., “Dynamic Speculation and Synchronization of Data Dependences,” Proc. 24th Int. Symp. on Computer Architecture, 1997, 181–193

Mowry, T., Lam, M., and Gupta, A., “Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors,” Proc. 5th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 1992, 62–73

Mutlu, O., Stark, J., Wilkerson, C., and Patt, Y., “Run-ahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,” Proc. 9th Int. Symp. on High-Performance Computer Architecture, 2003, 129–140

Naveh, A., Rotem, E., Mendelson, A., Gochman, S., Chabukswar, R., Krishnan, K., and Kumar, A., “Power and Thermal Management in the Intel Core Duo Processor,” Intel Tech. Journal, 10, 2, May 2006, 109–122

Ozer, E., Banerjia, S., and Conte, T., “Unified Assign and Schedule: A New Approach to Scheduling for Clustered Register File Microarchitectures,” Proc. 31st Int. Symp. on Microarchitecture, 1998, 308–315

Palacharla, S., Jouppi, N., and Smith, J., “Complexity-Effective Superscalar Processors,” Proc. 24th Int. Symp. on Computer Architecture, 1997, 206–218

Palacharla, S. and Kessler, R., “Evaluating Stream Buffers as a Secondary Cache Replacement,” Proc. 21st Int. Symp. on Computer Architecture, 1994, 24–33

Pan, S., So, K., and Rahmey, J., “Improving the Accuracy of Dynamic Branch Prediction using Branch Correlation,” Proc. 5th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 1992, 76–84

Papamarcos, M. and Patel, J., “A Low-overhead Coherence Solution for Multiprocessors with Private Cache Memories,” Proc. 12th Int. Symp. on Computer Architecture, 1985, 348–354

Papworth, D., “Tuning the Pentium Pro Microarchitecture,” IEEE Micro, 16, 2, Mar. 1996, 8–15

Patel, S., Friendly, D., and Patt, Y., “Evaluation of Design Options for the Trace Cache Fetch Mechanism,” IEEE Trans. on Computers, 48, 2, Feb. 1999, 193–204

Patterson, D. and Hennessy, J., Computer Organization & Design: The Hardware/Software Interface, Third Edition, Morgan Kaufmann Publishers, San Francisco, 2004

Patterson, D. and Séquin, C., “RISC I: A Reduced Instruction Set VLSI Computer,” Proc. 8th Int. Symp. on Computer Architecture, 1981, 443–457

Peir, J.-K., Hsu, W., and Smith, A., “Functional Implementation Techniques for CPU Cache Memories,” IEEE Trans. on Computers, 48, 2, Feb. 1999, 100–110

Peleg, A. and Weiser, U., “Dynamic Flow Instruction Cache Memory Organized Around Trace Segments Independent of Virtual Address Line,” U.S. Patent Number 5,381,533, 1994

Peleg, A. and Weiser, U., “MMX Technology Extension to the Intel Architecture,” IEEE Micro, 16, 4, Aug. 1996, 42–50

Perleberg, C. and Smith, A., “Branch Target Buffer Design and Optimization,” IEEE Trans. on Computers, 42, 4, Apr. 1993, 396–412

Pettis, K. and Hansen, R., “Profile Guided Code Positioning,” Proc. ACM SIGPLAN Conf. on Programming Language Design and Implementation, SIGPLAN Notices, 25, Jun. 1990, 16–27

Ponomarev, D., Kucuk, G., and Ghose, K., “Reducing Power Requirements of Instruction Scheduling through Dynamic Allocation of Multiple Datapath Resources,” Proc. 34th Int. Symp. on Microarchitecture, 2001, 90–101

Postiff, M., Tyson, G., and Mudge, T., “Performance Limits of Trace Caches,” Journal of Instruction-Level Parallelism, 1, Sep. 1999, 1–17

Przybylski, S., Cache Design: A Performance Directed Approach, Morgan Kaufmann Publishers, San Francisco, 1990

Pugh, E., Johnson, L., and Palmer, J., IBM's 360 and Early 370 Systems, The MIT Press, Cambridge, MA, 1991

Ranganathan, P., Adve, S., and Jouppi, N., “Performance of Image and Video Processing with General-Purpose Processors and Media ISA Extensions,” Proc. 26th Int. Symp. on Computer Architecture, 1999, 124–135

Riseman, E. and Foster, C., “The Inhibition of Potential Parallelism by Conditional Jumps,” IEEE Trans. on Computers, C-21, 12, Dec. 1972, 1405–1411

Romer, T., Lee, D., Voelker, G., Wolman, A., Wong, W., Baer, J.-L., Bershad, B., and Levy, H., “The Structure and Performance of Interpreters,” Proc. 7th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 1996, 150–159

Rotenberg, E., Bennett, S., and Smith, J., “Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching,” Proc. 29th Int. Symp. on Microarchitecture, 1996, 24–34

Rudolph, L. and Segall, Z., “Dynamic Decentralized Cache Schemes for MIMD Parallel Processors,” Proc. 11th Int. Symp. on Computer Architecture, 1984, 340–347

Salverda, P. and Zilles, C., “A Criticality Analysis of Clustering in Superscalar Processors,” Proc. 38th Int. Symp. on Microarchitecture, 2005, 55–66

Schlansker, M. and Rau, B., “EPIC: Explicitly Parallel Instruction Computing,” IEEE Computer, 33, 2, Feb. 2000, 37–45

Scott, S., “Synchronization and Communication in the Cray T3E Multiprocessor,” Proc. 7th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 1996, 26–36

Seznec, A., “A Case for Two-way Skewed-Associative Caches,” Proc. 20th Int. Symp. on Computer Architecture, 1993, 169–178

Sharangpani, H. and Arora, K., “Itanium Processor Microarchitecture,” IEEE Micro, 20, 5, Sep. 2000, 24–43

Shen, J. P. and Lipasti, M., Modern Processor Design: Fundamentals of Superscalar Processors, McGraw-Hill, 2005

Sherwood, T., Perelman, E., Hamerly, G., Sair, S., and Calder, B., “Discovering and Exploiting Program Phases,” IEEE Micro, 23, 6, Nov.–Dec. 2003, 84–93

Sima, D., “The Design Space of Register Renaming Techniques,” IEEE Micro, 20, 5, Sep. 2000, 70–83

Skadron, K., Martonosi, M., and Clark, D., “Speculative Updates of Local and Global Branch History: A Quantitative Analysis,” Journal of Instruction-Level Parallelism, 2, 2000, 1–23

Skadron, K., Stan, M., Huang, W., Velusamy, S., Sankaranarayanan, K., and Tarjan, D., “Temperature-Aware Microarchitecture,” Proc. 30th Int. Symp. on Computer Architecture, 2003, 2–13

Slingerland, N. and Smith, A., “Multimedia Extensions for General-Purpose Microprocessors: A Survey,” Microprocessors and Microsystems, 29, 5, Jan. 2005, 225–246

Smith, A., “Cache Memories,” ACM Computing Surveys, 14, 3, Sep. 1982, 473–530

Smith, B., “A Pipelined, Shared Resource MIMD Computer,” Proc. Int. Conf. on Parallel Processing, 1978, 6–8

Smith, J., “A Study of Branch Prediction Strategies,” Proc. 8th Int. Symp. on Computer Architecture, 1981, 135–148

Smith, J., “Characterizing Computer Performance with a Single Number,” Communications of the ACM, 31, 10, Oct. 1988, 1201–1206

Smith, J. and Pleszkun, A., “Implementation of Precise Interrupts in Pipelined Processors,” IEEE Trans. on Computers, C-37, 5, May 1988, 562–573 (an earlier version was published in Proc. 12th Int. Symp. on Computer Architecture, 1985)

Smith, J. and Sohi, G., “The Microarchitecture of Superscalar Processors,” Proc. IEEE, 83, 12, Dec. 1995, 1609–1624

Sohi, G., “Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers,” IEEE Trans. on Computers, C-39, 3, Mar. 1990, 349–359 (an earlier version with co-author S. Vajapeyam was published in Proc. 14th Int. Symp. on Computer Architecture, 1987)

Sohi, G., Breach, S., and Vijaykumar, T., “Multiscalar Processors,” Proc. 22nd Int. Symp. on Computer Architecture, 1995, 414–425

Sohi, G. and Roth, A., “Speculative Multithreaded Processors,” IEEE Computer, 34, 4, Apr. 2001, 66–73

Srinivasan, S., Ju, D.-C., Lebeck, A., and Wilkerson, C., “Locality vs. Criticality,” Proc. 28th Int. Symp. on Computer Architecture, 2001, 132–143

Stark, J., Brown, M., and Patt, Y., “On Pipelining Dynamic Instruction Scheduling Logic,” Proc. 33rd Int. Symp. on Microarchitecture, 2000, 57–66

Stunkel, C., Herring, J., Abali, B., and Sivaram, R., “A New Switch Chip for IBM RS/6000 SP Systems,” Proc. Int. Conf. on Supercomputing, 1999, 16–33

Sweazey, P. and Smith, A., “A Class of Compatible Cache Consistency Protocols and their Support by the IEEE Future Bus,” Proc. 13th Int. Symp. on Computer Architecture, 1986, 414–423

Tendler, J., Dodson, J., Fields, Jr. J., Le, H., and Sinharoy, B., “POWER 4 System Microarchitecture,” IBM Journal of Research and Development, 46, 1, Jan. 2002, 5–25

Thornton, J., “Parallel Operation in the Control Data 6600,” Proc. AFIPS FJCC, pt. 2, vol. 26, 1964, 33–40 (reprinted as Chapter 39 of Bell, C. and Newell, A., Eds., Computer Structures: Readings and Examples, McGraw-Hill, New York, 1971, and Chapter 43 of Siewiorek, D., Bell, C., and Newell, A., Eds., Computer Structures: Principles and Examples, McGraw-Hill, New York, 1982)

Thornton, J., Design of a Computer. The Control Data 6600, Scott, Foresman and Co., Glenview, IL, 1970

Tjaden, G. and Flynn, M., “Detection and Parallel Execution of Independent Instructions,” IEEE Trans. on Computers, C-19, 10, Oct. 1970, 889–895

Tomasulo, R., “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM Journal of Research and Development, 11, 1, Jan. 1967, 25–33

Tremblay, M., Chan, J., Chaudhry, S., Conigliaro, A., and Tse, S., “The MAJC Architecture: A Synthesis of Parallelism and Scalability,” IEEE Micro, 20, 6, Nov. 2000, 12–25

Tremblay, M. and O'Connor, J., “UltraSparc I: A Four-issue Processor Supporting Multimedia,” IEEE Micro, 16, 2, Apr. 1996, 42–50

Tucker, L. and Robertson, G., “Architecture and Applications of the Connection Machine,” IEEE Computer, 21, 8, Aug. 1988, 26–38

Tullsen, D., Eggers, S., and Levy, H., “Simultaneous Multithreading: Maximizing On-chip Parallelism,” Proc. 22nd Int. Symp. on Computer Architecture, 1995, 392–403

Tune, E., Liang, D., Tullsen, D., and Calder, B., “Dynamic Prediction of Critical Path Instructions,” Proc. 7th Int. Symp. on High-Performance Computer Architecture, 2001, 185–195

Uhlig, R. and Mudge, T., “Trace-driven Memory Simulation: A Survey,” ACM Computing Surveys, 29, 2, Jun. 1997, 128–170

Vanderwiel, S. and Lilja, D., “Data Prefetch Mechanisms,” ACM Computing Surveys, 32, 2, Jun. 2000, 174–199

VanVleet, P., Anderson, E., Brown, L., Baer, J.-L., and Karlin, A., “Pursuing the Performance Potential of Dynamic Cache Lines,” Proc. ICCD, Oct. 1999, 528–537

Venkatachalam, V. and Franz, M., “Power Reduction Techniques for Microprocessor Systems,” ACM Computing Surveys, 37, 3, Sep. 2005, 195–237

Weicker, R., “Dhrystone: A Synthetic Systems Programming Benchmark,” Communications of the ACM, 27, Oct. 1984, 1013–1030

Weiser, M., Welch, B., Demers, A., and Shenker, S., “Scheduling for Reduced CPU Energy,” Proc. 1st USENIX Symp. on Operating Systems Design and Implementation, 1994, 13–23

Weschler, O., “Inside Intel Core Microarchitecture,” Intel White Paper, 2006, http://download.intel.com/technology/architecture/new_architecture_06.pdf

Wilkes, M., “Slave Memories and Dynamic Storage Allocation,” IEEE Trans. on Electronic Computers, EC-14, Apr. 1965, 270–271

Wong, W. and Baer, J.-L., “Modified LRU Policies for Improving Second-Level Cache Behavior,” Proc. 6th Int. Symp. on High-Performance Computer Architecture, 2000, 49–60

Yeager, K., “The MIPS R10000 Superscalar Microprocessor,” IEEE Micro, 16, 2, Apr. 1996, 28–41

Yeh, T.-Y. and Patt, Y., “Alternative Implementations of Two-Level Adaptive Branch Prediction,” Proc. 19th Int. Symp. on Computer Architecture, 1992, 124–134

Yeh, T.-Y. and Patt, Y., “A Comprehensive Instruction Fetch Mechanism for a Processor Supporting Speculative Execution,” Proc. 25th Ann. Symp. on Microarchitecture, 1992, 129–139

Yoaz, A., Erez, M., Ronen, R., and Jourdan, S., “Speculation Techniques for Improving Load Related Instruction Scheduling,” Proc. 26th Int. Symp. on Computer Architecture, 1999, 42–53

Zhang, Z., Zhu, Z., and Zhang, X., “A Permutation-based Page Interleaving Scheme to Reduce Row-buffer Conflicts and Exploit Data Locality,” Proc. 33rd Int. Symp. on Microarchitecture, 2000, 32–41
