Abstract
This paper evaluates the Raw microprocessor. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance in the face of increasing wire delays. Raw approaches this challenge by implementing plenty of on-chip resources - including logic, wires, and pins - in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Raw supports both ILP and streams by routing operands between architecturally-exposed functional units over a point-to-point scalar operand network. This network offers low latency for scalar data transport. Raw manages the effect of wire delays by exposing the interconnect and using software to orchestrate both scalar and stream data transport.
We have implemented a prototype Raw microprocessor in IBM's 180 nm, 6-layer copper, CMOS 7SF standard-cell ASIC process. We have also implemented ILP and stream compilers. Our evaluation attempts to determine the extent to which Raw succeeds in meeting its goal of serving as a more versatile, general-purpose processor. Central to achieving this goal is Raw's ability to exploit all forms of parallelism, including ILP, DLP, TLP, and stream parallelism. Specifically, we evaluate the performance of Raw on a diverse set of codes including traditional sequential programs, streaming applications, server workloads, and bit-level embedded computation. Our experimental methodology makes use of a cycle-accurate simulator validated against our real hardware.
Compared to a 180 nm Pentium-III, using commodity PC memory system components, Raw performs within a factor of 2x for sequential applications with a very low degree of ILP, about 2x to 9x better for higher levels of ILP, and 10x-100x better when highly parallel applications are coded in a stream language or optimized by hand. The paper also proposes a new versatility metric and uses it to discuss the generality of Raw.
Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams
ISCA '04: Proceedings of the 31st Annual International Symposium on Computer Architecture