
Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams

Published: 02 March 2004

Abstract

This paper evaluates the Raw microprocessor. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance in the face of increasing wire delays. Raw approaches this challenge by implementing plenty of on-chip resources - including logic, wires, and pins - in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Raw supports both ILP and streams by routing operands between architecturally-exposed functional units over a point-to-point scalar operand network. This network offers low latency for scalar data transport. Raw manages the effect of wire delays by exposing the interconnect and using software to orchestrate both scalar and stream data transport.

We have implemented a prototype Raw microprocessor in IBM's 180 nm, 6-layer copper, CMOS 7SF standard-cell ASIC process. We have also implemented ILP and stream compilers. Our evaluation attempts to determine the extent to which Raw succeeds in meeting its goal of serving as a more versatile, general-purpose processor. Central to achieving this goal is Raw's ability to exploit all forms of parallelism, including ILP, DLP, TLP, and Stream parallelism. Specifically, we evaluate the performance of Raw on a diverse set of codes including traditional sequential programs, streaming applications, server workloads and bit-level embedded computation. Our experimental methodology makes use of a cycle-accurate simulator validated against our real hardware. Compared to a 180 nm Pentium-III, using commodity PC memory system components, Raw performs within a factor of 2x for sequential applications with a very low degree of ILP, about 2x to 9x better for higher levels of ILP, and 10x-100x better when highly parallel applications are coded in a stream language or optimized by hand. The paper also proposes a new versatility metric and uses it to discuss the generality of Raw.


Published in

ACM SIGARCH Computer Architecture News, Volume 32, Issue 2
ISCA 2004
March 2004
373 pages
ISSN: 0163-5964
DOI: 10.1145/1028176

ISCA '04: Proceedings of the 31st annual international symposium on Computer architecture
June 2004
373 pages
ISBN: 0769521436

Copyright © 2004 Authors

Publisher: Association for Computing Machinery, New York, NY, United States
