Space-Time Scheduling of Instruction-Level Parallelism on a Raw Machine

ABSTRACT
Increasing demand for both greater parallelism and faster clocks dictates that future-generation architectures will need to decentralize their resources and eliminate primitives that require single-cycle global communication. A Raw microprocessor distributes all of its resources, including instruction streams, register files, memory ports, and ALUs, over a pipelined two-dimensional mesh interconnect, and exposes them fully to the compiler. Because communication in Raw machines is distributed, compiling for instruction-level parallelism (ILP) requires both spatial instruction partitioning and traditional temporal instruction scheduling. In addition, the compiler must explicitly manage all communication through the interconnect, including the global synchronization required at branch points. This paper describes RAWCC, the compiler we have developed for compiling general-purpose sequential programs to the distributed Raw architecture. We present performance results demonstrating that although Raw machines provide no mechanisms for global communication, the Raw compiler can schedule to achieve speedups that scale with the number of available functional units.
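The space-time scheduling problem the abstract describes can be illustrated with a minimal greedy sketch: each instruction in a dependence DAG is assigned both a tile (space) and a cycle (time), and a dependence that crosses tiles pays an extra network-hop latency. This is an illustrative toy, not RAWCC's actual algorithm; the one-cycle hop cost and the greedy earliest-start heuristic are assumptions for the sketch.

```python
# Toy greedy space-time list scheduler (illustrative sketch, not RAWCC).
# Each instruction gets a (tile, cycle) pair; a cross-tile dependence pays
# an assumed one-cycle mesh-hop penalty on top of the producer's latency.

HOP_LATENCY = 1  # assumed cost of communicating a value between tiles


def space_time_schedule(deps, n_tiles):
    """deps: {instr: [predecessor instrs]}. Returns {instr: (tile, cycle)}."""
    placed = {}                    # instr -> (tile, start cycle)
    tile_free = [0] * n_tiles      # next free cycle on each tile
    remaining = set(deps)
    while remaining:
        # An instruction is ready once all of its predecessors are placed.
        ready = sorted(i for i in remaining
                       if all(p in placed for p in deps[i]))
        for instr in ready:
            best = None
            for t in range(n_tiles):
                # Operands arrive one cycle after the producer issues, plus
                # a hop penalty if the producer sits on a different tile.
                arrival = max(
                    (placed[p][1] + 1 +
                     (HOP_LATENCY if placed[p][0] != t else 0)
                     for p in deps[instr]),
                    default=0,
                )
                start = max(arrival, tile_free[t])
                if best is None or start < best[1]:
                    best = (t, start)
            placed[instr] = best
            tile_free[best[0]] = best[1] + 1
            remaining.discard(instr)
    return placed


# Independent instructions a and b land on different tiles and issue in
# parallel; c consumes both and waits for the cross-tile operand.
sched = space_time_schedule({"a": [], "b": [], "c": ["a", "b"]}, n_tiles=2)
```

The greedy choice trades parallelism against communication: placing `c` on either tile still pays one hop for the remote operand, which is exactly the kind of cost a real space-time scheduler must weigh when partitioning instructions.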