ABSTRACT
Coarse-grained reconfigurable architectures (CGRAs) present an appealing hardware platform by providing the potential for high computation throughput, scalability, low cost and energy efficiency. CGRAs consist of an array of function units and register files generally organized as a two dimensional grid. The most difficult challenge with deploying CGRAs is compiler scheduling technology that can map software implementations of compute intensive loops onto the array. Traditional schedulers are not suitable because they do not take into account the explicit routing of operand values that is necessary. In essence, the problem of binding operations to time slots and resources is extended to also include explicit routing of operands from producers to consumers. To tackle this problem, this paper introduces a software pipelining technique for mapping loop bodies onto CGRAs, referred to as modulo graph embedding. We leverage graph embedding from graph theory, which is used to draw graphs onto a target space. The loop body is essentially drawn onto the CGRA mesh, subject to modulo resource usage constraints. Modulo graph embedding is effective because it can take into account the communication structure of the loop body during mapping. On average, a compute utilization of 56-68% is achieved for a set of loop kernels across three 4x4 CGRA designs.
- T. Callahan, J. Hauser, and J. Wawrzynek. The Garp architecture and C compiler. IEEE Computer, 33(4):62--69, Apr. 2000. Google ScholarDigital Library
- R. Davidson and D. Harel. Drawing graphs nicely using simulated annealing. ACM Transactions on Graphics, 15(4):301--331, 1996. Google ScholarDigital Library
- P. Eades. A heuristic for graph drawing. Congressus Numerantium, 42:149--160, 1984.Google Scholar
- C. Ebeling et al. Mapping applications to the RaPiD configurable architecture. In Proc. of the 5th IEEE Symposium on Field-Programmable Custom Computing Machines, pages 106--115, Apr. 1997. Google ScholarDigital Library
- J. Ellis. Bulldog: A Compiler for VLIW Architectures. MIT Press, Cambridge, MA, 1985. Google ScholarDigital Library
- S. Goldstein et al. PipeRench: A coprocessor for streaming multimedia acceleration. In Proc. of the 26th Annual International Symposium on Computer Architecture, pages 28--39, June 1999. Google ScholarDigital Library
- T. Kamada and S. Kawai. An algorithm for drawing general undirected graphs. Information Processing Letters, 31:7--15, 1989. Google ScholarDigital Library
- G. Krishnamurthy, E. Granston, and E. Stotzer. Affinity-based cluster assignment for unrolled loops. In Proc. of the 2002 International Conference on Supercomputing, pages 107--116, June 2002. Google ScholarDigital Library
- J. Lee, K. Choi, and N. Dutt. Compilation approach for coarse-grained reconfigurable architectures. IEEE Design & Test of Computers, 20(1):26--33, Jan. 2003. Google ScholarDigital Library
- W. Lee et al. Space-time scheduling of instruction-level parallelism on a RAW machine. In Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 46--57, Oct. 1998. Google ScholarDigital Library
- W. Lee, D. Puppin, S. Swenson, and S. Amarasinghe. Convergent scheduling. In Proc. of the 35th Annual International Symposium on Microarchitecture, pages 111--122, 2002. Google ScholarDigital Library
- W. Li and H. Kurata. A grid layout algorithm for automatic drawing of biochemical networks. Bioinformatics, 21(9):2036--2042, 2005. Google ScholarDigital Library
- G. Lu, H. Singh, M. -H. Lee, N. Bagherzadeh, F. J. Kurdahi, and E. M. C. Filho. The MorphoSys parallel reconfigurable system. In Proc. of the 5th International Euro-Par Conference, pages 727--734, 1999. Google ScholarDigital Library
- B. Mei, S. Vernalde, D. Verkest, H. D. Man, and R. Lauwereins. Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling. In Proc. of the 2003 Design, Automation and Test in Europe, pages 296--301, 2003. Google ScholarDigital Library
- E. Nystrom and A. E. Eichenberger. Effective cluster assignment for modulo scheduling. In Proc. of the 31st Annual International Symposium on Microarchitecture, pages 103--114, Dec. 1998. Google ScholarDigital Library
- B. R. Rau. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proc. of the 27th Annual International Symposium on Microarchitecture, pages 63--74, Nov. 1994. Google ScholarDigital Library
- J. Sanchez and A. Gonzalez. Modulo scheduling for a fully-distributed clustered VLIW architecture. In Proc. of the 33rd Annual International Symposium on Microarchitecture, pages 124--133, Dec. 2000. Google ScholarDigital Library
- M. B. Taylor et al. The Raw microprocessor: A computational fabric for software circuits and general purpose programs. IEEE Micro, 22(2):25--35, 2002. Google ScholarDigital Library
- G. Venkataramani, W. Najjar, F. Kurdahi, N. Bagherzadeh, and W. Bohm. A compiler framework for mapping applications to a coarse-grained reconfigurable computer architecture. pages 116--125, 2001. Google ScholarDigital Library
Index Terms
- Modulo graph embedding: mapping applications onto coarse-grained reconfigurable architectures
Recommendations
Edge-centric modulo scheduling for coarse-grained reconfigurable architectures
PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniquesCoarse-grained reconfigurable architectures (CGRAs) present an appealing hardware platform by providing the potential for high computation throughput, scalability, low cost, and energy efficiency. CGRAs consist of an array of function units and register ...
Modulo scheduling for highly customized datapaths to increase hardware reusability
CGO '08: Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimizationIn the embedded domain, custom hardware in the form of ASICs is often used to implement critical parts of applications when performance and energy efficiency goals cannot be met with software implementations on a general purpose processor or DSP. The ...
CGRA express: accelerating execution using dynamic operation fusion
CASES '09: Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systemsCoarse-grained reconfigurable architectures (CGRAs) present an appealing hardware platform by providing programmability with the potential for high computation throughput, scalability, low cost, and energy efficiency. CGRAs have been effectively used ...
Comments