ABSTRACT
For loop accelerators such as coarse-grained reconfigurable architectures (CGRAs) and GP-GPUs, nested loops represent an important source of parallelism. Existing solutions to mapping nested loops on CGRAs, however, are either designed for perfectly nested loops only, or expensive and inflexible. Efficient CGRA mapping of imperfect loops with arbitrary nesting depth still remains a challenge. In this paper we propose a compiler-hardware co-operative approach that is flexible and yet able to generate efficient mappings for imperfect nested loops. It is based on loop flattening, but to mitigate the negative impact of flattening we combine loop fission and a light-weight architecture extension that is designed to accelerate common operation patterns appearing frequently in flattened loops. Our experimental results using imperfect loops from multimedia and DSP domains demonstrate that our special operations can cover a large portion of nested loop operations, improve performance of nested loops by nearly 30% over using loop flattening only, and achieve near-ideal executions on CGRAs for imperfect loops.
- T. Austin, E. Larson, and D. Ernst. SimpleScalar: an infrastructure for computer system modeling. Computer, 35, 2002. Google ScholarDigital Library
- K. Bondalapati. Parallelizing dsp nested loops on reconfigurable architectures using data context switching. In Proc. DAC, pages 273--276, 2001. Google ScholarDigital Library
- Liang Chen and T. Mitra. Graph minor approach for application mapping on cgras. In Proc. FPT, pages 285--292, 2012.Google ScholarCross Ref
- Grigorios Dimitroulakos, Stavros Georgiopoulos, Michalis D. Galanis, and Costas E. Goutis. Resource aware mapping on coarse grained reconfigurable arrays. Microprocessors and Microsystems, 33(2):91--105, 2009. Google ScholarDigital Library
- S. Friedman et al. SPR: An architecture-adaptive CGRA mapping tool. In Proc. FPGA, pages 191--200. ACM, 2009. Google ScholarDigital Library
- A. Ghuloum et al. Flattening and parallelizing irregular, recurrent loop nests. In Proc. PPOPP 95, 1995. Google ScholarDigital Library
- Hiroyuki Hamasaki et al. Soc for car navigation system with a 55.3gops image recognition engine. In Proc. ASP-DAC, pages 464--465. IEEE Press, 2010. Google ScholarDigital Library
- M. Hamzeh et al. Regimap: Register-aware application mapping on coarse-grained reconfigurable architectures (cgras). In Proc. DAC, pages 18:1--18:10. ACM, 2013. Google ScholarDigital Library
- A. Kejariwal et al. Enhanced loop coalescing: A compiler technique for transforming non-uniform iteration spaces. In ISHPC, pages 17--32, 2005. Google ScholarDigital Library
- Yongjoo Kim et al. Improving performance of nested loops on reconfigurable array processors. ACM Trans. Archit. Code Optim., 8(4):32:1--32:23, January 2012. Google ScholarDigital Library
- D. Kuck, R. Kuhn, B. Leasure, and M. Wolfe. The structure of an advanced vectorizer for pipelined processors. In Proc. IEEE 4th International Computer Software and Applications Conference, 1980.Google Scholar
- C. Lattner and V. Adve. LLVM: a compilation framework for lifelong program analysis transformation. In Proc. CGO, pages 75--86, 2004. Google ScholarDigital Library
- Jaedon Lee, Youngsam Shin, Won-Jong Lee, Soojung Ryu, and Jeongwook Kim. Real-time ray tracing on coarse-grained reconfigurable processor. In Field-Programmable Technology (FPT), pages 192--197, Dec 2013.Google ScholarCross Ref
- Jongeun Lee et al. Fast shared on-chip memory architecture for efficient hybrid computing with CGRAs. In Proc. DATE, March 2013. Google ScholarDigital Library
- D. Liu et al. Polyhedral model based mapping optimization of loop nests for cgras. In Proc. DAC, pages 19:1--19:8. ACM, 2013. Google ScholarDigital Library
- B. Mei et al. Dresc: a retargetable compiler for coarse-grained reconfigurable architectures. In Proc. FPT, pages 166--173, 2002.Google Scholar
- A. Morvan et al. Polyhedral bubble insertion: A method to improve nested loop pipelining for high-level synthesis. IEEE Trans. CAD, 32(3):339--352, 2013.Google ScholarDigital Library
- David A. Padua and Michael J. Wolfe. Advanced compiler optimizations for supercomputers. Commun. ACM, 29(12):1184--1201, December 1986. Google ScholarDigital Library
- Hyunchul Park et al. Edge-centric modulo scheduling for coarse-grained reconfigurable architectures. In Proc. PACT, pages 166--176, 2008. Google ScholarDigital Library
- H. Rong, Zhizhong Tang, R. Govindarajan, A. Douillet, and G. R. Gao. Single-dimension software pipelining for multi-dimensional loops. In Proc. Code Generation and Optimization, pages 163--174, 2004. Google ScholarDigital Library
- H. Singh et al. Morphosys: an integrated reconfigurable system for data-parallel and computation-intensive applications. IEEE Transactions on Computers, 49:465--481, 2000. Google ScholarDigital Library
- W. Thies et al. Streamit: A language for streaming applications. In R. Horspool, editor, Compiler Construction, volume 2304 of LNCS, pages 179--196. Springer, 2002. Google ScholarDigital Library
- K. Turkington et al. Outer loop pipelining for application specific datapaths in fpgas. IEEE Trans. VLSI, 16(10):1268--1280, 2008. Google ScholarDigital Library
- B. Ylvisaker et al. Macah: A "c-level" language for programming kernels on coprocessor accelerators. In Poster at Languages, Compilers and Tools for Embedded Systems (LCTES), June 2007.Google Scholar
- Wei Zuo et al. Improving high level synthesis optimization opportunity through polyhedral transformations. In Proc. FPGA, pages 9--18. ACM, 2013. Google ScholarDigital Library
Index Terms
- Flattening-based mapping of imperfect loop nests for CGRAs
Recommendations
Polyhedral model based mapping optimization of loop nests for CGRAs
DAC '13: Proceedings of the 50th Annual Design Automation ConferenceThe coarse-grained reconfigurable architecture (CGRA) is a promising platform that provides both high performance and high power-efficiency. The compute-intensive portions of an application (e.g. loops) are often mapped onto CGRA for acceleration. To ...
Flattening and parallelizing irregular, recurrent loop nests
PPOPP '95: Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programmingIrregular loop nests in which the loop bounds are determined dynamically by indexed arrays are difficult to compile into expressive parallel constructs, such as segmented scans and reductions. In this paper, we describe a suite of transformations to ...
Joint affine transformation and loop pipelining for mapping nested loop on CGRAs
DATE '15: Proceedings of the 2015 Design, Automation & Test in Europe Conference & ExhibitionCoarse-Grained Reconfigurable Architectures (CGRAs) are the promising architectures with high performance, high power- efficiency and attractions of flexibility. The computation-intensive portions of application, i.e. loops, are often implemented on ...
Comments