research-article

An instruction-scheduling-aware data partitioning technique for coarse-grained reconfigurable architectures

Authors:
Choonki Jang (Seoul National University, Seoul, South Korea)
Jungwon Kim (Seoul National University, Seoul, South Korea)
Jaejin Lee (Seoul National University, Seoul, South Korea)
Hee-Seok Kim (University of Illinois at Urbana-Champaign, Urbana, IL, USA)
Dong-Hoon Yoo (Samsung Electronics, Giheung, South Korea)
Sukjin Kim (Samsung Electronics, Giheung, South Korea)
Hong-Seok Kim (Microsoft Corporation, Redmond, WA, USA)
Soojung Ryu (Samsung Electronics, Giheung, South Korea)

LCTES '11: Proceedings of the 2011 SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems, April 2011, Pages 151–160. https://doi.org/10.1145/1967677.1967699

Published: 11 April 2011

ABSTRACT

In this paper, we propose a data partitioning technique for a memory subsystem that consists of a multi-ported scratchpad memory (SPM) unit and a single-ported data cache in a coarse-grained reconfigurable array (CGRA) architecture. The embedded reconfigurable processor executes programs by switching between the Non-VLIW and VLIW modes, depending on the type of code region, to achieve high performance. The VLIW mode exploits code regions with high ILP, which require high memory bandwidth, and the Non-VLIW mode exploits those with low ILP, which require low memory latency. Our technique for partitioning data between the SPM and the data cache is based on data interference graph reduction and profiling information. Given an SPM size, it finds the optimal data partitions by taking the VLIW instruction schedule into consideration. We evaluate our data partitioning technique for the CGRA architecture with three representative multimedia applications.
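
To make the partitioning problem concrete, the following is a minimal sketch of a much simpler, profile-driven SPM allocation. It is not the technique proposed in the paper: it ignores the data interference graph and the VLIW instruction schedule, and simply treats placement as a 0/1 knapsack over profiled access counts. The function name `partition_data`, the object names, and the sizes and counts are all hypothetical.

```python
from typing import Dict, List, Tuple


def partition_data(
    objects: Dict[str, Tuple[int, int]], spm_size: int
) -> Tuple[List[str], List[str]]:
    """Split data objects between the SPM and the data cache.

    `objects` maps a (hypothetical) object name to a pair
    (size_in_bytes, profiled_access_count). The function picks the set
    of objects that fits in `spm_size` bytes and maximises the number of
    profiled accesses served by the SPM (a 0/1 knapsack). Everything
    else is left to the data cache.

    Assumption: this illustrates only the capacity constraint; it does
    not model port conflicts or the instruction schedule.
    """
    # best[c] = (accesses served by SPM, names chosen) for capacity c
    best: List[Tuple[int, Tuple[str, ...]]] = [(0, ())] * (spm_size + 1)
    for name, (size, accesses) in objects.items():
        # Iterate capacities downwards so each object is used at most once.
        for cap in range(spm_size, size - 1, -1):
            candidate = best[cap - size][0] + accesses
            if candidate > best[cap][0]:
                best[cap] = (candidate, best[cap - size][1] + (name,))
    in_spm = set(best[spm_size][1])
    # Preserve the input order in both result lists.
    spm = [n for n in objects if n in in_spm]
    cache = [n for n in objects if n not in in_spm]
    return spm, cache


if __name__ == "__main__":
    profile = {"coef": (4, 100), "frame": (6, 90), "table": (3, 80)}
    spm, cache = partition_data(profile, spm_size=7)
    print("SPM:", spm)
    print("cache:", cache)
```

A schedule-aware approach such as the one described in the abstract would additionally account for which accesses are issued in the same cycle, since the SPM is multi-ported and the data cache is single-ported; the sketch above deliberately leaves that out.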

Index Terms

Software and its engineering → Software notations and tools → Compilers → Source code generation

Published in
LCTES '11: Proceedings of the 2011 SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
April 2011
182 pages
ISBN: 9781450305556
DOI: 10.1145/1967677
General Chair: Jan Vitek (Purdue University, USA)
Program Chair: Bjorn De Sutter (Ghent University, Belgium)

ACM SIGPLAN Notices, Volume 46, Issue 5
LCTES '10
May 2011
170 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2016603
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
coarse grained reconfigurable arrays
compilers
data partitioning
instruction scheduling
VLIW
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 116 of 438 submissions, 26%
Article Metrics
- Total Citations: 11
- Total Downloads: 499
- Downloads (Last 12 months): 4
- Downloads (Last 6 weeks): 0