research-article

Trade-offs in loop transformations

Authors:
Martin Palkovic

IMEC, Leuven, Belgium

IMEC, Leuven, Belgium
View Profile

,
Francky Catthoor

IMEC, Leuven, Belgium

IMEC, Leuven, Belgium
View Profile

,
Henk Corporaal

Technische Universiteit Eindhoven, AZ Eindhoven, The Netherlands

Technische Universiteit Eindhoven, AZ Eindhoven, The Netherlands
View Profile

ACM Transactions on Design Automation of Electronic Systems Volume 14 Issue 2Article No.: 22pp 1–30https://doi.org/10.1145/1497561.1497565

Published:07 April 2009Publication History

ACM Transactions on Design Automation of Electronic Systems

Abstract

Nowadays, multimedia systems deal with huge amounts of memory accesses and large memory footprints. To alleviate the impact of these accesses and reduce the memory footprint, high-level memory exploration and optimization techniques have been proposed. These techniques try to more efficiently utilize the memory hierarchy. An important step in these optimization techniques are loop transformations (LT). They have a crucial effect on later data memory footprint optimization steps and code generation. However, the state-of-the-art work has focused only on individual objectives. The main one in literature involves improving the locality of data accesses, and thus reducing the data memory footprint. It does not consider the trade-offs in the LT step in relation to successive optimization steps. Therefore, it is not globally efficient in mapping the application on the target platform.

In this article we will discuss several trade-offs during the loop transformations. To our knowledge, we are the first ones considering these global trade-offs. Previous work always gave mostly one solution, having the best locality and thus the optimized memory footprint, even though some research in two-dimensional trade-offs in this area exists as well. We start from this state-of-the-art solution with minimal footprint. We show that by sacrificing the footprint, we can obtain gains in data reuse (crucial for energy reduction) and reduce the control-flow complexity. We demonstrate our approach on a real-life application, namely the QSDPCM video coder. At the end, we show that considering trade-offs for this application leads to 16% energy reduction in a two-layer memory subsystem and 10% cycle reduction on the ARM platform.

References

Absar, J., Catthoor, F., and Das, K. 2003. Call-instance based function inlining for increasing data access related optimisation opportunities. Tech. rep., IMEC, Leuven, Belgium.Google Scholar
Allen, R. and Kennedy, K. 1987. Automatic translation of fortran programs to vector form. ACM Trans. Program. Lang. Syst. 9, 4, 491--542. Google ScholarDigital Library
Ancourt, C., Barthou, D., Guettier, C., Irigoin, F., Jeannet, B., Jourdan, J., and Mattioli, J. 1997. Automatic data mapping of signal processing applications. In Proceedings of the IEEE International Conference on Application-Specific Processors, 350--362. Google ScholarDigital Library
Banakar, R., Steinke, S., Lee, B.-S., Balakrishnan, M., and Marwedel, P. 2002. Scratchpad memory: A design alternative for cache on-chip memory in embedded systems. In Proceedings of the International Symposium on Hardware/Software Codesign (CODES), 73--78. Google ScholarDigital Library
Banerjee, U., Eigenmann, R., Nicolau, A., and Padua, D. 1993. Automatic program parallelization. Proc. IEEE 81, 2, 211--243.Google ScholarCross Ref
Bastoul, C. 2004. Code generation in the polyhedral model is easier than you think. In IEEE International Conference on Parallel Architectureand Compilation Techniques (PACT'13), 7--16. Google ScholarDigital Library
Bastoul, C., Cohen, A., Girbal, S., Sharma, S., and Temam, O. 2003. Putting polyhedral loop transformations to work. In Proceedings of the International Workshop on Languages and Compilers for Parallel Computers (LCPC'16). Lecture Notes in Computer Science, vol. 2958, 209--225.Google ScholarCross Ref
Brockmeyer, E., Miranda, M., Catthoor, F., and Corporaal, H. 2003. Layer assignment techniques for low energy in multi-layered memory organisations. In Proceedings of the 6th ACM/IEEE Conference on Design and Test in Europe, 1070--1075. Google ScholarDigital Library
Carr, S., McKinley, K. S., and Tseng, C.-W. 1994. Compiler optimizations for improving data locality. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, 252--262. Google ScholarDigital Library
Catthoor, F. 2005. Meta model for human interaction and decision-making: meta-concepts and balloonist vs road-builder metaphore. slide set.Google Scholar
Catthoor, F., Danckaert, K., Kulkarni, C., Brockmeyer, E., Kjeldsberg, P. G., Van Achteren, T., and Omnes, T. 2002. Data Access and Storage Management for Embedded Programmable Processors. Kluwer Amsterdam.Google Scholar
Cierniak, M. and Li, W. 1995. Unifying data and control transformations for distributed shared-memory machines. In Proceedings of the Conference on Programming Language Design and Implementation (SIGPLAN'95), ACM, New York, 205--217. Google ScholarDigital Library
Clauss, P. and Loechner, V. 1998. Parametric analysis of polyhedral iteration spaces. J. VLSI Signal Processing 19, 179--194. Google ScholarDigital Library
Danckaert, K., Catthoor, F., and De Man, H. 2000. A preprocessing step for global loop transformations for data transfer and storage optimization. In Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems, 34--40. Google ScholarDigital Library
Darte, A. and Robert, Y. 1995. Affine-by-statement scheduling of uniform and affine loop nests over parametric domains. J. Parall. Distrib. Comput. 29, 1, 43--59. Google ScholarDigital Library
De Greef, E., Catthoor, F., and De Man, H. 1997. Memory size reduction through storage order optimization for embedded parallel multimedia applications. J. Parall. Comput. 23, 12. (Special issue on Parallel Processing and Multimedia) Google ScholarDigital Library
De Man, H., Catthoor, F., Goossens, G., Vanhoof, J., Van Meerbergen, J., Note, S., and Huisken, J. 1990. Architecture-driven synthesis techniques for vlsi implementation of dsp algorithms. Proc. IEEE, 78, 2. (Special issue on The Future of Computer-Aided Design) 319--335.Google ScholarCross Ref
Dezan, C., Le Verge, H., Quinton, P., and Saouter, Y. 1992. The Alpha du CENTAUR experiment. In Algorithms and parallel VLSI architectures II, P. Quinton and Y. Robert (eds.), Elsevier, Amsterdam, 325--334. Google ScholarDigital Library
Falk, H. and Marwedel, P. 2003. Control flow optimization by loop nest splitting at the source code level. In Proceedingsof the 6th ACM/IEEE Conference on Design and Test in Europe, 410--415. Google ScholarDigital Library
Feautrier, P. 1992. Some efficient solutions to the affine scheduling problems. Int. J. Parall. Program. 21, 5, 389--420. Google ScholarDigital Library
Fraboulet, A., Huard, G., and Mignotte, A. 1999. Loop alignment for memory access optimisation. In Proceedings of the 12th ACM/IEEE International Symposium on System-Level Synthesis (ISSS). 71--70. Google ScholarDigital Library
Franke, B. and O'Boyle, M. 2003. Array recovery and high-level transformations for dsp applications. ACM Trans. Embed. Comput. Syst. 2, 2, 132--162. Google ScholarDigital Library
Hu, Q. 2007. Hierarchical memory size estimation for loop transformation and data memory platform exploration. Ph.D. thesis, Norway University of Science and Technology.Google Scholar
Kandemir, M., Ramanujam, J., Choudhary, A., and Banerjee, P. 2001. A layout-conscious iteration space transformation technique. IEEE Trans. Computers 50, 12, 1321--1335. Google ScholarDigital Library
Kelly, W. and Pugh, W. 1993. A framework for unifying reordering transformations. Tech. rep. CS-TR-3193, Department of Computer Science, Univ. of Maryland, College Park MD.Google Scholar
Kjeldsberg, P. G. 2001. Storage requirement estimation and optimization for data intensive applications. Ph.D. thesis, Norwegian University of Science and Technology.Google Scholar
Li, W. and Pingali, K. 1992. A singular loop transformation framework based on non-singular matrices. In Proceedings of the 5th Annual Workshop on Languages and Compilers for Parallelism. Google ScholarDigital Library
Manjiakian, N. and Abdelrahman, T. 1995. Fusion of loops for parallelism and locality. Tech. rep. CSRI-315, Computer Systems Reserch Institute, University of Toronto.Google Scholar
Meng, T. H., Gordon, B., Tsern, E., and Hung, A. 1995. Portable video-on-demand in wireless communication. Proc. IEEE 83, 4 (Special Issue on Low Power Electronics). 659--680.Google ScholarCross Ref
Olsen, R. and Gao, G. 1992. Collective analysis and transformation of loop clusters. Tech. rep. ACAPS Technical Memo 24, McGill University.Google Scholar
Palkovic, M., Corporaal, H., and Catthoor, F. 2006. Dealing with data dependent conditions to enable general global source code transformations. Int. J. Embed. Syst. To appear.Google Scholar
Polychronopoulos, C. 1988. Compiler optimizations for enhancing parallelism and their impact on the architecture design. IEEE Trans. Computers 37, 8, 991--1004. Google ScholarDigital Library
Pugh, W. 1992. The omega test: A fast and practical integer programming algorithm for dependence analysis. Comm. ACM 35, 8.Google ScholarDigital Library
Qin, W. and Malik, S. 2003. Flexible and formal modeling of microprocessors with application to retargetable simulation. In Proceedings of the Conference on Design Automation and Test in Europe (DATE 03). 556--561. Google ScholarDigital Library
Quillere, F., Rajopadhye, S., and Wilde, D. 2000. Generation of efficient nested loops from polyhedra. Int. J. Parall. Program. 28, 5. Google ScholarDigital Library
Sarkar, V. and Gao, G. R. 1991. Optimization of array accesses by collective loop transformations. In Proceedings of the 5th International Conference on Supercomputing. 194--205. Google ScholarDigital Library
Schuster, T., Bougard, B., Raghavan, P., Priewasser, R., Novo, D., der Perre, L. V., and Catthoor, F. 2007. Design of a low power pre-synchronization asip for multimode sdr terminals. In Proceedings of the International Symposium on Systems, Architectures, Modeling and Simulation (SAMOS). 322--332. Google ScholarDigital Library
Semeria, L. and De Micheli, G. 1998. Spc: synthesis of pointers in c. In Proceedings of the IEEE International Conference on Computer Aided Design. 340--346. Google ScholarDigital Library
Song, Y., Xu, R., Wang, C., and Li, Z. 2001. Data locality enhancement by memory reduction. In International Conference on Supercomputing. 50--64. Google ScholarDigital Library
Strobach, P. 1988. Qsdpcm: A new technique in scene adaptive coding. In Proceedings of the 4th European Signal Processing Conference (EUSIPCO-88). 1141--1144.Google Scholar
Thies, W., Vivien, F., Sheldon, J., and Amarasinghe, S. 2001. A unified framework for schedule and storage optimization. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation. 232--242. Google ScholarDigital Library
Van Achteren, T., Deconinck, G., Catthoor, F., and Lauwereins, R. 2002. Data reuse exploration methodology for loop-dominated applications. In Proceedings of the 5th ACM/IEEE Conference on Design and Test in Europe (DATE). 428--435. Google ScholarDigital Library
van Swaaij, M., Franssen, F., Catthoor, F., and De Man, H. 1992. Modelling data and control flow for high-level memory management. In Proceedings of the 3rd ACM/IEEE European Design Automation Conference. 8--13.Google Scholar
Vanbroekhoven, P., Janssens, G., Bruynooghe, M., Corporaal, H., and Catthoor, F. 2003. Advanced copy propagation for arrays. In Proceedings of the SIGPLAN Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES'03). 24--33. Google ScholarDigital Library
Vander Aa, T., Corporaal, H., Catthoor, F., and Deconinck, G. 2005. Combining data and instruction memory energy optimizations for embedded applications. In Proceedings of the 3rd workshop on Embedded Systems for Real-Time Multimedia. 121--126.Google Scholar
Verdoolaege, S., Catthoor, F., Bruynooghe, M., and Janssens, G. 2003. Multidimensional incremental loop fusion for data locality. In Proceedings of the IEEE 14th International Conference on Application-specific Systems, Architectures and Processors.Google Scholar
Wilde, D. 1993. A library for doing polyhedral operations. M.S. thesis, Oregon State University.Google Scholar
Wolf, M. and Lam, M. 1991. A data locality optimizing algorithm. In Proceedings of the Conference on Programming Language Design and Implementation (SIGPLAN'91). 30--43. Google ScholarDigital Library
Yang, P., Wong, C., Marchal, P., Catthoor, F., Desmet, D., Verkest, D., and Lauwereins, R. 2001. Energy-aware runtime scheduling for embedded multi-processor socs. IEEE Design Test of Comput. 18, 5. (Special issue on Application-Specific Multiprocessor Mapping), 46--58. Google ScholarDigital Library

Index Terms

Trade-offs in loop transformations

Recommendations

DRAM Refresh Mechanisms, Penalties, and Trade-Offs
Ever-growing application data footprints demand faster main memory with larger capacity. DRAM has been the technology choice for main memory due to its low latency and high density. However, DRAM cells must be refreshed periodically to preserve their ...
Read More
Understanding the trade-offs in multi-level cell ReRAM memory design
DAC '13: Proceedings of the 50th Annual Design Automation Conference

Resistive Random Access Memory (ReRAM) is one of the most promising emerging memory technologies as a potential replacement for DRAM memory and/or NAND Flash. Multi-level cell (MLC) ReRAM, which can store multiple bits in a single ReRAM cell, can ...
Read More
Loop transformations for flash memory: cost models and performance effects

Loop optimization, made of a sequence of loop transformations, plays an important role in performance improvement in data centric applications. Programs using flash memory are no exception to this, but, under certain conditions careless applications of ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Design Automation of Electronic Systems Volume 14, Issue 2
March 2009
384 pages
ISSN:1084-4309
EISSN:1557-7309
DOI:10.1145/1497561
Issue’s Table of Contents

Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States

Journal Family
ACM Journals for the Design of Smart and Connected Systems
Publication History
- Published: 7 April 2009
- Accepted: 1 December 2008
- Revised: 1 November 2008
- Received: 1 December 2006
Published in todaes Volume 14, Issue 2

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Data transfer and storage exploration
cost components
loop transformations
optimization
trade-offs
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 10
  Total Citations
  View Citations
- 418
  Total Downloads
- Downloads (Last 12 months)2
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Trade-offs in loop transformations

ACM Transactions on Design Automation of Electronic Systems

Abstract

References

Cited By

Index Terms

Recommendations

DRAM Refresh Mechanisms, Penalties, and Trade-Offs

Understanding the trade-offs in multi-level cell ReRAM memory design

Loop transformations for flash memory: cost models and performance effects

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Journal Family

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Trade-offs in loop transformations

ACM Transactions on Design Automation of Electronic Systems

Abstract

References

Cited By

Index Terms

Recommendations

DRAM Refresh Mechanisms, Penalties, and Trade-Offs

Understanding the trade-offs in multi-level cell ReRAM memory design

Loop transformations for flash memory: cost models and performance effects

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Journal Family

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media