skip to main content
research-article

Trade-offs in loop transformations

Published:07 April 2009Publication History
Skip Abstract Section

Abstract

Nowadays, multimedia systems deal with huge amounts of memory accesses and large memory footprints. To alleviate the impact of these accesses and reduce the memory footprint, high-level memory exploration and optimization techniques have been proposed. These techniques try to more efficiently utilize the memory hierarchy. An important step in these optimization techniques are loop transformations (LT). They have a crucial effect on later data memory footprint optimization steps and code generation. However, the state-of-the-art work has focused only on individual objectives. The main one in literature involves improving the locality of data accesses, and thus reducing the data memory footprint. It does not consider the trade-offs in the LT step in relation to successive optimization steps. Therefore, it is not globally efficient in mapping the application on the target platform.

In this article we will discuss several trade-offs during the loop transformations. To our knowledge, we are the first ones considering these global trade-offs. Previous work always gave mostly one solution, having the best locality and thus the optimized memory footprint, even though some research in two-dimensional trade-offs in this area exists as well. We start from this state-of-the-art solution with minimal footprint. We show that by sacrificing the footprint, we can obtain gains in data reuse (crucial for energy reduction) and reduce the control-flow complexity. We demonstrate our approach on a real-life application, namely the QSDPCM video coder. At the end, we show that considering trade-offs for this application leads to 16% energy reduction in a two-layer memory subsystem and 10% cycle reduction on the ARM platform.

References

  1. Absar, J., Catthoor, F., and Das, K. 2003. Call-instance based function inlining for increasing data access related optimisation opportunities. Tech. rep., IMEC, Leuven, Belgium.Google ScholarGoogle Scholar
  2. Allen, R. and Kennedy, K. 1987. Automatic translation of fortran programs to vector form. ACM Trans. Program. Lang. Syst. 9, 4, 491--542. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Ancourt, C., Barthou, D., Guettier, C., Irigoin, F., Jeannet, B., Jourdan, J., and Mattioli, J. 1997. Automatic data mapping of signal processing applications. In Proceedings of the IEEE International Conference on Application-Specific Processors, 350--362. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Banakar, R., Steinke, S., Lee, B.-S., Balakrishnan, M., and Marwedel, P. 2002. Scratchpad memory: A design alternative for cache on-chip memory in embedded systems. In Proceedings of the International Symposium on Hardware/Software Codesign (CODES), 73--78. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Banerjee, U., Eigenmann, R., Nicolau, A., and Padua, D. 1993. Automatic program parallelization. Proc. IEEE 81, 2, 211--243.Google ScholarGoogle ScholarCross RefCross Ref
  6. Bastoul, C. 2004. Code generation in the polyhedral model is easier than you think. In IEEE International Conference on Parallel Architectureand Compilation Techniques (PACT'13), 7--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Bastoul, C., Cohen, A., Girbal, S., Sharma, S., and Temam, O. 2003. Putting polyhedral loop transformations to work. In Proceedings of the International Workshop on Languages and Compilers for Parallel Computers (LCPC'16). Lecture Notes in Computer Science, vol. 2958, 209--225.Google ScholarGoogle ScholarCross RefCross Ref
  8. Brockmeyer, E., Miranda, M., Catthoor, F., and Corporaal, H. 2003. Layer assignment techniques for low energy in multi-layered memory organisations. In Proceedings of the 6th ACM/IEEE Conference on Design and Test in Europe, 1070--1075. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Carr, S., McKinley, K. S., and Tseng, C.-W. 1994. Compiler optimizations for improving data locality. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, 252--262. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Catthoor, F. 2005. Meta model for human interaction and decision-making: meta-concepts and balloonist vs road-builder metaphore. slide set.Google ScholarGoogle Scholar
  11. Catthoor, F., Danckaert, K., Kulkarni, C., Brockmeyer, E., Kjeldsberg, P. G., Van Achteren, T., and Omnes, T. 2002. Data Access and Storage Management for Embedded Programmable Processors. Kluwer Amsterdam.Google ScholarGoogle Scholar
  12. Cierniak, M. and Li, W. 1995. Unifying data and control transformations for distributed shared-memory machines. In Proceedings of the Conference on Programming Language Design and Implementation (SIGPLAN'95), ACM, New York, 205--217. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Clauss, P. and Loechner, V. 1998. Parametric analysis of polyhedral iteration spaces. J. VLSI Signal Processing 19, 179--194. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Danckaert, K., Catthoor, F., and De Man, H. 2000. A preprocessing step for global loop transformations for data transfer and storage optimization. In Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems, 34--40. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Darte, A. and Robert, Y. 1995. Affine-by-statement scheduling of uniform and affine loop nests over parametric domains. J. Parall. Distrib. Comput. 29, 1, 43--59. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. De Greef, E., Catthoor, F., and De Man, H. 1997. Memory size reduction through storage order optimization for embedded parallel multimedia applications. J. Parall. Comput. 23, 12. (Special issue on Parallel Processing and Multimedia) Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. De Man, H., Catthoor, F., Goossens, G., Vanhoof, J., Van Meerbergen, J., Note, S., and Huisken, J. 1990. Architecture-driven synthesis techniques for vlsi implementation of dsp algorithms. Proc. IEEE, 78, 2. (Special issue on The Future of Computer-Aided Design) 319--335.Google ScholarGoogle ScholarCross RefCross Ref
  18. Dezan, C., Le Verge, H., Quinton, P., and Saouter, Y. 1992. The Alpha du CENTAUR experiment. In Algorithms and parallel VLSI architectures II, P. Quinton and Y. Robert (eds.), Elsevier, Amsterdam, 325--334. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Falk, H. and Marwedel, P. 2003. Control flow optimization by loop nest splitting at the source code level. In Proceedingsof the 6th ACM/IEEE Conference on Design and Test in Europe, 410--415. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Feautrier, P. 1992. Some efficient solutions to the affine scheduling problems. Int. J. Parall. Program. 21, 5, 389--420. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Fraboulet, A., Huard, G., and Mignotte, A. 1999. Loop alignment for memory access optimisation. In Proceedings of the 12th ACM/IEEE International Symposium on System-Level Synthesis (ISSS). 71--70. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Franke, B. and O'Boyle, M. 2003. Array recovery and high-level transformations for dsp applications. ACM Trans. Embed. Comput. Syst. 2, 2, 132--162. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Hu, Q. 2007. Hierarchical memory size estimation for loop transformation and data memory platform exploration. Ph.D. thesis, Norway University of Science and Technology.Google ScholarGoogle Scholar
  24. Kandemir, M., Ramanujam, J., Choudhary, A., and Banerjee, P. 2001. A layout-conscious iteration space transformation technique. IEEE Trans. Computers 50, 12, 1321--1335. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Kelly, W. and Pugh, W. 1993. A framework for unifying reordering transformations. Tech. rep. CS-TR-3193, Department of Computer Science, Univ. of Maryland, College Park MD.Google ScholarGoogle Scholar
  26. Kjeldsberg, P. G. 2001. Storage requirement estimation and optimization for data intensive applications. Ph.D. thesis, Norwegian University of Science and Technology.Google ScholarGoogle Scholar
  27. Li, W. and Pingali, K. 1992. A singular loop transformation framework based on non-singular matrices. In Proceedings of the 5th Annual Workshop on Languages and Compilers for Parallelism. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Manjiakian, N. and Abdelrahman, T. 1995. Fusion of loops for parallelism and locality. Tech. rep. CSRI-315, Computer Systems Reserch Institute, University of Toronto.Google ScholarGoogle Scholar
  29. Meng, T. H., Gordon, B., Tsern, E., and Hung, A. 1995. Portable video-on-demand in wireless communication. Proc. IEEE 83, 4 (Special Issue on Low Power Electronics). 659--680.Google ScholarGoogle ScholarCross RefCross Ref
  30. Olsen, R. and Gao, G. 1992. Collective analysis and transformation of loop clusters. Tech. rep. ACAPS Technical Memo 24, McGill University.Google ScholarGoogle Scholar
  31. Palkovic, M., Corporaal, H., and Catthoor, F. 2006. Dealing with data dependent conditions to enable general global source code transformations. Int. J. Embed. Syst. To appear.Google ScholarGoogle Scholar
  32. Polychronopoulos, C. 1988. Compiler optimizations for enhancing parallelism and their impact on the architecture design. IEEE Trans. Computers 37, 8, 991--1004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Pugh, W. 1992. The omega test: A fast and practical integer programming algorithm for dependence analysis. Comm. ACM 35, 8.Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Qin, W. and Malik, S. 2003. Flexible and formal modeling of microprocessors with application to retargetable simulation. In Proceedings of the Conference on Design Automation and Test in Europe (DATE 03). 556--561. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Quillere, F., Rajopadhye, S., and Wilde, D. 2000. Generation of efficient nested loops from polyhedra. Int. J. Parall. Program. 28, 5. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Sarkar, V. and Gao, G. R. 1991. Optimization of array accesses by collective loop transformations. In Proceedings of the 5th International Conference on Supercomputing. 194--205. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Schuster, T., Bougard, B., Raghavan, P., Priewasser, R., Novo, D., der Perre, L. V., and Catthoor, F. 2007. Design of a low power pre-synchronization asip for multimode sdr terminals. In Proceedings of the International Symposium on Systems, Architectures, Modeling and Simulation (SAMOS). 322--332. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Semeria, L. and De Micheli, G. 1998. Spc: synthesis of pointers in c. In Proceedings of the IEEE International Conference on Computer Aided Design. 340--346. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Song, Y., Xu, R., Wang, C., and Li, Z. 2001. Data locality enhancement by memory reduction. In International Conference on Supercomputing. 50--64. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Strobach, P. 1988. Qsdpcm: A new technique in scene adaptive coding. In Proceedings of the 4th European Signal Processing Conference (EUSIPCO-88). 1141--1144.Google ScholarGoogle Scholar
  41. Thies, W., Vivien, F., Sheldon, J., and Amarasinghe, S. 2001. A unified framework for schedule and storage optimization. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation. 232--242. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Van Achteren, T., Deconinck, G., Catthoor, F., and Lauwereins, R. 2002. Data reuse exploration methodology for loop-dominated applications. In Proceedings of the 5th ACM/IEEE Conference on Design and Test in Europe (DATE). 428--435. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. van Swaaij, M., Franssen, F., Catthoor, F., and De Man, H. 1992. Modelling data and control flow for high-level memory management. In Proceedings of the 3rd ACM/IEEE European Design Automation Conference. 8--13.Google ScholarGoogle Scholar
  44. Vanbroekhoven, P., Janssens, G., Bruynooghe, M., Corporaal, H., and Catthoor, F. 2003. Advanced copy propagation for arrays. In Proceedings of the SIGPLAN Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES'03). 24--33. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Vander Aa, T., Corporaal, H., Catthoor, F., and Deconinck, G. 2005. Combining data and instruction memory energy optimizations for embedded applications. In Proceedings of the 3rd workshop on Embedded Systems for Real-Time Multimedia. 121--126.Google ScholarGoogle Scholar
  46. Verdoolaege, S., Catthoor, F., Bruynooghe, M., and Janssens, G. 2003. Multidimensional incremental loop fusion for data locality. In Proceedings of the IEEE 14th International Conference on Application-specific Systems, Architectures and Processors.Google ScholarGoogle Scholar
  47. Wilde, D. 1993. A library for doing polyhedral operations. M.S. thesis, Oregon State University.Google ScholarGoogle Scholar
  48. Wolf, M. and Lam, M. 1991. A data locality optimizing algorithm. In Proceedings of the Conference on Programming Language Design and Implementation (SIGPLAN'91). 30--43. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Yang, P., Wong, C., Marchal, P., Catthoor, F., Desmet, D., Verkest, D., and Lauwereins, R. 2001. Energy-aware runtime scheduling for embedded multi-processor socs. IEEE Design Test of Comput. 18, 5. (Special issue on Application-Specific Multiprocessor Mapping), 46--58. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Trade-offs in loop transformations

            Recommendations

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in

            Full Access

            • Published in

              cover image ACM Transactions on Design Automation of Electronic Systems
              ACM Transactions on Design Automation of Electronic Systems  Volume 14, Issue 2
              March 2009
              384 pages
              ISSN:1084-4309
              EISSN:1557-7309
              DOI:10.1145/1497561
              Issue’s Table of Contents

              Copyright © 2009 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 7 April 2009
              • Accepted: 1 December 2008
              • Revised: 1 November 2008
              • Received: 1 December 2006
              Published in todaes Volume 14, Issue 2

              Permissions

              Request permissions about this article.

              Request Permissions

              Check for updates

              Qualifiers

              • research-article
              • Research
              • Refereed

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader