Abstract
Nowadays, multimedia systems deal with huge amounts of memory accesses and large memory footprints. To alleviate the impact of these accesses and reduce the memory footprint, high-level memory exploration and optimization techniques have been proposed. These techniques try to more efficiently utilize the memory hierarchy. An important step in these optimization techniques are loop transformations (LT). They have a crucial effect on later data memory footprint optimization steps and code generation. However, the state-of-the-art work has focused only on individual objectives. The main one in literature involves improving the locality of data accesses, and thus reducing the data memory footprint. It does not consider the trade-offs in the LT step in relation to successive optimization steps. Therefore, it is not globally efficient in mapping the application on the target platform.
In this article we will discuss several trade-offs during the loop transformations. To our knowledge, we are the first ones considering these global trade-offs. Previous work always gave mostly one solution, having the best locality and thus the optimized memory footprint, even though some research in two-dimensional trade-offs in this area exists as well. We start from this state-of-the-art solution with minimal footprint. We show that by sacrificing the footprint, we can obtain gains in data reuse (crucial for energy reduction) and reduce the control-flow complexity. We demonstrate our approach on a real-life application, namely the QSDPCM video coder. At the end, we show that considering trade-offs for this application leads to 16% energy reduction in a two-layer memory subsystem and 10% cycle reduction on the ARM platform.
- Absar, J., Catthoor, F., and Das, K. 2003. Call-instance based function inlining for increasing data access related optimisation opportunities. Tech. rep., IMEC, Leuven, Belgium.Google Scholar
- Allen, R. and Kennedy, K. 1987. Automatic translation of fortran programs to vector form. ACM Trans. Program. Lang. Syst. 9, 4, 491--542. Google ScholarDigital Library
- Ancourt, C., Barthou, D., Guettier, C., Irigoin, F., Jeannet, B., Jourdan, J., and Mattioli, J. 1997. Automatic data mapping of signal processing applications. In Proceedings of the IEEE International Conference on Application-Specific Processors, 350--362. Google ScholarDigital Library
- Banakar, R., Steinke, S., Lee, B.-S., Balakrishnan, M., and Marwedel, P. 2002. Scratchpad memory: A design alternative for cache on-chip memory in embedded systems. In Proceedings of the International Symposium on Hardware/Software Codesign (CODES), 73--78. Google ScholarDigital Library
- Banerjee, U., Eigenmann, R., Nicolau, A., and Padua, D. 1993. Automatic program parallelization. Proc. IEEE 81, 2, 211--243.Google ScholarCross Ref
- Bastoul, C. 2004. Code generation in the polyhedral model is easier than you think. In IEEE International Conference on Parallel Architectureand Compilation Techniques (PACT'13), 7--16. Google ScholarDigital Library
- Bastoul, C., Cohen, A., Girbal, S., Sharma, S., and Temam, O. 2003. Putting polyhedral loop transformations to work. In Proceedings of the International Workshop on Languages and Compilers for Parallel Computers (LCPC'16). Lecture Notes in Computer Science, vol. 2958, 209--225.Google ScholarCross Ref
- Brockmeyer, E., Miranda, M., Catthoor, F., and Corporaal, H. 2003. Layer assignment techniques for low energy in multi-layered memory organisations. In Proceedings of the 6th ACM/IEEE Conference on Design and Test in Europe, 1070--1075. Google ScholarDigital Library
- Carr, S., McKinley, K. S., and Tseng, C.-W. 1994. Compiler optimizations for improving data locality. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, 252--262. Google ScholarDigital Library
- Catthoor, F. 2005. Meta model for human interaction and decision-making: meta-concepts and balloonist vs road-builder metaphore. slide set.Google Scholar
- Catthoor, F., Danckaert, K., Kulkarni, C., Brockmeyer, E., Kjeldsberg, P. G., Van Achteren, T., and Omnes, T. 2002. Data Access and Storage Management for Embedded Programmable Processors. Kluwer Amsterdam.Google Scholar
- Cierniak, M. and Li, W. 1995. Unifying data and control transformations for distributed shared-memory machines. In Proceedings of the Conference on Programming Language Design and Implementation (SIGPLAN'95), ACM, New York, 205--217. Google ScholarDigital Library
- Clauss, P. and Loechner, V. 1998. Parametric analysis of polyhedral iteration spaces. J. VLSI Signal Processing 19, 179--194. Google ScholarDigital Library
- Danckaert, K., Catthoor, F., and De Man, H. 2000. A preprocessing step for global loop transformations for data transfer and storage optimization. In Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems, 34--40. Google ScholarDigital Library
- Darte, A. and Robert, Y. 1995. Affine-by-statement scheduling of uniform and affine loop nests over parametric domains. J. Parall. Distrib. Comput. 29, 1, 43--59. Google ScholarDigital Library
- De Greef, E., Catthoor, F., and De Man, H. 1997. Memory size reduction through storage order optimization for embedded parallel multimedia applications. J. Parall. Comput. 23, 12. (Special issue on Parallel Processing and Multimedia) Google ScholarDigital Library
- De Man, H., Catthoor, F., Goossens, G., Vanhoof, J., Van Meerbergen, J., Note, S., and Huisken, J. 1990. Architecture-driven synthesis techniques for vlsi implementation of dsp algorithms. Proc. IEEE, 78, 2. (Special issue on The Future of Computer-Aided Design) 319--335.Google ScholarCross Ref
- Dezan, C., Le Verge, H., Quinton, P., and Saouter, Y. 1992. The Alpha du CENTAUR experiment. In Algorithms and parallel VLSI architectures II, P. Quinton and Y. Robert (eds.), Elsevier, Amsterdam, 325--334. Google ScholarDigital Library
- Falk, H. and Marwedel, P. 2003. Control flow optimization by loop nest splitting at the source code level. In Proceedingsof the 6th ACM/IEEE Conference on Design and Test in Europe, 410--415. Google ScholarDigital Library
- Feautrier, P. 1992. Some efficient solutions to the affine scheduling problems. Int. J. Parall. Program. 21, 5, 389--420. Google ScholarDigital Library
- Fraboulet, A., Huard, G., and Mignotte, A. 1999. Loop alignment for memory access optimisation. In Proceedings of the 12th ACM/IEEE International Symposium on System-Level Synthesis (ISSS). 71--70. Google ScholarDigital Library
- Franke, B. and O'Boyle, M. 2003. Array recovery and high-level transformations for dsp applications. ACM Trans. Embed. Comput. Syst. 2, 2, 132--162. Google ScholarDigital Library
- Hu, Q. 2007. Hierarchical memory size estimation for loop transformation and data memory platform exploration. Ph.D. thesis, Norway University of Science and Technology.Google Scholar
- Kandemir, M., Ramanujam, J., Choudhary, A., and Banerjee, P. 2001. A layout-conscious iteration space transformation technique. IEEE Trans. Computers 50, 12, 1321--1335. Google ScholarDigital Library
- Kelly, W. and Pugh, W. 1993. A framework for unifying reordering transformations. Tech. rep. CS-TR-3193, Department of Computer Science, Univ. of Maryland, College Park MD.Google Scholar
- Kjeldsberg, P. G. 2001. Storage requirement estimation and optimization for data intensive applications. Ph.D. thesis, Norwegian University of Science and Technology.Google Scholar
- Li, W. and Pingali, K. 1992. A singular loop transformation framework based on non-singular matrices. In Proceedings of the 5th Annual Workshop on Languages and Compilers for Parallelism. Google ScholarDigital Library
- Manjiakian, N. and Abdelrahman, T. 1995. Fusion of loops for parallelism and locality. Tech. rep. CSRI-315, Computer Systems Reserch Institute, University of Toronto.Google Scholar
- Meng, T. H., Gordon, B., Tsern, E., and Hung, A. 1995. Portable video-on-demand in wireless communication. Proc. IEEE 83, 4 (Special Issue on Low Power Electronics). 659--680.Google ScholarCross Ref
- Olsen, R. and Gao, G. 1992. Collective analysis and transformation of loop clusters. Tech. rep. ACAPS Technical Memo 24, McGill University.Google Scholar
- Palkovic, M., Corporaal, H., and Catthoor, F. 2006. Dealing with data dependent conditions to enable general global source code transformations. Int. J. Embed. Syst. To appear.Google Scholar
- Polychronopoulos, C. 1988. Compiler optimizations for enhancing parallelism and their impact on the architecture design. IEEE Trans. Computers 37, 8, 991--1004. Google ScholarDigital Library
- Pugh, W. 1992. The omega test: A fast and practical integer programming algorithm for dependence analysis. Comm. ACM 35, 8.Google ScholarDigital Library
- Qin, W. and Malik, S. 2003. Flexible and formal modeling of microprocessors with application to retargetable simulation. In Proceedings of the Conference on Design Automation and Test in Europe (DATE 03). 556--561. Google ScholarDigital Library
- Quillere, F., Rajopadhye, S., and Wilde, D. 2000. Generation of efficient nested loops from polyhedra. Int. J. Parall. Program. 28, 5. Google ScholarDigital Library
- Sarkar, V. and Gao, G. R. 1991. Optimization of array accesses by collective loop transformations. In Proceedings of the 5th International Conference on Supercomputing. 194--205. Google ScholarDigital Library
- Schuster, T., Bougard, B., Raghavan, P., Priewasser, R., Novo, D., der Perre, L. V., and Catthoor, F. 2007. Design of a low power pre-synchronization asip for multimode sdr terminals. In Proceedings of the International Symposium on Systems, Architectures, Modeling and Simulation (SAMOS). 322--332. Google ScholarDigital Library
- Semeria, L. and De Micheli, G. 1998. Spc: synthesis of pointers in c. In Proceedings of the IEEE International Conference on Computer Aided Design. 340--346. Google ScholarDigital Library
- Song, Y., Xu, R., Wang, C., and Li, Z. 2001. Data locality enhancement by memory reduction. In International Conference on Supercomputing. 50--64. Google ScholarDigital Library
- Strobach, P. 1988. Qsdpcm: A new technique in scene adaptive coding. In Proceedings of the 4th European Signal Processing Conference (EUSIPCO-88). 1141--1144.Google Scholar
- Thies, W., Vivien, F., Sheldon, J., and Amarasinghe, S. 2001. A unified framework for schedule and storage optimization. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation. 232--242. Google ScholarDigital Library
- Van Achteren, T., Deconinck, G., Catthoor, F., and Lauwereins, R. 2002. Data reuse exploration methodology for loop-dominated applications. In Proceedings of the 5th ACM/IEEE Conference on Design and Test in Europe (DATE). 428--435. Google ScholarDigital Library
- van Swaaij, M., Franssen, F., Catthoor, F., and De Man, H. 1992. Modelling data and control flow for high-level memory management. In Proceedings of the 3rd ACM/IEEE European Design Automation Conference. 8--13.Google Scholar
- Vanbroekhoven, P., Janssens, G., Bruynooghe, M., Corporaal, H., and Catthoor, F. 2003. Advanced copy propagation for arrays. In Proceedings of the SIGPLAN Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES'03). 24--33. Google ScholarDigital Library
- Vander Aa, T., Corporaal, H., Catthoor, F., and Deconinck, G. 2005. Combining data and instruction memory energy optimizations for embedded applications. In Proceedings of the 3rd workshop on Embedded Systems for Real-Time Multimedia. 121--126.Google Scholar
- Verdoolaege, S., Catthoor, F., Bruynooghe, M., and Janssens, G. 2003. Multidimensional incremental loop fusion for data locality. In Proceedings of the IEEE 14th International Conference on Application-specific Systems, Architectures and Processors.Google Scholar
- Wilde, D. 1993. A library for doing polyhedral operations. M.S. thesis, Oregon State University.Google Scholar
- Wolf, M. and Lam, M. 1991. A data locality optimizing algorithm. In Proceedings of the Conference on Programming Language Design and Implementation (SIGPLAN'91). 30--43. Google ScholarDigital Library
- Yang, P., Wong, C., Marchal, P., Catthoor, F., Desmet, D., Verkest, D., and Lauwereins, R. 2001. Energy-aware runtime scheduling for embedded multi-processor socs. IEEE Design Test of Comput. 18, 5. (Special issue on Application-Specific Multiprocessor Mapping), 46--58. Google ScholarDigital Library
Index Terms
- Trade-offs in loop transformations
Recommendations
DRAM Refresh Mechanisms, Penalties, and Trade-Offs
Ever-growing application data footprints demand faster main memory with larger capacity. DRAM has been the technology choice for main memory due to its low latency and high density. However, DRAM cells must be refreshed periodically to preserve their ...
Understanding the trade-offs in multi-level cell ReRAM memory design
DAC '13: Proceedings of the 50th Annual Design Automation ConferenceResistive Random Access Memory (ReRAM) is one of the most promising emerging memory technologies as a potential replacement for DRAM memory and/or NAND Flash. Multi-level cell (MLC) ReRAM, which can store multiple bits in a single ReRAM cell, can ...
Loop transformations for flash memory: cost models and performance effects
Loop optimization, made of a sequence of loop transformations, plays an important role in performance improvement in data centric applications. Programs using flash memory are no exception to this, but, under certain conditions careless applications of ...
Comments