Skip to main content
Erschienen in: The Journal of Supercomputing 10/2019

13.05.2019

A methodology correlating code optimizations with data memory accesses, execution time and energy consumption

verfasst von: Vasilios Kelefouras, Karim Djemame

Erschienen in: The Journal of Supercomputing | Ausgabe 10/2019

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

The advent of data proliferation and electronic devices gets low execution time and energy consumption software in the spotlight. The key to optimizing software is the correct choice, order as well as parameters of optimization transformations that has remained an open problem in compilation research for decades for various reasons. First, most of the transformations are interdependent and thus addressing them separately is not effective. Second, it is very hard to couple the transformation parameters to the processor architecture (e.g., cache size) and algorithm characteristics (e.g., data reuse); therefore, compiler designers and researchers either do not take them into account at all or do it partly. Third, the exploration space, i.e., the set of all optimization configurations that have to be explored, is huge and thus searching is impractical. In this paper, the above problems are addressed for data-dominant affine loop kernels, delivering significant contributions. A novel methodology is presented reducing the exploration space of six code optimizations by many orders of magnitude. The objective can be execution time (ET), energy consumption (E) or the number of L1, L2 and main memory accesses. The exploration space is reduced in two phases: firstly, by applying a novel register blocking algorithm and a novel loop tiling algorithm and secondly, by computing the maximum and minimum ET/E values for each optimization set. The proposed methodology has been evaluated for both embedded and general-purpose CPUs and for seven well-known algorithms, achieving high memory access, speedup and energy consumption gain values (from 1.17 up to 40) over gcc compiler, hand-written optimized code and Polly. The exploration space from which the near-optimum parameters are selected is reduced from 17 up to 30 orders of magnitude.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Fußnoten
1
This is an extension of the conference paper ‘A methodology for efficient code optimizations and memory management’, ACM International Conference on Computing Frontiers 2018.
 
Literatur
1.
Zurück zum Zitat Artés A, Ayala JL, Huisken J, Catthoor F (2013) Survey of low-energy techniques for instruction memory organisations in embedded systems. Signal Process Syst 70(1):1–19CrossRef Artés A, Ayala JL, Huisken J, Catthoor F (2013) Survey of low-energy techniques for instruction memory organisations in embedded systems. Signal Process Syst 70(1):1–19CrossRef
2.
Zurück zum Zitat Ashouri AH, Bignoli A, Palermo G, Silvano C (2016) Predictive modeling methodology for compiler phase-ordering. In: Proceedings of the 7th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and the 5th Workshop on Design Tools and Architectures For Multicore Embedded Computing Platforms, PARMA-DITAM ’16. ACM, New York, NY, USA, pp 7–12. https://doi.org/10.1145/2872421.2872424 Ashouri AH, Bignoli A, Palermo G, Silvano C (2016) Predictive modeling methodology for compiler phase-ordering. In: Proceedings of the 7th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and the 5th Workshop on Design Tools and Architectures For Multicore Embedded Computing Platforms, PARMA-DITAM ’16. ACM, New York, NY, USA, pp 7–12. https://​doi.​org/​10.​1145/​2872421.​2872424
3.
Zurück zum Zitat Ashouri AH, Bignoli A, Palermo G, Silvano C, Kulkarni S, Cavazos J (2017) Micomp: mitigating the compiler phase-ordering problem using optimization sub-sequences and machine learning. ACM Trans Archit Code Optim 14(3):29:1–29:28. https://doi.org/10.1145/3124452 CrossRef Ashouri AH, Bignoli A, Palermo G, Silvano C, Kulkarni S, Cavazos J (2017) Micomp: mitigating the compiler phase-ordering problem using optimization sub-sequences and machine learning. ACM Trans Archit Code Optim 14(3):29:1–29:28. https://​doi.​org/​10.​1145/​3124452 CrossRef
6.
Zurück zum Zitat Balaprakash P, Wild SM, Hovland PD (2013) An experimental study of global and local search algorithms in empirical performance tuning. In: High Performance Computing for Computational Science - VECPAR 2012, 10th International Conference, Revised Selected Papers, Lecture Notes in Computer Science. Springer, pp 261–269. https://doi.org/10.1007/978-3-642-38718-0_26 Balaprakash P, Wild SM, Hovland PD (2013) An experimental study of global and local search algorithms in empirical performance tuning. In: High Performance Computing for Computational Science - VECPAR 2012, 10th International Conference, Revised Selected Papers, Lecture Notes in Computer Science. Springer, pp 261–269. https://​doi.​org/​10.​1007/​978-3-642-38718-0_​26
7.
10.
Zurück zum Zitat Bondhugula U, Ramanujam J et al (2008) Pluto: a practical and fully automatic polyhedral program optimization system. In: Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation (PLDI 2008) Bondhugula U, Ramanujam J et al (2008) Pluto: a practical and fully automatic polyhedral program optimization system. In: Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation (PLDI 2008)
11.
Zurück zum Zitat Brockmeyer E, Durinck B, Corporaal H, Catthoor F (2007) Layer assignment techniques for low energy in multi-layered memory organizations. Springer, DordrechtCrossRef Brockmeyer E, Durinck B, Corporaal H, Catthoor F (2007) Layer assignment techniques for low energy in multi-layered memory organizations. Springer, DordrechtCrossRef
12.
Zurück zum Zitat Cavazos J, Fursin G, Agakov F, Bonilla E, O’Boyle MFP, Temam O (2007) Rapidly selecting good compiler optimizations using performance counters. In: Proceedings of the International Symposium on Code Generation and Optimization, CGO ’07. IEEE Computer Society, Washington, DC, USA, pp 185–197. https://doi.org/10.1109/CGO.2007.32 Cavazos J, Fursin G, Agakov F, Bonilla E, O’Boyle MFP, Temam O (2007) Rapidly selecting good compiler optimizations using performance counters. In: Proceedings of the International Symposium on Code Generation and Optimization, CGO ’07. IEEE Computer Society, Washington, DC, USA, pp 185–197. https://​doi.​org/​10.​1109/​CGO.​2007.​32
13.
Zurück zum Zitat Chen C, Chame J, Hall M (2008) Chill: a framework for composing high-level loop transformations. Technical report Chen C, Chame J, Hall M (2008) Chill: a framework for composing high-level loop transformations. Technical report
14.
Zurück zum Zitat de Mesmay F, Voronenko Y, Püschel M (2010) Offline library adaptation using automatically generated heuristics. In: International Parallel and Distributed Processing Symposium (IPDPS), pp 1–10 de Mesmay F, Voronenko Y, Püschel M (2010) Offline library adaptation using automatically generated heuristics. In: International Parallel and Distributed Processing Symposium (IPDPS), pp 1–10
15.
Zurück zum Zitat Fursin G, O’Boyle MFP, Knijnenburg PMW (2002) Evaluating iterative compilation. In: Languages and Compilers for Parallel Computing, 15th Workshop, LCPC 2002, College Park, MD, USA, July 25–27, 2002, Revised Papers, pp 362–376. https://doi.org/10.1007/11596110_24 Fursin G, O’Boyle MFP, Knijnenburg PMW (2002) Evaluating iterative compilation. In: Languages and Compilers for Parallel Computing, 15th Workshop, LCPC 2002, College Park, MD, USA, July 25–27, 2002, Revised Papers, pp 362–376. https://​doi.​org/​10.​1007/​11596110_​24
17.
Zurück zum Zitat Haneda M, Khnijnenburg PMW, Wijshoff HAG (2005) Automatic selection of compiler options using non-parametric inferential statistics. In: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, PACT ’05. IEEE Computer Society, Washington, DC, USA, pp 123–132. https://doi.org/10.1109/PACT.2005.9 Haneda M, Khnijnenburg PMW, Wijshoff HAG (2005) Automatic selection of compiler options using non-parametric inferential statistics. In: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, PACT ’05. IEEE Computer Society, Washington, DC, USA, pp 123–132. https://​doi.​org/​10.​1109/​PACT.​2005.​9
18.
Zurück zum Zitat Hartono A, Norris B, Sadayappan P (2009) Annotation-based empirical performance tuning using Orio. In: IEEE International Symposium on Parallel & Distributed Processing. IEEE, pp 1–11 Hartono A, Norris B, Sadayappan P (2009) Annotation-based empirical performance tuning using Orio. In: IEEE International Symposium on Parallel & Distributed Processing. IEEE, pp 1–11
20.
Zurück zum Zitat Kandemir M, Muralidhara SP, Narayanan SHK, Zhang Y, Ozturk O (2009) Optimizing shared cache behavior of chip multiprocessors. In: MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, New York, NY, USA, pp 505–516. https://doi.org/10.1145/1669112.1669176 Kandemir M, Muralidhara SP, Narayanan SHK, Zhang Y, Ozturk O (2009) Optimizing shared cache behavior of chip multiprocessors. In: MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, New York, NY, USA, pp 505–516. https://​doi.​org/​10.​1145/​1669112.​1669176
22.
Zurück zum Zitat Kim D, Renganarayanan L, Rostron D, Rajopadhye S, Strout MM (2007) Multi-level tiling: M for the price of one. In: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC ’07. ACM, New York, NY, USA, pp 51:1–51:12. https://doi.org/10.1145/1362622.1362691 Kim D, Renganarayanan L, Rostron D, Rajopadhye S, Strout MM (2007) Multi-level tiling: M for the price of one. In: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC ’07. ACM, New York, NY, USA, pp 51:1–51:12. https://​doi.​org/​10.​1145/​1362622.​1362691
23.
Zurück zum Zitat Knijnenburg PMW, Kisuki T, Gallivan K, O’Boyle MFP (2004) The effect of cache models on iterative compilation for combined tiling and unrolling. j-CCPE 16(2–3):247–270 Knijnenburg PMW, Kisuki T, Gallivan K, O’Boyle MFP (2004) The effect of cache models on iterative compilation for combined tiling and unrolling. j-CCPE 16(2–3):247–270
24.
Zurück zum Zitat Krzikalla O, Feldhoff K, Müller-Pfefferkorn R, Nagel WE (2012) Scout: a source-to-source transformator for simd-optimizations. In: Proceedings of the 2011 International Conference on Parallel Processing—vol 2, Euro-Par’11. Springer, pp 137–145. https://doi.org/10.1007/978-3-642-29740-3_17 Krzikalla O, Feldhoff K, Müller-Pfefferkorn R, Nagel WE (2012) Scout: a source-to-source transformator for simd-optimizations. In: Proceedings of the 2011 International Conference on Parallel Processing—vol 2, Euro-Par’11. Springer, pp 137–145. https://​doi.​org/​10.​1007/​978-3-642-29740-3_​17
28.
Zurück zum Zitat Leather H, Bonilla E, O’Boyle M (2009) Automatic feature generation for machine learning based optimizing compilation. In: Proceedings of the 7th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’09. IEEE Computer Society, Washington, DC, USA, pp 81–91. https://doi.org/10.1109/CGO.2009.21 Leather H, Bonilla E, O’Boyle M (2009) Automatic feature generation for machine learning based optimizing compilation. In: Proceedings of the 7th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’09. IEEE Computer Society, Washington, DC, USA, pp 81–91. https://​doi.​org/​10.​1109/​CGO.​2009.​21
30.
Zurück zum Zitat Li S, Ahn JH, Strong RD, Brockman JB, Tullsen DM, Jouppi NP (2009) Mcpat: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42. ACM, New York, NY, USA, pp 469–480. https://doi.org/10.1145/1669112.1669172 Li S, Ahn JH, Strong RD, Brockman JB, Tullsen DM, Jouppi NP (2009) Mcpat: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42. ACM, New York, NY, USA, pp 469–480. https://​doi.​org/​10.​1145/​1669112.​1669172
31.
Zurück zum Zitat Lidman J, Quinlan DJ, Liao C, McKee SA (2012) Rose:: Fttransform-a source-to-source translation framework for exascale fault-tolerance research. In: 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W). IEEE, pp 1–6 Lidman J, Quinlan DJ, Liao C, McKee SA (2012) Rose:: Fttransform-a source-to-source translation framework for exascale fault-tolerance research. In: 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W). IEEE, pp 1–6
34.
Zurück zum Zitat Nobre R, Martins LGA, Cardoso JaMP (2015) Use of previously acquired positioning of optimizations for phase ordering exploration. In: Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems, SCOPES ’15, pp 58–67. https://doi.org/10.1145/2764967.2764978 Nobre R, Martins LGA, Cardoso JaMP (2015) Use of previously acquired positioning of optimizations for phase ordering exploration. In: Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems, SCOPES ’15, pp 58–67. https://​doi.​org/​10.​1145/​2764967.​2764978
35.
Zurück zum Zitat Nobre R, Reis L, Cardoso JMP (2016) Compiler phase ordering as an orthogonal approach for reducing energy consumption. In: Proceedings of the 19th Workshop on Compilers for Parallel Computing (CPC’16) Nobre R, Reis L, Cardoso JMP (2016) Compiler phase ordering as an orthogonal approach for reducing energy consumption. In: Proceedings of the 19th Workshop on Compilers for Parallel Computing (CPC’16)
38.
Zurück zum Zitat Park E, Kulkarni S, Cavazos J (2011) An evaluation of different modeling techniques for iterative compilation. In: Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems, CASES ’11. ACM, New York, NY, USA, pp 65–74. https://doi.org/10.1145/2038698.2038711 Park E, Kulkarni S, Cavazos J (2011) An evaluation of different modeling techniques for iterative compilation. In: Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems, CASES ’11. ACM, New York, NY, USA, pp 65–74. https://​doi.​org/​10.​1145/​2038698.​2038711
41.
Zurück zum Zitat Pouchet LN, Bastoul C, Cohen A, Vasilache N (2007) Iterative optimization in the polyhedral model: part I, one-dimensional time. In: Proceedings of the International Symposium on Code Generation and Optimization, CGO ’07. IEEE Computer Society, Washington, DC, USA, pp 144–156. https://doi.org/10.1109/CGO.2007.21 Pouchet LN, Bastoul C, Cohen A, Vasilache N (2007) Iterative optimization in the polyhedral model: part I, one-dimensional time. In: Proceedings of the International Symposium on Code Generation and Optimization, CGO ’07. IEEE Computer Society, Washington, DC, USA, pp 144–156. https://​doi.​org/​10.​1109/​CGO.​2007.​21
50.
Zurück zum Zitat Stephenson M, Amarasinghe S (2005) Predicting unroll factors using supervised classification. In: Proceedings of the International Symposium on Code Generation and Optimization, CGO ’05. IEEE Computer Society, Washington, DC, USA, pp 123–134. https://doi.org/10.1109/CGO.2005.29 Stephenson M, Amarasinghe S (2005) Predicting unroll factors using supervised classification. In: Proceedings of the International Symposium on Code Generation and Optimization, CGO ’05. IEEE Computer Society, Washington, DC, USA, pp 123–134. https://​doi.​org/​10.​1109/​CGO.​2005.​29
52.
Zurück zum Zitat Sung IJ, Stratton JA, Hwu WMW (2010) Data layout transformation exploiting memory-level parallelism in structured grid many-core applications. In: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT ’10. ACM, New York, NY, USA, pp 513–522. https://doi.org/10.1145/1854273.1854336 Sung IJ, Stratton JA, Hwu WMW (2010) Data layout transformation exploiting memory-level parallelism in structured grid many-core applications. In: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT ’10. ACM, New York, NY, USA, pp 513–522. https://​doi.​org/​10.​1145/​1854273.​1854336
53.
Zurück zum Zitat Tartara M, Crespi Reghizzi S (2012) Parallel iterative compilation: using MapReduce to speedup machine learning in compilers. In: Proceedings of Third International Workshop on Mapreduce and Its Applications Date, MapReduce ’12, pp 33–40. https://doi.org/10.1145/2287016.2287023 Tartara M, Crespi Reghizzi S (2012) Parallel iterative compilation: using MapReduce to speedup machine learning in compilers. In: Proceedings of Third International Workshop on Mapreduce and Its Applications Date, MapReduce ’12, pp 33–40. https://​doi.​org/​10.​1145/​2287016.​2287023
55.
Zurück zum Zitat Trifunovic K, Nuzman D, Cohen A, Zaks A, Rosen I (2009) Polyhedral-model guided loop-nest auto-vectorization. In: Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques, PACT ’09. IEEE Computer Society, Washington, DC, USA, pp 327–337. https://doi.org/10.1109/PACT.2009.18 Trifunovic K, Nuzman D, Cohen A, Zaks A, Rosen I (2009) Polyhedral-model guided loop-nest auto-vectorization. In: Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques, PACT ’09. IEEE Computer Society, Washington, DC, USA, pp 327–337. https://​doi.​org/​10.​1109/​PACT.​2009.​18
57.
Zurück zum Zitat Zhou X, Giacalone JP, Garzarán MJ, Kuhn RH, Ni Y, Padua D (2012) Hierarchical overlapped tiling. In: Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO ’12. ACM, New York, NY, USA, pp 207–218. https://doi.org/10.1145/2259016.2259044 Zhou X, Giacalone JP, Garzarán MJ, Kuhn RH, Ni Y, Padua D (2012) Hierarchical overlapped tiling. In: Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO ’12. ACM, New York, NY, USA, pp 207–218. https://​doi.​org/​10.​1145/​2259016.​2259044
Metadaten
Titel
A methodology correlating code optimizations with data memory accesses, execution time and energy consumption
verfasst von
Vasilios Kelefouras
Karim Djemame
Publikationsdatum
13.05.2019
Verlag
Springer US
Erschienen in
The Journal of Supercomputing / Ausgabe 10/2019
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-019-02880-z

Weitere Artikel der Ausgabe 10/2019

The Journal of Supercomputing 10/2019 Zur Ausgabe