Top

International Journal of Parallel Programming

Published in:

23-05-2022

A Methodology for Efficient Tile Size Selection for Affine Loop Kernels

Authors: Vasilios Kelefouras, Karim Djemame, Georgios Keramidas, Nikolaos Voros

Published in: International Journal of Parallel Programming | Issue 3-4/2022

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

Reducing the number of data accesses in memory hierarchy is of paramount importance on modern computer systems. One of the key optimizations addressing this problem is loop tiling, a well-known loop transformation that enhances data locality in memory hierarchy. The selection of an appropriate tile size is tackled by using both static (analytical) and dynamic empirical (auto-tuning) methods. Current analytical models are not accurate enough to effectively model the complex modern memory hierarchies and loop kernels with diverse characteristics, while auto-tuning methods are either too time-consuming (due to the huge search space) or less accurate (when heuristics are used to reduce the search space). In this paper, we reveal two important inefficiencies of current analytical loop tiling methods and we provide the theoretical background on how current methods can address these inefficiencies. To this end, we propose a new loop tiling method for affine loop kernels where the cache size, cache line size and cache associativity are better utilized, compared to the existing methods. Our evaluation results prove the efficiency of the proposed method in terms of cache misses and execution time, against related works, icc/gcc compilers and Pluto tool, on x86 and ARM based platforms.

previous article An Improved/Optimized Practical Non-Blocking PageRank Algorithm for Massive Graphs*

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

Bondhugula, U., Hartono, A., Ramanujam, J., Sadayappan, P.: A practical automatic polyhedral parallelizer and locality optimizer. SIGPLAN Notices 43(6), 101–113 (2008). https://doi.org/10.1145/1379022.1375595CrossRef

Mehta, S., Beeraka, G., Yew, P.C.: Tile size selection revisited. ACM Trans. Archit. Code Optim. 10(4), 1–27 (2013)CrossRef

Tavarageri, S., Pouchet, L.N., Ramanujam, J., Rountev, A., Sadayappan, P.: Dynamic selection of tile sizes. In: Proceedings of the 2011 18th International Conference on High Performance Computing, HIPC ’11, p. 1–10. IEEE Computer Society (2011)

Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated empirical optimization of software and the ATLAS project. Parallel Comput. 27(1–2), 3–35 (2001)CrossRef

Sarkar, V., Megiddo, N.: An analytical model for loop tiling and its solution. In: 2000 IEEE International Symposium on Performance Analysis of Systems and Software. ISPASS (Cat. No.00EX422), pp. 146–153 (2000)

Chatterjee, S., Parker, E., Hanlon, P.J., Lebeck, A.R.: Exact analysis of the cache behavior of nested loops. ACM SIGPLAN Notices 36(5), 286–297 (2001)CrossRef

Narasimhan, K., Acharya, A., Baid, A., Bondhugula, U.: A practical tile size selection model for affine loop nests. In: Proceedings of the ACM International Conference on Supercomputing, ICS ’21, p. 27–39. Association for Computing Machinery, New York, NY (2021). https://doi.org/10.1145/3447818.3462213

Li, R., Sukumaran-Rajam, A., Veras, R., Low, T.M., Rastello, F., Rountev, A., Sadayappan, P.: Analytical cache modeling and tilesize optimization for tensor contractions. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’19. Association for Computing Machinery, New York, NY (2019). https://doi.org/10.1145/3295500.3356218

Hsu, Ch., Kremer, U.: A quantitative analysis of tile size selection algorithms. J. Supercomput. 27(3), 279–294 (2004). https://doi.org/10.1023/B:SUPE.0000011388.54204.8eCrossRefMATH

10.

Kelefouras, V., Kritikakou, A., Mporas, I., Kolonias, V.: A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures. J. Supercomput. 72(3), 804–844 (2016)CrossRef

11.

Kelefouras, V.I., Kritikakou, A., Goutis, C.: A Matrix–Matrix Multiplication methodology for single/multi-core architectures using SIMD. J. Supercomput. (2014). https://doi.org/10.1007/s11227-014-1098-9CrossRefMATH

12.

Kelefouras, V., Kritikakou, A., Papadima, E., Goutis, C.: A methodology for speeding up matrix vector multiplication for single/multi-core architectures. J. Supercomput. 71(7), 2644–2667 (2015)CrossRef

13.

Kelefouras, V.I., Athanasiou, G.S., Alachiotis, N., Michail, H.E., Kritikakou, A.S., Goutis, C.E.: A methodology for speeding up fast Fourier transform focusing on memory architecture utilization. IEEE Trans. Signal Process. 59(12), 6217–6226 (2011)MathSciNetCrossRef

14.

Li, Y., Sun, H., Pang, J.: Revisiting split tiling for stencil computations in polyhedral compilation. J. Supercomput. 78(1), 440–470 (2021)CrossRef

15.

Cohen, A., Zhao, J.: Flextended tiles: a flexible extension of overlapped tiles for polyhedral compilation. ACM Trans. Archit. Code Optim. (2020). https://doi.org/10.1145/3369382CrossRef

16.

Zhou, X., Giacalone, J.P., Garzarán, M.J., Kuhn, R.H., Ni, Y., Padua, D.: Hierarchical overlapped tiling. In: Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO ’12, p. 207-218. Association for Computing Machinery, New York, NY (2012). https://doi.org/10.1145/2259016.2259044

17.

Bondhugula, U., Bandishti, V., Pananilath, I.: Diamond tiling: tiling techniques to maximize parallelism for stencil computations. IEEE Trans. Parallel Distrib. Syst. 28(5), 1285–1298 (2017). https://doi.org/10.1109/TPDS.2016.2615094CrossRef

18.

Alshboul, M., Tuck, J., Solihin, Y.: Wet: write efficient loop tiling for non-volatile main memory. In: Proceedings of the 57th ACM/EDAC/IEEE Design Automation Conference, DAC ’20. IEEE Press (2020)

19.

Hartono, A., Baskaran, M.M., Bastoul, C., Cohen, A., Krishnamoorthy, S., Norris, B., Ramanujam, J., Sadayappan, P.: Parametric multi-level tiling of imperfectly nested loops. In: Proceedings of the 23rd International Conference on Supercomputing, ICS ’09, p. 147–157. Association for Computing Machinery, New York, NY (2009)

20.

Baskaran, M.M., Hartono, A., Tavarageri, S., Henretty, T., Ramanujam, J., Sadayappan, P.: Parameterized tiling revisited. In: CGO ’10, p. 200–209. Association for Computing Machinery, New York, NY (2010)

21.

Hartono, A., Baskaran, M., Ramanujam, J., Sadayappan, P.: Dyntile: parametric tiled loop generation for parallel execution on multicore processors. In: 2010 IEEE International Symposium on Parallel Distributed Processing (IPDPS), pp. 1–12 (2010)

22.

Renganarayanan, L., Kim, D., Strout, M.M., Rajopadhye, S.: Parameterized loop tiling. ACM Trans. Program. Lang. Syst. 34(1), 1–41 (2012)CrossRef

23.

Mehdi, A., Béatrice, C., Stéphanie, E., Ronan, K., Onil, G., Serge, G., Janice, O., François Xavier, P., Grégoire, P., Villalon., P.: Par4all : from convex array regions to heterogeneous computing. In: 2nd International Workshop on Polyhedral Compilation Techniques (2012)

24.

Tavarageri, S., Hartono, A., Baskaran, M., Pouchet, L.N., Ramanujam, J., Sadayappan, P.: Parametric tiling of affine loop nests. In: 15th Workshop on Compilers for Parallel Computing (CPC’10). Vienna, Austria (2010)

25.

Hammami, E., Slama, Y.: An overview on loop tiling techniques for code generation. In: 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications (AICCSA), pp. 280–287 (2017)

26.

Yuki, T., Renganarayanan, L., Rajopadhye, S., Anderson, C., Eichenberger, A.E., O’Brien, K.: Automatic creation of tile size selection models. In: Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’10, p. 190–199. Association for Computing Machinery, New York, NY (2010)

27.

Sato, Y., Yuki, T., Endo, T.: An autotuning framework for scalable execution of tiled code via iterative polyhedral compilation. ACM Trans. Archit. Code Optim. (2019). https://doi.org/10.1145/3293449CrossRef

28.

Abella, J.: Near-optimal loop tiling by means of cache miss equations and genetic algorithms. In: Proceedings of the 2002 International Conference on Parallel Processing Workshops, ICPPW ’02, p. 568. IEEE Computer Society (2002)

29.

Parsa, S., Lotfi, S.: A new genetic algorithm for loop tiling. J. Supercomput. 37, 249–269 (2006)CrossRef

30.

Chen, C., Chame, J., Hall, M.: Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy. In: Proceedings of the International Symposium on Code Generation and Optimization, CGO ’05, p. 111–122. IEEE Computer Society (2005)

31.

Shirako, J., Sharma, K., Fauzia, N., Pouchet, L.N., Ramanujam, J., Sadayappan, P., Sarkar, V.: Analytical bounds for optimal tile size selection. In: Proceedings of the 21st International Conference on Compiler Construction, CC’12, p. 101–121. Springer-Verlag, Berlin, Heidelberg (2012)

32.

Bao, B., Ding, C.: Defensive loop tiling for shared cache. In: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), CGO ’13, pp. 1–11. IEEE Computer Society, Washington, DC (2013). https://doi.org/10.1109/CGO.2013.6495008

33.

Kelefouras, V., Georgios, K., Nikolaos, V.: Combining software cache partitioning and loop tiling for effective shared cache management. ACM Trans. Embed. Comput. Syst. (2018). https://doi.org/10.1145/3202663CrossRef

34.

Nethercote, N., Walsh, R., Fitzhardinge, J.: Building workload characterization tools with valgrind. In: IISWC, p. 2. IEEE Computer Society (2006)

35.

Bao, W., Krishnamoorthy, S., Pouchet, L.N., Sadayappan, P.: Analytical modeling of cache behavior for affine programs. Proc. ACM Program. Lang. (2017). https://doi.org/10.1145/3158120CrossRef

36.

Gysi, T., Grosser, T., Brandner, L., Hoefler, T.: A fast analytical model of fully associative caches. In: Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2019, p. 816–829. Association for Computing Machinery, New York, NY (2019). https://doi.org/10.1145/3314221.3314606

37.

Wang, D., Sun, X.H.: APC: a novel memory metric and measurement methodology for modern memory systems. IEEE Trans. Comput. 63(7), 1626–1639 (2014). https://doi.org/10.1109/TC.2013.38MathSciNetCrossRefMATH

38.

Pouchet, L.: Polybench/c. http://web.cse.ohio-state.edu/~pouchet.2/software/polybench/. Accessed 10 Oct 2020

39.

Linux kernel profiling with perf. https://perf.wiki.kernel.org/index.php/Tutorial. Accessed 10 Oct 2020

Title: A Methodology for Efficient Tile Size Selection for Affine Loop Kernels
Authors: Vasilios Kelefouras
Karim Djemame
Georgios Keramidas
Nikolaos Voros
Publication date: 23-05-2022
Publisher: Springer US
Published in: International Journal of Parallel Programming / Issue 3-4/2022
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI: https://doi.org/10.1007/s10766-022-00734-5

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Other articles of this Issue 3-4/2022

An Improved/Optimized Practical Non-Blocking PageRank Algorithm for Massive Graphs*

A Scalable Similarity Join Algorithm Based on MapReduce and LSH

A Deterministic Portable Parallel Pseudo-Random Number Generator for Pattern-Based Programming of Heterogeneous Parallel Systems

The Celerity High-level API: C++20 for Accelerator Clusters

Premium Partner