Top

International Journal of Parallel Programming

Published in:

29-04-2016

Adaptive Runtime-Assisted Block Prefetching on Chip-Multiprocessors

Authors: Victor Garcia, Alejandro Rico, Carlos Villavieja, Paul Carpenter, Nacho Navarro, Alex Ramirez

Published in: International Journal of Parallel Programming | Issue 3/2017

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

Memory stalls are a significant source of performance degradation in modern processors. Data prefetching is a widely adopted and well studied technique used to alleviate this problem. Prefetching can be performed by the hardware, or be initiated and controlled by software. Among software controlled prefetching we find a wide variety of schemes, including runtime-directed prefetching and more specifically runtime-directed block prefetching. This paper proposes a hybrid prefetching mechanism that integrates a software driven block prefetcher with existing hardware prefetching techniques. Our runtime-assisted software prefetcher brings large blocks of data on-chip with the support of a low cost hardware engine, and synergizes with existing hardware prefetchers that manage locality at a finer granularity. The runtime system that drives the prefetch engine dynamically selects which cache to prefetch to. Our evaluation on a set of scientific benchmarks obtains a maximum speed up of 32 and 10 % on average compared to a baseline with hardware prefetching only. As a result, we also achieve a reduction of up to 18 and 3 % on average in energy-to-solution.

previous article Adaptive Optimization -Minimization Solvers on GPU

next article RT-CUDA: A Software Tool for CUDA Code Restructuring

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

ARM, Cortex-A9 Technical Reference Manual. http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388f/DDI0388F_cortex_a9_r2p2_trm.pdf (2008). Accessed 10 Nov 2014

Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.-A.: Starpu: a unified platform for task scheduling on heterogeneous multicore architectures. Concurr. Comput. Pract. Exp. 23(2), 187–198 (2011)CrossRef

Baer, J.-L., Chen, T.-F.: Effective hardware-based data prefetching for high-performance processors. IEEE Trans. Comput. 44(5), 609–623 (1995)CrossRefMATH

Byna, S., Chen, Y., Sun, X.H.: A taxonomy of data prefetching mechanisms. In: 2008 International Symposium on Parallel Architectures, Algorithms, and Networks (i-span 2008), Sydney, NSW, pp. 19–24 (2008). doi:10.1109/I-SPAN.2008.24

Chamberlain, B., Callahan, D., Zima, H.: Parallel programmability and the chapel language. Int. J. High Perform. Comput. Appl. 21(3), 291–312 (2007)CrossRef

Charles, P., Grothoff, C., Saraswat, V., Donawa, C., Kielstra, A., Ebcioglu, K., von Praun, C., Sarkar, V.: X10: An object-oriented approach to non-uniform cluster computing. SIGPLAN Not. 40(10), 519–538

Chen, T.-F., Baer, J.-L.: A performance study of software and hardware data prefetching Schemes. In: Proceedings the 21st Annual International Symposium on Computer Architecture, 1994, Chicago, IL, pp. 223–232 (1994)

Chung, I.H., Hollingsworth, J.K.: A case study using automatic performance tuning for large-scale scientific programs. In: 15th IEEE International Conference on High Performance Distributed Computing, Paris, 2006, pp. 45–56 (2006). doi:10.1109/HPDC.2006.1652135

OpenMP Consortium. http://openmp.org/wp/ (2014). Accessed 25 July 2014

10.

Dahlgren, F., Dubois, M., Stenstrom, P.: Fixed and adaptive sequential prefetching in shared memory multiprocessors. In: International Conference on Parallel Processing, 1993. ICPP 1993, Syracuse, NY, pp. 56–63 (1993)

11.

Dahlgren, F., Stenstrom, P.: Effectiveness of hardware-based stride and sequential prefetching in shared-memory multiprocessors. In: Proceedings., First IEEE Symposium on High-Performance Computer Architecture, 1995, Raleigh, NC, pp. 68–77 (1995). doi:10.1109/HPCA.1995.386554

12.

GCC Developers: GCC Optimization Options. https://gcc.gnu.org/onlinedocs/gcc-4.0.4/gcc/Optimize-Options.html (2014). Accessed 14 Aug 2014

13.

Duran, A., Ayguadé, E., Badia, R.M., Labarta, J., Martinell, L., Martorell, X., Planas, J.: Ompss: a proposal for programming heterogeneous multi-core architectures. Parallel Process. Lett. 21(2), 173–193 (2011)MathSciNetCrossRef

14.

Ebrahimi, E., Lee, C.J., Mutlu, O., Patt, Y.N.: Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems. SIGARCH Comput. Archit. News 38(1), 335–346 (2010)CrossRef

15.

Fatahalian, K., Horn, D.R., Knight, T.J., Leem, L., Houston, M., Park, J.Y., Erez, M., Ren, M., Aiken, A., Dally, W.J., Hanrahan, P.: Sequoia: programming the memory hierarchy. In: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC ’06, ACM, New York, NY, USA (2006)

16.

Feng, X., Cameron, K.W., Buell, D.A.: PBPI: a high performance implementation of bayesian phylogenetic inference. In: Proceedings of the ACM/IEEE, SC 2006 Conference, Tampa, FL, pp. 40 (2006). doi:10.1109/SC.2006.47

17.

Frigo, M., Leiserson, C.E., Randall, K.H.: The implementation of the cilk-5 multithreaded language. In: Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, PLDI ’98, pp. 212–223, ACM, New York, NY, USA (1998)

18.

Gornish, E.H., Granston, E.D., Veidenbaum, A.V.: Compiler-directed data prefetching in multiprocessors with memory hierarchies. In: In International Conference on Supercomputing, pp. 354–368 (1990)

19.

Guo, Y., Narayanan, P., Bennaser, M., Chheda, S., Moritz, C.: Energy-efficient hardware data prefetching. Very Large Scale Integration (VLSI) Syst. IEEE Trans. 19(2), 250–263 (2011)CrossRef

20.

D. Lowenthal and M. James. Run-time selection of block size in pipelined parallel programs. In: Parallel Processing, 1999. Proceedings13th International and 10th Symposium on Parallel and Distributed Processing, 1999. 1999 IPPS/SPDP, pp. 82–87

21.

Lu, J.: Design and Implementation of a Lightweight Runtime Optimization System on Modern Computer Architectures. Ph.D. Thesis, Minneapolis, MN, USA, AAI3220014 (2006)

22.

Luk, C.-K., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S., Reddi, V.J., Hazelwood, K.: Pin: building customized program analysis tools with dynamic instrumentation. In: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’05, pp. 190–200, New York, NY, USA, 2005. ACM

23.

Martonosi, M.R.: Analyzing and tuning memory performance in sequential and parallel programs. Technical report, Stanford, CA, USA (1994)

24.

Mowry, T., Gupta, A.: Tolerating latency through software-controlled prefetching in shared-memory multiprocessors. J. Parallel Distrib. Comput. 12, 87–106 (1991)CrossRef

25.

Nesbit, K., Smith, J.: Data cache prefetching using a global history buffer. In: Software, IEE Proceedings, p. 96 (2004)

26.

Papaefstathiou, V., Katevenis, M.G., Nikolopoulos, D.S., Pnevmatikatos, D.: Prefetching and cache management using task lifetimes. In: Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, ICS ’13, pp. 325–334, ACM, New York, NY, USA (2013)

27.

Reinders, J.: Intel Threading Building Blocks, 1st edn. O’Reilly and Associates Inc, Sebastopol (2007)

28.

Rico, A., Cabarcas, F., Villavieja, C., Pavlovic, M., Vega, A., Etsion, Y., Ramirez, A., Valero, M.: On the simulation of large-scale architectures using multiple application abstraction levels. ACM Trans. Archit. Code Optim. 8(4), 36:1–36:20 (2012)CrossRef

29.

Rico, A., Ramirez, A., Valero, M.: Available task-level parallelism on the cell BE. Sci. Program. 17(1–2), 59–76 (2009)

30.

Rothberg, E., Singh, J.P., Gupta, A.: Working sets, cache sizes, and node granularity issues for large-scale multiprocessors. In: Proceedings of the 20th Annual International Symposium on Computer Architecture, 1993, pp. 14–25 (1993)

31.

Solihin, Y., Lee, J., Torrellas, J.: Using a user-level memory thread for correlation prefetching. In: Proceedings of the 29th Annual International Symposium on Computer Architecture, ISCA ’02, pp. 171–182, IEEE Computer Society, Washington, DC, USA (2002)

32.

Tandri, S., Abdelrahman, T.S.: Automatic partitioning of data and computations on scalable shared memory multiprocessors. In: Proceedings of the 1997 International Conference on Parallel Processing, 1997, Bloomington, IL, pp. 64–73 (1997). doi:10.1109/ICPP.1997.622557

33.

Tullsen, D.M., Eggers, S.J.: Effective cache prefetching on bus-based multiprocessors. ACM Trans. Comput. Syst. 13(1), 57–88 (1995)CrossRef

34.

Villavieja, C., Karakostas, V., Vilanova, L., Etsion, Y., Ramirez, A., Mendelson, A., Navarro, N., Cristal, A., Unsal, O.S.: Didi: Mitigating the performance impact of tlb shootdowns using a shared tlb directory. In: 2011 International Conference on Parallel Architectures and Compilation Techniques (PACT), Galveston, TX, pp. 340–349 (2011)

35.

Wall, M.: Using block prefetch for optimized memory performance. http://web.mit.edu/ehliu/Public/ProjectX/Meetings/AMD_block_prefetch_paper.pdf (2001). Accessed 10 July 2014

36.

Wulf, W.A., McKee, S.A.: Hitting the memory wall: implications of the obvious. SIGARCH Comput. Archit. News 23(1), 20–24 (1995)CrossRef

Title: Adaptive Runtime-Assisted Block Prefetching on Chip-Multiprocessors
Authors: Victor Garcia
Alejandro Rico
Carlos Villavieja
Paul Carpenter
Nacho Navarro
Alex Ramirez
Publication date: 29-04-2016
Publisher: Springer US
Published in: International Journal of Parallel Programming / Issue 3/2017
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI: https://doi.org/10.1007/s10766-016-0431-8

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Other articles of this Issue 3/2017

Panda: A Compiler Framework for Concurrent CPUGPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers

RT-CUDA: A Software Tool for CUDA Code Restructuring

Adaptive Optimization -Minimization Solvers on GPU

Scalable Loop Self-Scheduling Schemes for Large-Scale Clusters and Cloud Systems

A Wait-Free Hash Map

Porting the PLASMA Numerical Library to the OpenMP Standard

Premium Partner