Published in: International Journal of Parallel Programming 3/2017

29-04-2016

Adaptive Runtime-Assisted Block Prefetching on Chip-Multiprocessors

Authors: Victor Garcia, Alejandro Rico, Carlos Villavieja, Paul Carpenter, Nacho Navarro, Alex Ramirez

Abstract

Memory stalls are a significant source of performance degradation in modern processors. Data prefetching is a widely adopted and well-studied technique for alleviating this problem. Prefetching can be performed by the hardware, or initiated and controlled by software. Software-controlled prefetching covers a wide variety of schemes, including runtime-directed prefetching and, more specifically, runtime-directed block prefetching. This paper proposes a hybrid prefetching mechanism that integrates a software-driven block prefetcher with existing hardware prefetching techniques. Our runtime-assisted software prefetcher brings large blocks of data on-chip with the support of a low-cost hardware engine, and complements existing hardware prefetchers that manage locality at a finer granularity. The runtime system that drives the prefetch engine dynamically selects which cache to prefetch into. Our evaluation on a set of scientific benchmarks shows a maximum speedup of 32 % and an average speedup of 10 % over a baseline with hardware prefetching only. As a result, energy-to-solution is also reduced by up to 18 %, and by 3 % on average.
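The abstract says that the runtime system selects the destination cache dynamically, but it does not say how. As a purely illustrative sketch, the following assumes a simple capacity-based heuristic: prefetch a block into the cache closest to the core that can hold it within a fixed share of its capacity. The cache sizes, the `fill_fraction` parameter, and all names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheLevel:
    name: str
    capacity_bytes: int


# Hypothetical hierarchy, ordered from closest to the core to farthest.
# These sizes are assumptions for illustration, not the evaluated system.
HIERARCHY = (
    CacheLevel("L1", 32 * 1024),
    CacheLevel("L2", 256 * 1024),
    CacheLevel("L3", 8 * 1024 * 1024),
)


def select_target_cache(block_bytes, hierarchy=HIERARCHY, fill_fraction=0.5):
    """Pick the closest cache level whose budget can hold the block.

    fill_fraction is an assumed knob: the share of a cache that a single
    prefetched block may occupy, leaving room for other live data.
    Returns the level name, or None if the block fits in no level.
    """
    for level in hierarchy:
        if block_bytes <= level.capacity_bytes * fill_fraction:
            return level.name
    return None
```

A runtime that knows the data footprint of the next task could call `select_target_cache(footprint)` before issuing the block prefetch, and skip the prefetch when the result is `None`. A real policy would probably also account for cache sharing between cores and for observed prefetch usefulness, which this sketch ignores.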

Metadata
Title
Adaptive Runtime-Assisted Block Prefetching on Chip-Multiprocessors
Authors
Victor Garcia
Alejandro Rico
Carlos Villavieja
Paul Carpenter
Nacho Navarro
Alex Ramirez
Publication date
29-04-2016
Publisher
Springer US
Published in
International Journal of Parallel Programming / Issue 3/2017
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-016-0431-8
