01.10.2016

GHOST: Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems

Authors: Moritz Kreutzer, Jonas Thies, Melven Röhrig-Zöllner, Andreas Pieper, Faisal Shahzad, Martin Galgon, Achim Basermann, Holger Fehske, Georg Hager, Gerhard Wellein

Published in: International Journal of Parallel Programming | Issue 5/2017

Abstract

While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring “standard” as well as “accelerated” resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The “General, Hybrid, and Optimized Sparse Toolkit” (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the “MPI+X” paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out and their necessity is justified by performance measurements or predictions based on performance models. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack. The library code and several applications are available as open source.
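
A note on the "MPI+X" paradigm mentioned in the abstract: it combines MPI for distributed-memory parallelism across processes with a node-level programming model such as OpenMP or CUDA inside each process. Below is a minimal, generic C sketch of that idea for a row-distributed sparse matrix-vector product y = A*x in CRS storage. It does not use GHOST's actual interface; the type and function names (crs_matrix, spmv_local) are hypothetical and chosen only for illustration.

/* Minimal MPI+OpenMP sketch of a row-distributed sparse matrix-vector
 * multiply (y = A*x) in CRS format. Generic illustration of "MPI+X",
 * not GHOST's API. Compile with, e.g., "mpicc -fopenmp". */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int nrows;        /* local number of rows            */
    int *rowptr;      /* CRS row pointers (nrows+1)      */
    int *col;         /* column indices of the nonzeros  */
    double *val;      /* nonzero values                  */
} crs_matrix;

/* Each MPI rank multiplies its local row block; OpenMP threads share the rows. */
static void spmv_local(const crs_matrix *A, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < A->nrows; i++) {
        double sum = 0.0;
        for (int j = A->rowptr[i]; j < A->rowptr[i + 1]; j++)
            sum += A->val[j] * x[A->col[j]];
        y[i] = sum;
    }
}

int main(int argc, char **argv)
{
    /* Request a threading level that allows OpenMP inside MPI processes. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    /* Toy data: every rank owns a 4x4 identity block of a global
       block-diagonal matrix, so no halo exchange is needed here. */
    int n = 4;
    crs_matrix A = { n, malloc((n + 1) * sizeof(int)),
                        malloc(n * sizeof(int)),
                        malloc(n * sizeof(double)) };
    double *x = malloc(n * sizeof(double)), *y = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) {
        A.rowptr[i] = i; A.col[i] = i; A.val[i] = 1.0; x[i] = i + 1.0;
    }
    A.rowptr[n] = n;

    spmv_local(&A, x, y);   /* y equals x for the identity block */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("y[0..3] = %g %g %g %g\n", y[0], y[1], y[2], y[3]);

    free(A.rowptr); free(A.col); free(A.val); free(x); free(y);
    MPI_Finalize();
    return 0;
}

In this sketch each MPI rank works on its own block of rows while OpenMP threads share that block. In a truly heterogeneous run as described in the abstract, each process would instead drive a CPU socket, a GPU, or a Xeon Phi, and halo entries of x would be exchanged between processes before the local multiply.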

Metadata
Title
GHOST: Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems
Authors
Moritz Kreutzer
Jonas Thies
Melven Röhrig-Zöllner
Andreas Pieper
Faisal Shahzad
Martin Galgon
Achim Basermann
Holger Fehske
Georg Hager
Gerhard Wellein
Publication date
01.10.2016
Publisher
Springer US
Published in
International Journal of Parallel Programming / Issue 5/2017
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-016-0464-z
