2019 | Original Paper | Book chapter

PerfMemPlus: A Tool for Automatic Discovery of Memory Performance Problems

Written by: Christian Helm, Kenjiro Taura

Published in: High Performance Computing

Publisher: Springer International Publishing


Abstract

In high-performance computing, many performance problems are caused by the memory system. Because such performance bugs are hard to identify, analysis tools play an important role in performance optimization. Today’s processors offer feature-rich performance monitoring units with support for instruction sampling, but existing tools use this data only partially. Previously, performance counters were used to measure memory bandwidth, but attributing high bandwidth usage to source code has been difficult and imprecise. We introduce a novel method for identifying performance-degrading bandwidth usage and attributing it to specific objects and source code lines. This paper also introduces a new method for false sharing detection. It can differentiate false from true sharing and identify the objects and source code lines where accesses to falsely shared objects occur. It can uncover false sharing that has been overlooked by previous tools. PerfMemPlus automatically reports these issues using instruction sampling data captured in a single profiling run, which simplifies the tedious search for the location of performance problems in complex code. The tool design is simple and supports many existing and upcoming processors, and the recorded data can easily be reused in future research. We show that PerfMemPlus can automatically report performance problems without producing false positives. Additionally, we present case studies that show how PerfMemPlus can pinpoint memory performance problems in the PARSEC benchmarks and machine learning applications.
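
The distinction between false and true sharing that the abstract refers to can be illustrated with a small sketch. This is not the PerfMemPlus algorithm, which the abstract does not describe; it applies only the textbook definitions to a list of hypothetical sampled accesses. The `Sample` record, the `classify_sharing` function, the 64-byte cache line size, and the simplification that each access touches a single byte address are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple

CACHE_LINE_BYTES = 64  # assumption: a common cache line size, not taken from the paper


class Sample(NamedTuple):
    """One hypothetical sampled memory access (illustrative record, not a real tool format)."""
    thread: int
    address: int
    is_store: bool


def classify_sharing(samples):
    """Label each cache line index as 'true', 'false', or 'none'.

    'none'  : the line is used by one thread only, or is never written.
    'true'  : at least two threads touch the same address and one of them writes it.
    'false' : the line is written and used by several threads, but no single
              address is contended (the threads touch different bytes).
    Simplification: access width is ignored, so each access is one address.
    """
    by_line = defaultdict(list)
    for s in samples:
        by_line[s.address // CACHE_LINE_BYTES].append(s)

    result = {}
    for line, accesses in by_line.items():
        threads = {s.thread for s in accesses}
        has_store = any(s.is_store for s in accesses)
        if len(threads) < 2 or not has_store:
            result[line] = "none"
            continue

        # Group the accesses of this line by exact address.
        per_addr = defaultdict(list)
        for s in accesses:
            per_addr[s.address].append(s)

        truly_shared = any(
            len({s.thread for s in group}) > 1 and any(s.is_store for s in group)
            for group in per_addr.values()
        )
        result[line] = "true" if truly_shared else "false"
    return result


# Hypothetical samples: three cache lines with different sharing behavior.
samples = [
    Sample(thread=0, address=0x1000, is_store=True),   # line 64: threads write
    Sample(thread=1, address=0x1008, is_store=True),   # different bytes
    Sample(thread=0, address=0x2000, is_store=True),   # line 128: same address,
    Sample(thread=1, address=0x2000, is_store=False),  # one writer, one reader
    Sample(thread=0, address=0x3000, is_store=False),  # line 192: read-only
    Sample(thread=1, address=0x3008, is_store=False),
]
labels = classify_sharing(samples)
```

In this sketch, line 64 is labeled as false sharing because both threads write to the same line at different addresses, line 128 as true sharing, and line 192 as unshared because it is never written. A real profiler works from sparse samples, so it would also need to account for access width and for accesses that were never sampled.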

Metadata
Title
PerfMemPlus: A Tool for Automatic Discovery of Memory Performance Problems
Authors
Christian Helm
Kenjiro Taura
Copyright year
2019
DOI
https://doi.org/10.1007/978-3-030-20656-7_11