
2019 | OriginalPaper | Chapter

PerfMemPlus: A Tool for Automatic Discovery of Memory Performance Problems

Authors: Christian Helm, Kenjiro Taura

Published in: High Performance Computing

Publisher: Springer International Publishing


Abstract

In high-performance computing, many performance problems are caused by the memory system. Because such performance bugs are hard to identify, analysis tools play an important role in performance optimization. Today’s processors offer feature-rich performance monitoring units with support for instruction sampling, but existing tools use this data only partially. Performance counters have previously been used to measure memory bandwidth, but attributing high bandwidth to source code has been difficult and imprecise. We introduce a novel method for identifying performance-degrading bandwidth usage and attributing it to specific objects and source code lines. This paper also introduces a new method for false sharing detection. It can differentiate false from true sharing and identify the objects and source code lines where accesses to falsely shared objects occur. It can also uncover false sharing that previous tools have overlooked. PerfMemPlus automatically reports these issues using instruction sampling data captured in a single profiling run. This simplifies the tedious search for the location of performance problems in complex code. The tool's design is simple, it supports many existing and upcoming processors, and the recorded data can easily be used in future research. We show that PerfMemPlus can automatically report performance problems without producing false positives. Additionally, we present case studies that show how PerfMemPlus can pinpoint memory performance problems in the PARSEC benchmarks and machine learning applications.
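The distinction between false and true sharing mentioned in the abstract can be illustrated with a small sketch. This is not the PerfMemPlus algorithm, which the abstract does not describe; it is only the textbook definition applied to a list of sampled accesses. The `Sample` record, the `classify_sharing` function, and the 64-byte line size are all assumptions made for illustration, and accesses are assumed not to cross a cache line boundary.

```python
from collections import defaultdict
from typing import NamedTuple

CACHE_LINE = 64  # bytes; assumed cache line size


class Sample(NamedTuple):
    """Hypothetical sampled memory access (not a real perf record layout)."""
    thread: int
    addr: int
    size: int
    is_write: bool


def classify_sharing(samples):
    """Map each contended cache line base address to "true" or "false" sharing.

    A line is contended when two different threads access it and at least
    one of the two accesses is a write. It counts as true sharing if any
    such pair touches overlapping bytes, and as false sharing otherwise.
    """
    by_line = defaultdict(list)
    for s in samples:
        by_line[s.addr // CACHE_LINE * CACHE_LINE].append(s)

    result = {}
    for line, accesses in by_line.items():
        conflict = False
        overlap = False
        for i, a in enumerate(accesses):
            for b in accesses[i + 1:]:
                # Only cross-thread pairs with at least one write matter.
                if a.thread == b.thread or not (a.is_write or b.is_write):
                    continue
                conflict = True
                # Byte ranges [addr, addr + size) intersect.
                if a.addr < b.addr + b.size and b.addr < a.addr + a.size:
                    overlap = True
        if conflict:
            result[line] = "true" if overlap else "false"
    return result


samples = [
    Sample(0, 0x1000, 8, True),   # thread 0 writes bytes 0..7 of a line
    Sample(1, 0x1008, 8, True),   # thread 1 writes bytes 8..15 of the same line
    Sample(0, 0x2000, 4, True),   # thread 0 writes a word
    Sample(1, 0x2000, 4, False),  # thread 1 reads the same word
    Sample(0, 0x3000, 8, False),  # read-only line, never contended
    Sample(1, 0x3000, 8, False),
]
report = classify_sharing(samples)
```

In this example the line at 0x1000 is reported as false sharing because the two threads write disjoint bytes, the line at 0x2000 is true sharing, and the read-only line at 0x3000 is not reported at all. A sampling-based tool sees only a fraction of all accesses, so a real detector would also need to account for missed samples and for timing, which this sketch ignores.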


Metadata
Title
PerfMemPlus: A Tool for Automatic Discovery of Memory Performance Problems
Authors
Christian Helm
Kenjiro Taura
Copyright Year
2019
DOI
https://doi.org/10.1007/978-3-030-20656-7_11
