ABSTRACT
The number of hardware threads is growing with each new generation of multicore chips; thus, one must effectively use threads to fully exploit emerging processors. OpenMP is a popular directive-based programming model that helps programmers exploit thread-level parallelism. In this paper, we describe the design and implementation of a novel performance tool for OpenMP. Our tool distinguishes itself from existing OpenMP performance tools in two principal ways. First, we develop a measurement methodology that attributes blame for work and inefficiency back to program contexts. We show how to integrate prior work on measurement methodologies that employ directed and undirected blame shifting and extend the approach to support dynamic thread-level parallelism in both time-shared and dedicated environments. Second, we develop a novel deferred context resolution method that supports online attribution of performance metrics to full calling contexts within an OpenMP program execution. This approach enables us to collect compact call path profiles for OpenMP program executions without the need for traces. Support for our approach is an integral part of an emerging standard performance tool application programming interface for OpenMP. We demonstrate the effectiveness of our approach by applying our tool to analyze four well-known application benchmarks that cover the spectrum of OpenMP features. In case studies with these benchmarks, insights from our tool helped us significantly improve the performance of these codes.
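The blame-shifting idea described above can be illustrated with a small arithmetic sketch: when a profiler sample lands in an idle thread, rather than charging the idleness to that thread, its cost is apportioned among the threads doing useful work, so inefficiency is attributed to the contexts responsible for it. The function below is a hypothetical illustration of that apportioning, not the tool's actual API; all names are invented for this sketch.

```c
/* Sketch of undirected blame shifting (illustrative only):
   at a sample event with `total_threads` threads, of which
   `working_threads` are doing useful work, each idle thread's
   sample is split evenly across the working threads, so every
   working context absorbs `idle / working` extra units of blame. */
double blame_per_worker(int total_threads, int working_threads)
{
    int idle = total_threads - working_threads;
    if (working_threads == 0)
        return 0.0;   /* no one to blame: all threads are idle */
    return (double)idle / (double)working_threads;
}
```

For example, with 8 threads of which only 2 are working, the 6 idle samples shift 3 units of blame onto each working thread's calling context; when all threads are working, no blame is shifted.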