ABSTRACT
Applications must scale well to make efficient use of even medium-scale parallel systems. Because scaling problems are often difficult to diagnose, there is a critical need for scalable tools that guide scientists to the root causes of performance bottlenecks.
Although tracing is a powerful performance-analysis technique, tools that employ it can quickly become bottlenecks themselves. Moreover, to obtain actionable performance feedback for modular parallel software systems, it is often necessary to collect and present fine-grained, context-sensitive data: the very thing scalable tools avoid. While existing tracing tools can collect calling contexts, they do so only in a coarse-grained fashion, and no prior tool scalably presents both context- and time-sensitive data.
This paper describes how to collect, analyze, and present fine-grained call path traces for parallel programs. To scale our measurements, we combine asynchronous sampling, whose granularity is controlled by the sampling frequency, with a compact trace representation.
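To make the compact representation concrete, the following sketch (an illustration under assumed details, not HPCToolkit's actual implementation) shows the key idea: each asynchronous sample unwinds the call stack, interns the resulting call path in a calling context tree (CCT) so that shared prefixes are stored once, and appends only a small (timestamp, CCT-node id) record to the trace.

```python
# Sketch of a compact call path trace (illustrative; names are invented).
class CCT:
    """A calling context tree: call paths with common prefixes share nodes."""
    def __init__(self):
        self.children = {}      # (parent node id, frame name) -> node id
        self.next_id = 0

    def intern(self, call_path):
        """Return the CCT node id for a call path (outermost frame first)."""
        node = -1               # virtual root
        for frame in call_path:
            key = (node, frame)
            if key not in self.children:
                self.children[key] = self.next_id
                self.next_id += 1
            node = self.children[key]
        return node

cct = CCT()
trace = []                      # list of (time, cct_node_id) records

def on_sample(time, call_path):
    # Each sample costs one small record, regardless of call path depth.
    trace.append((time, cct.intern(call_path)))

# Two samples sharing the prefix main/solve share those CCT nodes:
on_sample(0.01, ["main", "solve", "mpi_wait"])
on_sample(0.02, ["main", "solve", "compute"])
```

Because each trace record is a fixed-size pair, trace volume grows with the sampling frequency rather than with call path depth or event rate.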
To present traces at multiple levels of abstraction and at arbitrary resolutions, we use sampling to render complementary slices of calling-context-sensitive trace data. Because our techniques are general, they can be used on applications that use different parallel programming models (MPI, OpenMP, PGAS). This work is implemented in HPCToolkit.
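The rendering-by-sampling idea above can be sketched as follows (assumed details, not the tool's actual code): for each horizontal pixel, binary-search the time-sorted trace for the record in effect at that pixel's midpoint, so rendering cost scales with the display width rather than with the trace length, and zooming simply narrows the time interval.

```python
import bisect

def render_line(records, t0, t1, width):
    """records: time-sorted (time, cct_node_id) pairs for one process/thread.
    Returns one CCT-node id per pixel for the time interval [t0, t1)."""
    times = [t for t, _ in records]
    pixels = []
    for px in range(width):
        # Sample the trace at this pixel's midpoint time.
        t = t0 + (px + 0.5) * (t1 - t0) / width
        i = bisect.bisect_right(times, t) - 1   # last record at or before t
        pixels.append(records[max(i, 0)][1])
    return pixels

trace = [(0.0, 7), (1.0, 9), (2.0, 7), (3.0, 11)]
print(render_line(trace, 0.0, 4.0, 4))   # -> [7, 9, 7, 11]
```

The same slicing applies along the process dimension: with one binary search per pixel, an arbitrarily long, arbitrarily wide trace can be drawn at any resolution in time proportional to the number of pixels.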