ABSTRACT
Applications must scale well to make efficient use of even medium-scale parallel systems. Because scaling problems are often difficult to diagnose, there is a critical need for scalable tools that guide scientists to the root causes of performance bottlenecks.
Although tracing is a powerful performance-analysis technique, tools that employ it can quickly become bottlenecks themselves. Moreover, to obtain actionable performance feedback for modular parallel software systems, it is often necessary to collect and present fine-grained, context-sensitive data: the very thing scalable tools avoid. While existing tracing tools can collect calling contexts, they do so only in a coarse-grained fashion, and no prior tool scalably presents both context- and time-sensitive data.
This paper describes how to collect, analyze, and present fine-grained call path traces for parallel programs. To scale our measurements, we combine asynchronous sampling, whose granularity is controlled by the sampling frequency, with a compact trace representation.
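To make the compact representation concrete, the following sketch (an illustration under assumed details, not HPCToolkit's actual implementation) shows the key idea: each asynchronous sample unwinds the call stack, interns the resulting call path in a calling context tree (CCT) so that shared prefixes are stored once, and appends only a small (timestamp, CCT-node id) record to the trace.

```python
# Sketch of a compact call path trace (illustrative; names are invented).
class CCT:
    """A calling context tree: call paths with common prefixes share nodes."""
    def __init__(self):
        self.children = {}      # (parent node id, frame name) -> node id
        self.next_id = 0

    def intern(self, call_path):
        """Return the CCT node id for a call path (outermost frame first)."""
        node = -1               # virtual root
        for frame in call_path:
            key = (node, frame)
            if key not in self.children:
                self.children[key] = self.next_id
                self.next_id += 1
            node = self.children[key]
        return node

cct = CCT()
trace = []                      # list of (time, cct_node_id) records

def on_sample(time, call_path):
    # Each sample costs one small record, regardless of call path depth.
    trace.append((time, cct.intern(call_path)))

# Two samples sharing the prefix main/solve share those CCT nodes:
on_sample(0.01, ["main", "solve", "mpi_wait"])
on_sample(0.02, ["main", "solve", "compute"])
```

Because each trace record is a fixed-size pair, trace volume grows with the sampling frequency rather than with call path depth or event rate.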
To present traces at multiple levels of abstraction and at arbitrary resolutions, we use sampling to render complementary slices of calling-context-sensitive trace data. Because our techniques are general, they can be used on applications that use different parallel programming models (MPI, OpenMP, PGAS). This work is implemented in HPCToolkit.
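The rendering-by-sampling idea above can be sketched as follows (assumed details, not the tool's actual code): for each horizontal pixel, binary-search the time-sorted trace for the record in effect at that pixel's midpoint, so rendering cost scales with the display width rather than with the trace length, and zooming simply narrows the time interval.

```python
import bisect

def render_line(records, t0, t1, width):
    """records: time-sorted (time, cct_node_id) pairs for one process/thread.
    Returns one CCT-node id per pixel for the time interval [t0, t1)."""
    times = [t for t, _ in records]
    pixels = []
    for px in range(width):
        # Sample the trace at this pixel's midpoint time.
        t = t0 + (px + 0.5) * (t1 - t0) / width
        i = bisect.bisect_right(times, t) - 1   # last record at or before t
        pixels.append(records[max(i, 0)][1])
    return pixels

trace = [(0.0, 7), (1.0, 9), (2.0, 7), (3.0, 11)]
print(render_line(trace, 0.0, 4.0, 4))   # -> [7, 9, 7, 11]
```

The same slicing applies along the process dimension: with one binary search per pixel, an arbitrarily long, arbitrarily wide trace can be drawn at any resolution in time proportional to the number of pixels.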