DOI: 10.1145/1995896.1995908

Scalable fine-grained call path tracing

Published: 31 May 2011

ABSTRACT

Applications must scale well to make efficient use of even medium-scale parallel systems. Because scaling problems are often difficult to diagnose, there is a critical need for scalable tools that guide scientists to the root causes of performance bottlenecks.

Although tracing is a powerful performance-analysis technique, tools that employ it can quickly become bottlenecks themselves. Moreover, to obtain actionable performance feedback for modular parallel software systems, it is often necessary to collect and present fine-grained context-sensitive data --- the very thing scalable tools avoid. While existing tracing tools can collect calling contexts, they do so only in a coarse-grained fashion; and no prior tool scalably presents both context- and time-sensitive data.

This paper describes how to collect, analyze and present fine-grained call path traces for parallel programs. To scale our measurements, we use asynchronous sampling, whose granularity is controlled by a sampling frequency, and a compact representation.
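For concreteness, the following is a minimal sketch of asynchronous call path sampling with a compact trace representation. It is illustrative only, not HPCToolkit's implementation: a SIGPROF interval timer fires at a chosen frequency, and the handler captures the current call path, interns it in a dictionary of distinct paths, and appends a small (timestamp, path-id) record. All names here are hypothetical; backtrace() stands in for a real unwinder (HPCToolkit uses its own binary-analysis-based, async-signal-aware unwinding).

    /*
     * Minimal sketch of asynchronous call path sampling with a compact trace.
     * Illustrative only; backtrace() in a signal handler is not strictly
     * async-signal-safe, so production tools use specialized unwinders.
     *   compile: gcc -O2 -o sampler sampler.c
     */
    #include <execinfo.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>

    #define MAX_DEPTH   64
    #define MAX_PATHS   1024
    #define MAX_SAMPLES 100000

    typedef struct { int depth; void *pcs[MAX_DEPTH]; } path_t;
    typedef struct { long usec; int path_id; } sample_t;

    static path_t   paths[MAX_PATHS];   /* dictionary of distinct call paths */
    static int      n_paths;
    static sample_t trace[MAX_SAMPLES]; /* compact trace: one small record per sample */
    static int      n_samples;

    /* Return the id of an identical stored call path, adding it if new. */
    static int intern_path(void **pcs, int depth) {
        for (int i = 0; i < n_paths; i++)
            if (paths[i].depth == depth &&
                memcmp(paths[i].pcs, pcs, depth * sizeof(void *)) == 0)
                return i;
        if (n_paths == MAX_PATHS) return -1;
        paths[n_paths].depth = depth;
        memcpy(paths[n_paths].pcs, pcs, depth * sizeof(void *));
        return n_paths++;
    }

    /* SIGPROF handler: capture the call path, append a (time, path-id) record. */
    static void on_sample(int sig) {
        (void)sig;
        void *pcs[MAX_DEPTH];
        int depth = backtrace(pcs, MAX_DEPTH);
        struct timeval tv;
        gettimeofday(&tv, NULL);
        if (n_samples < MAX_SAMPLES) {
            trace[n_samples].usec    = tv.tv_sec * 1000000L + tv.tv_usec;
            trace[n_samples].path_id = intern_path(pcs, depth);
            n_samples++;
        }
    }

    static double spin(long iters) {    /* stand-in for application work */
        double x = 0.0;
        for (long i = 0; i < iters; i++) x += 1.0 / (double)(i + 1);
        return x;
    }

    int main(void) {
        void *warm[MAX_DEPTH];
        backtrace(warm, MAX_DEPTH);     /* pre-load unwind support outside the handler */

        signal(SIGPROF, on_sample);
        struct itimerval it = { {0, 5000}, {0, 5000} };  /* sample at ~200 Hz */
        setitimer(ITIMER_PROF, &it, NULL);

        double r = spin(200000000L);

        memset(&it, 0, sizeof it);      /* stop sampling */
        setitimer(ITIMER_PROF, &it, NULL);
        printf("result=%g samples=%d distinct paths=%d\n", r, n_samples, n_paths);
        return 0;
    }

Because each trace record holds only a timestamp and a path id, trace volume grows with the sampling frequency rather than with the number of procedure calls, which is what makes the measurement scalable.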

To present traces at multiple levels of abstraction and at arbitrary resolutions, we use sampling to render complementary slices of calling-context-sensitive trace data. Because our techniques are general, they can be used on applications that use different parallel programming models (MPI, OpenMP, PGAS). This work is implemented in HPCToolkit.
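As an illustration of rendering at arbitrary resolution, the sketch below (again illustrative, not HPCToolkit's trace viewer) draws one row of a time-centric view by sampling the trace: for each horizontal pixel it binary-searches the time-sorted records for a representative sample at that pixel's midpoint, so rendering cost depends on the display width rather than on the number of trace records.

    /*
     * Sketch of rendering a trace row at an arbitrary resolution by sampling.
     * Records are sorted by time; each pixel column gets one representative
     * record chosen by binary search. Illustrative names only.
     */
    #include <stdio.h>

    typedef struct { long usec; int path_id; } sample_t;

    /* Index of the last record with time <= t (0 if t precedes all records). */
    static int locate(const sample_t *trace, int n, long t) {
        int lo = 0, hi = n - 1, best = 0;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if (trace[mid].usec <= t) { best = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        return best;
    }

    /* Fill pixels[0..width) with the path id shown in each pixel column
       for the time interval [t0, t1). */
    static void render_row(const sample_t *trace, int n,
                           long t0, long t1, int width, int *pixels) {
        for (int px = 0; px < width; px++) {
            long t = t0 + (t1 - t0) * (2L * px + 1) / (2L * width); /* column midpoint */
            pixels[px] = trace[locate(trace, n, t)].path_id;
        }
    }

    int main(void) {
        sample_t trace[] = { {0, 0}, {100, 1}, {250, 1}, {400, 2}, {900, 0} };
        int pixels[8];
        render_row(trace, 5, 0, 1000, 8, pixels);
        for (int i = 0; i < 8; i++) printf("%d ", pixels[i]);  /* 0 1 1 2 2 2 2 0 */
        printf("\n");
        return 0;
    }

Zooming in simply re-runs the same rendering over a narrower [t0, t1), which is one way the same trace data can be viewed at arbitrary resolutions without rescanning every record.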


Published in

ICS '11: Proceedings of the international conference on Supercomputing
May 2011, 398 pages
ISBN: 9781450301022
DOI: 10.1145/1995896

Copyright © 2011 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States
