2015 | OriginalPaper | Book chapter

Systematic Debugging of Concurrent Systems Using Coalesced Stack Trace Graphs

Written by: Diego Caminha B. de Oliveira, Zvonimir Rakamarić, Ganesh Gopalakrishnan, Alan Humphrey, Qingyu Meng, Martin Berzins

Published in: Languages and Compilers for Parallel Computing

Publisher: Springer International Publishing

Abstract

A central need in developing large-scale parallel systems is tools that help quickly identify the root causes of bugs. Given the massive scale of these systems, tools that highlight changes (say, those introduced across software versions or across operating conditions such as inputs and schedules) can prove highly effective in practice. Conventional debuggers, while good at presenting details at the problem site (e.g., a crash), often omit the contextual information needed to identify the root cause of a bug. We present a new approach to collecting and coalescing stack traces, leading to an efficient summary display of salient differences in system control flow in a graphical form called Coalesced Stack Trace Graphs (CSTGs). CSTGs have helped us debug situations within a computational framework called Uintah, which has been deployed at very large scale. In this paper, we detail CSTGs through case studies in the context of Uintah, debugging unexpected behaviors that are caused by different versions of the software or that occur across different time-steps of a system (e.g., due to non-determinism). We show that CSTGs also give conventional debuggers a far more productive and guided role to play.
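The idea of coalescing stack traces and comparing the results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the representation (caller-to-callee edges with occurrence counts), the function names `coalesce` and `diff`, and the sample frame names are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

Edge = Tuple[str, str]  # (caller, callee); hypothetical representation


def coalesce(traces: Iterable[Sequence[str]]) -> Counter:
    """Merge stack traces (outermost frame first) into edge counts."""
    graph: Counter = Counter()
    for trace in traces:
        # Each adjacent pair of frames is one caller -> callee edge.
        for caller, callee in zip(trace, trace[1:]):
            graph[(caller, callee)] += 1
    return graph


def diff(before: Counter, after: Counter) -> Dict[Edge, Tuple[int, int]]:
    """Return edges whose counts differ, as (count_before, count_after)."""
    return {
        edge: (before[edge], after[edge])
        for edge in sorted(before.keys() | after.keys())
        if before[edge] != after[edge]
    }


# Hypothetical traces from a working run and a failing run.
working = [
    ["main", "run", "schedule", "send"],
    ["main", "run", "schedule", "recv"],
]
failing = [
    ["main", "run", "schedule", "send"],
    ["main", "run", "schedule", "send"],
]

for edge, counts in diff(coalesce(working), coalesce(failing)).items():
    print(edge, counts)
```

In this toy example only the edges that differ between the two runs are reported, which is the kind of focused summary that can then direct a conventional debugger to the relevant call sites.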

Footnotes
1
Say, the line number where the call to the next function in the stack is made or the instrumentation code is found.
 
2
www.cs.utah.edu/fv/CSTG/. For ease of presentation, we simplify many of the function and variable names involved.
 
3
The zoomed out region of the CSTGs contains no information relevant for this study.
 
Metadata
Title
Systematic Debugging of Concurrent Systems Using Coalesced Stack Trace Graphs
Written by
Diego Caminha B. de Oliveira
Zvonimir Rakamarić
Ganesh Gopalakrishnan
Alan Humphrey
Qingyu Meng
Martin Berzins
Copyright year
2015
DOI
https://doi.org/10.1007/978-3-319-17473-0_21