Published in: Empirical Software Engineering 1/2024

1 February 2024

Silent bugs in deep learning frameworks: an empirical study of Keras and TensorFlow

Authors: Florian Tambon, Amin Nikanjam, Le An, Foutse Khomh, Giuliano Antoniol

Abstract

Deep Learning (DL) frameworks are now widely used; they simplify the creation of complex models and their integration into various applications, even for non-DL experts. However, like any other program, they are prone to bugs. This paper deals with a subcategory of bugs called silent bugs: they lead to wrong behavior but do not cause system crashes or hangs, nor do they show an error message to the user. Such bugs are even more dangerous in DL applications and frameworks because of the “black-box” and stochastic nature of DL systems (i.e., the end user cannot understand how the model makes decisions). This paper presents the first empirical study of silent bugs in TensorFlow, specifically its high-level API Keras, and their impact on users’ programs. We extracted closed issues related to the Keras API from the TensorFlow GitHub repository. Out of the 1,168 issues that we gathered, 77 were reproducible silent bugs affecting users’ programs. We categorized the bugs based on their effects on users’ programs and the components where the issues occurred, using information from the issue reports. We then derived a threat level for each issue, based on its impact on users’ programs. To assess the relevance of the identified categories and the impact scale, we conducted an online survey with 103 DL developers. The participants generally agreed that silent bugs in DL frameworks have a significant impact on users, and they acknowledged our findings (i.e., the categories of silent bugs and the proposed impact scale).

Metadata
Title
Silent bugs in deep learning frameworks: an empirical study of Keras and TensorFlow
Authors
Florian Tambon
Amin Nikanjam
Le An
Foutse Khomh
Giuliano Antoniol
Publication date
1 February 2024
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 1/2024
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-023-10389-6
