Silent bugs in deep learning frameworks: an empirical study of Keras and TensorFlow

Written by: Florian Tambon, Amin Nikanjam, Le An, Foutse Khomh, Giuliano Antoniol

Published in: Empirical Software Engineering | Issue 1/2024

Published: 1 February 2024

Abstract

Deep Learning (DL) frameworks are now widely used, simplifying the creation of complex models as well as their integration into various applications, even among non-DL experts. However, like any other program, they are prone to bugs. This paper deals with a subcategory of bugs named silent bugs: they lead to wrong behavior but do not cause system crashes or hangs, nor show an error message to the user. Such bugs are even more dangerous in DL applications and frameworks due to the “black-box” and stochastic nature of DL systems (i.e., the end user cannot understand how the model makes decisions). This paper presents the first empirical study of silent bugs in TensorFlow, specifically its high-level API Keras, and their impact on users’ programs. We extracted closed issues related to the Keras API from the TensorFlow GitHub repository. Out of the 1,168 issues that we gathered, 77 were reproducible silent bugs affecting users’ programs. We categorized the bugs based on their effects on the users’ programs and the components where the issues occurred, using information from the issue reports. We then derived a threat level for each of the issues, based on the impact they had on the users’ programs. To assess the relevance of the identified categories and the impact scale, we conducted an online survey with 103 DL developers. The participants generally agreed that silent bugs in DL frameworks have a significant impact on users, and acknowledged our findings (i.e., the categories of silent bugs and the proposed impact scale).
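To make the notion of a silent bug concrete, the following is a minimal sketch in plain Python, not taken from the paper or from TensorFlow/Keras. The functions `reference_accuracy`, `buggy_accuracy`, and `is_silent_bug` are hypothetical names introduced only for illustration: the "buggy" metric accepts a `sample_weight` argument but never uses it, so it returns a plausible yet wrong value without raising any error.

```python
def reference_accuracy(y_true, y_pred, sample_weight=None):
    """Reference oracle (assumed correct): weighted fraction of correct predictions."""
    weights = sample_weight if sample_weight is not None else [1.0] * len(y_true)
    correct = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return correct / sum(weights)


def buggy_accuracy(y_true, y_pred, sample_weight=None):
    """Hypothetical silent bug: sample_weight is accepted but silently ignored."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def is_silent_bug(func, oracle, *args, tol=1e-9, **kwargs):
    """Return True if func completes normally but disagrees with the oracle."""
    try:
        got = func(*args, **kwargs)
    except Exception:
        return False  # a crash or error message is loud, not silent
    return abs(got - oracle(*args, **kwargs)) > tol


# Illustrative data (made up for this sketch).
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
weights = [1.0, 1.0, 6.0, 1.0]

# With weights, the buggy metric reports 0.75 while the oracle gives 3/9.
print(is_silent_bug(buggy_accuracy, reference_accuracy,
                    y_true, y_pred, sample_weight=weights))
# Without weights, both agree, so nothing is flagged.
print(is_silent_bug(buggy_accuracy, reference_accuracy, y_true, y_pred))
```

The check relies on having a trusted oracle, which in practice is often unavailable for DL code; that is part of why such bugs can go unnoticed.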


Title: Silent bugs in deep learning frameworks: an empirical study of Keras and TensorFlow
Written by: Florian Tambon
Amin Nikanjam
Le An
Foutse Khomh
Giuliano Antoniol
Publication date: 1 February 2024
Publisher: Springer US
Published in: Empirical Software Engineering / Issue 1/2024
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI: https://doi.org/10.1007/s10664-023-10389-6
