Published in: Empirical Software Engineering 3/2023

01.06.2023

Bugs in machine learning-based systems: a faultload benchmark

Authors: Mohammad Mehdi Morovati, Amin Nikanjam, Foutse Khomh, Zhen Ming (Jack) Jiang

Abstract

The rapid adoption of Machine Learning (ML) across domains has drawn increasing attention to the quality of ML components. A growing number of techniques and tools therefore aim to improve the quality of ML components and to integrate them safely into ML-based systems. Although most of these tools rely on bugs and their lifecycle, there is no standard benchmark of bugs against which to assess their performance, compare them, and discuss their strengths and weaknesses. In this study, we first investigate the reproducibility and verifiability of bugs in ML-based systems and identify the most important factors affecting each. We then explore the challenges of generating a benchmark of bugs in ML-based software systems and provide a bug benchmark, named defect4ML, that satisfies all criteria of a standard benchmark: relevance, reproducibility, fairness, verifiability, and usability. This faultload benchmark contains 100 bugs reported by ML developers on GitHub and Stack Overflow, involving two of the most popular ML frameworks: TensorFlow and Keras. defect4ML also addresses important challenges in Software Reliability Engineering of ML-based software systems, such as: 1) fast changes in frameworks, by providing bugs for different framework versions; 2) code portability, by delivering similar bugs in different ML frameworks; 3) bug reproducibility, by providing fully reproducible bugs with complete information about required dependencies and data; and 4) lack of detailed information on bugs, by presenting links to the bugs' origins. defect4ML can be of interest to practitioners and researchers working on ML-based systems who want to assess their testing tools and techniques.
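To make the reproducibility criterion concrete, the sketch below shows what a single reproducible bug entry in a faultload benchmark of this kind might record. This is a hypothetical illustration only: the field names, the `BugEntry` class, and the example values are assumptions for exposition, not the actual defect4ML schema. The key idea it captures is from the abstract: a bug is only fully reproducible when the framework version, dependencies, and both buggy and fixed code are pinned down, with a link back to the bug's origin.

```python
from dataclasses import dataclass, field

@dataclass
class BugEntry:
    """Hypothetical record for one reproducible ML bug (illustrative schema)."""
    bug_id: str
    framework: str            # e.g. "keras" or "tensorflow"
    framework_version: str    # pins the exact version the bug reproduces on
    source_url: str           # link back to the GitHub issue / Stack Overflow post
    dependencies: list[str] = field(default_factory=list)
    buggy_snippet: str = ""
    fixed_snippet: str = ""

    def is_reproducible(self) -> bool:
        # Minimal check: the environment must be pinned and both the
        # buggy and the fixed version of the code must be available.
        return bool(self.framework_version
                    and self.buggy_snippet
                    and self.fixed_snippet)

# Illustrative entry; URL and version numbers are placeholders.
entry = BugEntry(
    bug_id="DL-001",
    framework="keras",
    framework_version="2.3.1",
    source_url="https://stackoverflow.com/q/00000000",
    dependencies=["numpy==1.18.5"],
    buggy_snippet="model.fit(x, y)  # shape mismatch on input",
    fixed_snippet="model.fit(x.reshape(-1, 28, 28, 1), y)",
)
print(entry.is_reproducible())  # True
```

Recording the origin link alongside the pinned environment is what lets a benchmark address both the "fast changes in frameworks" and the "lack of detailed information" challenges the abstract lists.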


Metadata
Title
Bugs in machine learning-based systems: a faultload benchmark
Authors
Mohammad Mehdi Morovati
Amin Nikanjam
Foutse Khomh
Zhen Ming (Jack) Jiang
Publication date
01.06.2023
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 3/2023
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-023-10291-1
