
Taxonomy of real faults in deep learning systems

Authors:
Nargiz Humbatova, Università della Svizzera italiana, Lugano, Switzerland
Gunel Jahangirova, Università della Svizzera italiana, Lugano, Switzerland
Gabriele Bavota, Università della Svizzera italiana, Lugano, Switzerland
Vincenzo Riccio, Università della Svizzera italiana, Lugano, Switzerland
Andrea Stocco, Università della Svizzera italiana, Lugano, Switzerland
Paolo Tonella, Università della Svizzera italiana, Lugano, Switzerland

ICSE '20: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, June 2020, Pages 1110–1121
https://doi.org/10.1145/3377811.3380395

Published: 01 October 2020

ABSTRACT

The growing application of deep neural networks in safety-critical domains makes the analysis of faults that occur in such systems of enormous importance. In this paper we introduce a large taxonomy of faults in deep learning (DL) systems. We have manually analysed 1059 artefacts gathered from GitHub commits and issues of projects that use the most popular DL frameworks (TensorFlow, Keras and PyTorch) and from related Stack Overflow posts. Structured interviews with 20 researchers and practitioners describing the problems they have encountered in their experience have enriched our taxonomy with a variety of additional faults that did not emerge from the other two sources. Our final taxonomy was validated with a survey involving an additional set of 21 developers, confirming that almost all fault categories (13/15) were experienced by at least 50% of the survey participants.

Index Terms

Software and its engineering → Software creation and management → Software verification and validation


Published in
ICSE '20: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering
June 2020
1640 pages
ISBN:9781450371216
DOI:10.1145/3377811
General Chairs: Gregg Rothermel (North Carolina State University), Doo-Hwan Bae (KAIST, South Korea)
Copyright © 2020 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Badges
- Artifacts Available / v1.1
- Artifacts Evaluated & Reusable / v1.1
Author Tags
deep learning
real faults
software testing
taxonomy
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 276 of 1,856 submissions, 15%

