ABSTRACT
The growing application of deep neural networks in safety-critical domains makes the analysis of faults in such systems critically important. In this paper we introduce a large taxonomy of faults in deep learning (DL) systems. We manually analysed 1059 artefacts gathered from GitHub commits and issues of projects that use the most popular DL frameworks (TensorFlow, Keras and PyTorch) and from related Stack Overflow posts. Structured interviews with 20 researchers and practitioners, who described the problems they had encountered in their own work, enriched the taxonomy with a variety of additional faults that did not emerge from the other two sources. We validated the final taxonomy with a survey of an additional 21 developers, which confirmed that almost all fault categories (13/15) had been experienced by at least 50% of the survey participants.
Index Terms
- Taxonomy of real faults in deep learning systems