DeepCrime: mutation testing of deep learning systems based on real faults

Authors:
Nargiz Humbatova, USI Lugano, Switzerland (ORCID 0000-0002-3037-8368)
Gunel Jahangirova, USI Lugano, Switzerland (ORCID 0000-0002-1423-1083)
Paolo Tonella, USI Lugano, Switzerland (ORCID 0000-0003-3088-0339)

ISSTA 2021: Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, July 2021, Pages 67–78
https://doi.org/10.1145/3460319.3464825

Published: 11 July 2021

ABSTRACT

Deep Learning (DL) solutions are increasingly adopted, but how to test them remains a major open research problem. New testing techniques have been proposed for DL systems and existing ones, including mutation testing, have been adapted to them. However, no approach has investigated the possibility of simulating the effects of real DL faults by means of mutation operators. We have defined 35 DL mutation operators relying on 3 empirical studies about real faults in DL systems. We followed a systematic process to extract the mutation operators from the existing fault taxonomies, with a formal phase of conflict resolution in case of disagreement. We have implemented 24 of these DL mutation operators in DeepCrime, the first source-level pre-training mutation tool based on real DL faults. We have assessed our mutation operators to understand their characteristics, in particular whether they produce interesting mutations, i.e., mutations that are killable but not trivial. Then, we have compared the sensitivity of our tool to changes in the quality of test data with that of DeepMutation++, an existing post-training DL mutation tool.
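
To make the idea of a pre-training mutation concrete, the following is a minimal sketch in plain Python and not DeepCrime's implementation. It assumes a toy one-dimensional nearest-centroid classifier in place of a DL model, a hypothetical `change_labels` operator that corrupts training labels before training, and a naive accuracy-drop threshold as the killing criterion. All names and the threshold value are illustrative assumptions.

```python
def change_labels(labels, source, target, fraction):
    """Hypothetical pre-training mutation: relabel the first `fraction`
    of `source`-class training examples as `target`."""
    indices = [i for i, y in enumerate(labels) if y == source]
    k = int(len(indices) * fraction)
    mutated = list(labels)
    for i in indices[:k]:
        mutated[i] = target
    return mutated


def train(xs, ys, num_classes=2):
    """Toy stand-in for training: one centroid (mean input) per class.
    Assumes every class has at least one training example."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for x, y in zip(xs, ys):
        sums[y] += x
        counts[y] += 1
    return [sums[c] / counts[c] for c in range(num_classes)]


def accuracy(centroids, xs, ys):
    """Fraction of inputs whose nearest centroid matches the label."""
    hits = 0
    for x, y in zip(xs, ys):
        pred = min(range(len(centroids)), key=lambda c: abs(x - centroids[c]))
        hits += int(pred == y)
    return hits / len(xs)


def is_killed(original, mutant, xs, ys, threshold=0.1):
    """Naive criterion (an assumption of this sketch): the mutant is killed
    if its accuracy on the test set drops by more than `threshold`."""
    return accuracy(original, xs, ys) - accuracy(mutant, xs, ys) > threshold


train_x = [0.0, 1.0, 2.0, 3.0, 7.0, 8.0, 9.0, 10.0]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]

# The mutation is applied to the training data, so the mutant must be retrained.
original = train(train_x, train_y)
mutant = train(train_x, change_labels(train_y, source=1, target=0, fraction=0.5))

# A test set with inputs near the decision boundary, and one without.
strong_x, strong_y = [0.5, 2.5, 4.0, 6.0, 7.5, 9.5], [0, 0, 0, 1, 1, 1]
weak_x, weak_y = [0.5, 9.5], [0, 1]

killed_strong = is_killed(original, mutant, strong_x, strong_y)
killed_weak = is_killed(original, mutant, weak_x, weak_y)
print(killed_strong, killed_weak)  # prints: True False
```

In this toy setting the same mutant is killed by the test set that probes the decision boundary and survives the weaker one, which is the kind of sensitivity to test data quality the abstract refers to. Real DL training is stochastic, so a single accuracy comparison like this one would not be reliable for an actual model.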

Index Terms

1. Software and its engineering
  1. Software creation and management
    1. Software verification and validation

Published in
ISSTA 2021: Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis
July 2021
685 pages
ISBN: 9781450384599
DOI: 10.1145/3460319
General Chair: Cristian Cadar (Imperial College London, UK)
Program Chair: Xiangyu Zhang (Purdue University, USA)
Copyright © 2021 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Badges
- Artifacts Available / v1.1
- Artifacts Evaluated & Reusable / v1.1
Author Tags
deep learning
mutation testing
real faults
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 58 of 213 submissions, 27%
Article Metrics
- Total Citations: 30
- Total Downloads: 1,341
- Downloads (Last 12 months): 465
- Downloads (Last 6 weeks): 68