ABSTRACT
Deep Learning (DL) solutions are increasingly adopted, but how to test them remains a major open research problem. Both existing and new testing techniques, including mutation testing, have been proposed for or adapted to DL systems. However, no approach has investigated the possibility of simulating the effects of real DL faults by means of mutation operators. We have defined 35 DL mutation operators based on 3 empirical studies of real faults in DL systems. We followed a systematic process to extract the mutation operators from the existing fault taxonomies, with a formal conflict-resolution phase in case of disagreement. We have implemented 24 of these DL mutation operators in DeepCrime, the first source-level pre-training mutation tool based on real DL faults. We have assessed our mutation operators to understand their characteristics, in particular whether they produce interesting mutations, i.e., mutations that are killable but not trivial. We have then compared the sensitivity of our tool to changes in the quality of test data with that of DeepMutation++, an existing post-training DL mutation tool.
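To make the idea of a pre-training mutation operator more concrete, the following is a minimal sketch of what such an operator might look like. It assumes a hypothetical operator that injects a "wrong training labels" fault by changing a given fraction of labels before the model is trained. The function name `mutate_labels` and its parameters are illustrative assumptions, not DeepCrime's actual API or implementation.

```python
import random
from typing import List, Sequence


def mutate_labels(
    labels: Sequence[int],
    fraction: float,
    num_classes: int,
    seed: int = 0,
) -> List[int]:
    """Return a copy of `labels` in which a fraction of entries is wrong.

    Hypothetical pre-training mutation operator: the fault is injected
    into the training data, so the model must be retrained on the
    mutated labels for the mutation to take effect.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if num_classes < 2:
        raise ValueError("need at least two classes to produce a wrong label")

    rng = random.Random(seed)  # seeded so the mutant is reproducible
    mutated = list(labels)     # never modify the caller's data
    n_to_change = int(round(fraction * len(mutated)))

    # Pick distinct positions and replace each label with a different class.
    for idx in rng.sample(range(len(mutated)), n_to_change):
        wrong_classes = [c for c in range(num_classes) if c != mutated[idx]]
        mutated[idx] = rng.choice(wrong_classes)
    return mutated


if __name__ == "__main__":
    original = [0, 1, 2, 3] * 5
    mutant = mutate_labels(original, fraction=0.25, num_classes=4, seed=42)
    changed = sum(a != b for a, b in zip(original, mutant))
    print(f"{changed} of {len(original)} labels changed")
```

In this sketch the `fraction` parameter controls how severe the injected fault is. Under that assumption, a very small fraction would likely yield a mutant that is hard to kill, while a very large one would likely yield a trivial mutant, which is the trade-off the abstract refers to when it speaks of interesting mutations.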