ABSTRACT
Deep learning applications become increasingly popular in important domains such as self-driving systems and facial identity systems. Defective deep learning applications may lead to catastrophic consequences. Although recent research efforts were made on testing and debugging deep learning applications, the characteristics of deep learning defects have never been studied. To fill this gap, we studied deep learning applications built on top of TensorFlow and collected program bugs related to TensorFlow from StackOverflow QA pages and Github projects. We extracted information from QA pages, commit messages, pull request messages, and issue discussions to examine the root causes and symptoms of these bugs. We also studied the strategies deployed by TensorFlow users for bug detection and localization. These findings help researchers and TensorFlow users to gain a better understanding of coding defects in TensorFlow programs and point out a new direction for future research.
- Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. 2015. The Oracle Problem in Software Testing: A Survey. IEEE Trans. Software Eng. 41, 5 (2015), 507–525.Google ScholarDigital Library
- Gabriele Bavota, Mario Linares Vásquez, Carlos Eduardo Bernal-Cárdenas, Massimiliano Di Penta, Rocco Oliveto, and Denys Poshyvanyk. 2015. The Impact of API Change- and Fault-Proneness on the User Ratings of Android Apps. IEEE Trans. Software Eng. 41, 4 (2015), 384–407.Google ScholarDigital Library
- Junjie Chen, Yanwei Bai, Dan Hao, Yingfei Xiong, Hongyu Zhang, and Bing Xie. 2017. Learning to prioritize test programs for compiler testing. In Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017. 700–711. Google ScholarDigital Library
- Junjie Chen, Wenxiang Hu, Dan Hao, Yingfei Xiong, Hongyu Zhang, Lu Zhang, and Bing Xie. 2016. An empirical comparison of compiler testing techniques. In Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016. 180–190. Google ScholarDigital Library
- 2884878Google Scholar
- Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. CoRR abs/1512.01274 (2015). arXiv: 1512.01274 http://arxiv.org/abs/1512.01274Google Scholar
- Tse-Hsun Chen, Meiyappan Nagappan, Emad Shihab, and Ahmed E. Hassan. 2014. An empirical study of dormant bugs. In 11th Working Conference on Mining Software Repositories, MSR 2014, Proceedings, May 31 - June 1, 2014, Hyderabad, India. 82–91. Google ScholarDigital Library
- Muriel Daran and Pascale Thévenod-Fosse. 1996. Software Error Analysis: A Real Case Study Involving Real Faults and Mutations. In Proceedings of the 1996 International Symposium on Software Testing and Analysis, ISSTA 1996, San Diego, CA, USA, January 8-10, 1996. 158–171. Google ScholarDigital Library
- Danny Dig and Ralph E. Johnson. 2006. How do APIs evolve? A story of refactoring. Journal of Software Maintenance 18, 2 (2006), 83–107. Google ScholarDigital Library
- Martín Abadi et al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/ Software available from tensorflow.org.Google Scholar
- István Forgács and Antonia Bertolino. 2002. Preventing untestedness in data-flow based testing. Softw. Test., Verif. Reliab. 12, 1 (2002), 29–58. 1002/stvr.234Google ScholarCross Ref
- Marcus Gerhold and Mariëlle Stoelinga. 2018. Model-based testing of probabilistic systems. Formal Asp. Comput. 30, 1 (2018), 77–106. s00165-017-0440-4 Google ScholarCross Ref
- Ian J. Goodfellow, Yoshua Bengio, and Aaron C. Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org/ Google ScholarDigital Library
- Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patanaanake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria. 2014. What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems. In Proceedings of the ACM Symposium on Cloud Computing, Seattle, WA, USA, November 03 - 05, 2014. Google ScholarDigital Library
- 7:1–7:14.Google Scholar
- David J. Hand. 2007. Principles of Data Mining. Drug Safety 30, 7 (01 Jul 2007), 621–622.Google Scholar
- Mauricio A. Hernández and Salvatore J. Stolfo. 1998. Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Data Min. Knowl. Discov. 2, 1 (1998), 9–37. Google ScholarDigital Library
- Robert M. Hierons. 2006. Avoiding coincidental correctness in boundary value analysis. ACM Trans. Softw. Eng. Methodol. 15, 3 (2006), 227–241. g/10.1145/1151695.1151696 Google ScholarDigital Library
- Matteo Interlandi, Kshitij Shah, Sai Deep Tetali, Muhammad Ali Gulzar, Seunghyun Yoo, Miryung Kim, Todd D. Millstein, and Tyson Condie. 2015. Titian: Data Provenance Support in Spark. PVLDB 9, 3 (2015), 216–227. http: //www.vldb.org/pvldb/vol9/p216-interlandi.pdf Google ScholarDigital Library
- Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093 (2014).Google Scholar
- Guoliang Jin, Linhai Song, Xiaoming Shi, Joel Scherpelz, and Shan Lu. 2012. Understanding and detecting real-world performance bugs. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’12, Beijing, China - June 11 - 16, 2012. 77–88. Google ScholarDigital Library
- Raula Gaikovina Kula, Ali Ouni, Daniel M. Germán, and Katsuro Inoue. 2018. An empirical study on the impact of refactoring activities on evolving clientused APIs. Information & Software Technology 93 (2018), 186–199. Google ScholarDigital Library
- Peter Alan Lee and Thomas Anderson. 1990. Fault Tolerance. Springer Vienna, Vienna, 51–77.Google Scholar
- Jun Li, Yingfei Xiong, Xuanzhe Liu, and Lu Zhang. 2013. How Does Web Service API Evolution Affect Clients?. In 2013 IEEE 20th International Conference on Web Services, Santa Clara, CA, USA, June 28 - July 3, 2013. 300–307. Google ScholarDigital Library
- Shiqing Ma, Yousra Aafer, Zhaogui Xu, Wen-Chuan Lee, Juan Zhai, Yingqi Liu, and Xiangyu Zhang. 2017. LAMP: data provenance for graph based machine learning algorithms through derivative computation. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, Paderborn, Germany, September 4-8, 2017. 786–797. Google ScholarDigital Library
- Jonathan I. Maletic and Andrian Marcus. 2000. Data Cleansing: Beyond Integrity Analysis. In Fifth Conference on Information Quality (IQ 2000). 200–209.Google Scholar
- Brian Marick. 1991. The Weak Mutation Hypothesis. In Proceedings of the Symposium on Testing, Analysis, and Verification, TAV 1991, Victoria, British Columbia, Canada, October 8-10, 1991. 190–199. Google ScholarDigital Library
- Seyed Mehdi Nasehi, Jonathan Sillito, Frank Maurer, and Chris Burns. 2012. What makes a good code example?: A study of programming Q&A in StackOverflow. In 28th IEEE International Conference on Software Maintenance, ICSM 2012, Trento, Italy, September 23-28, 2012. 25–34. Google ScholarDigital Library
- Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017).Google Scholar
- Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2017. DeepXplore: Automated Whitebox Testing of Deep Learning Systems. In Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China, October 28-31, 2017. Google ScholarDigital Library
- 1–18.Google Scholar
- Erhard Rahm and Hong Hai Do. 2000. Data Cleaning: Problems and Current Approaches. IEEE Data Eng. Bull. 23, 4 (2000), 3–13. http://sites.computer.org/d ebull/A00DEC-CD.pdfGoogle Scholar
- Manos Renieris and Steven P. Reiss. 2003. Fault Localization With Nearest Neighbor Queries. In 18th IEEE International Conference on Automated Software Engineering (ASE 2003), 6-10 October 2003, Montreal, Canada. 30–39. Google ScholarDigital Library
- Debra J. Richardson and Margaret C. Thompson. 1993. An Analysis of Test Data Selection Criteria Using the RELAY Model of Fault Detection. IEEE Trans. Software Eng. 19, 6 (1993), 533–553. Google ScholarDigital Library
- Carolyn B. Seaman, Forrest Shull, Myrna Regardie, Denis Elbert, Raimund L. Feldmann, Yuepu Guo, and Sally Godfrey. 2008. Defect categorization: making use of a decade of widely varying historical data. In Proceedings of the Second International Symposium on Empirical Software Engineering and Measurement, ESEM 2008, October 9-10, 2008, Kaiserslautern, Germany. 149–157. g/10.1145/1414004.1414030 Google ScholarDigital Library
- Victor S. Sheng, Foster Provost, and Panagiotis G. Ipeirotis. 2008. Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’08). ACM, New York, NY, USA, 614–622. Google ScholarDigital Library
- Siwakorn Srisakaokul, Zhengkai Wu, Angello Astorga, Oreoluwa Alebiosu, and Tao Xie. 2018. Multiple-Implementation Testing of Supervised Learning Software.. In Proceedings of the AAAI-18 Workshop on Engineering Dependable and Secure Machine Learning Systems (EDSMLS 2018), co-located with AAAI 2018, New Orleans, LA, Feburary 2018.Google Scholar
- Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. CoRR abs/1312.6199 (2013). arXiv: 1312.6199 http://arxiv.org/abs/1312.6199Google Scholar
- Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints abs/1605.02688 (May 2016). http://arxiv.org/abs/1605.02688Google Scholar
- Ferdian Thung, Shaowei Wang, David Lo, and Lingxiao Jiang. 2012. An Empirical Study of Bugs in Machine Learning Systems. In 23rd IEEE International Symposium on Software Reliability Engineering, ISSRE 2012, Dallas, TX, USA, November 27-30, 2012. 271–280. Google ScholarDigital Library
- Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2017. DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars. CoRR abs/1708.08559 (2017). arXiv: 1708.08559 http://arxiv.org/abs/1708.08559Google Scholar
- Tian Xiao, Jiaxing Zhang, Hucheng Zhou, Zhenyu Guo, Sean McDirmid, Wei Lin, Wenguang Chen, and Lidong Zhou. 2014. Nondeterminism in MapReduce considered harmful? an empirical study on non-commutative aggregators in MapReduce programs. In 36th International Conference on Software Engineering, ICSE ’14, Companion Proceedings, Hyderabad, India, May 31 - June 07, 2014. 44–53. Google ScholarDigital Library
- Xiaoyuan Xie, Joshua W. K. Ho, Christian Murphy, Gail E. Kaiser, Baowen Xu, and Tsong Yueh Chen. 2011. Testing and validating machine learning classifiers by metamorphic testing. Journal of Systems and Software 84, 4 (2011), 544–558. Google ScholarDigital Library
- Baowen Xu, Ju Qian, Xiaofang Zhang, Zhongqiang Wu, and Lin Chen. 2005. A brief survey of program slicing. ACM SIGSOFT Software Engineering Notes 30, 2 (2005), 1–36. Google ScholarDigital Library
- Shahed Zaman, Bram Adams, and Ahmed E. Hassan. 2011. Security versus performance bugs: a case study on Firefox. In Proceedings of the 8th International Working Conference on Mining Software Repositories, MSR 2011 (Co-located with ICSE), Waikiki, Honolulu, HI, USA, May 21-28, 2011, Proceedings. 93–102. https: ISSTA’18, July 16–21, 2018, Amsterdam, Netherlands Y. Zhang, Y. Chen, S. Cheung, Y. Xiong, L. Zhang Google ScholarDigital Library
Index Terms
- An empirical study on TensorFlow program bugs
Recommendations
An Empirical Study on Bugs Inside TensorFlow
Database Systems for Advanced ApplicationsAbstractIn recent years, deep learning has become a hot research topic. Although it achieves incredible positive results in some scenarios, bugs inside deep learning software can introduce disastrous consequences, especially when the software is used in ...
Silent bugs in deep learning frameworks: an empirical study of Keras and TensorFlow
AbstractDeep Learning (DL) frameworks are now widely used, simplifying the creation of complex models as well as their integration into various applications even among non-DL experts. However, like any other programs, they are prone to bugs. This paper ...
An empirical study of dormant bugs
MSR 2014: Proceedings of the 11th Working Conference on Mining Software RepositoriesOver the past decade, several research efforts have studied the quality of software systems by looking at post-release bugs. However, these studies do not account for bugs that remain dormant (i.e., introduced in a version of the software system, but ...
Comments