DOI: 10.1145/3213846.3213866
Research article

An empirical study on TensorFlow program bugs

Published: 12 July 2018

ABSTRACT

Deep learning applications are becoming increasingly popular in important domains such as self-driving systems and facial identity systems, where defects may lead to catastrophic consequences. Although recent research efforts have targeted the testing and debugging of deep learning applications, the characteristics of deep learning defects have not been systematically studied. To fill this gap, we studied deep learning applications built on top of TensorFlow and collected TensorFlow-related program bugs from StackOverflow QA pages and GitHub projects. We extracted information from QA pages, commit messages, pull request messages, and issue discussions to examine the root causes and symptoms of these bugs. We also studied the strategies TensorFlow users deploy for bug detection and localization. These findings help researchers and TensorFlow users gain a better understanding of coding defects in TensorFlow programs and point out new directions for future research.


Published in:
ISSTA 2018: Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis
July 2018, 379 pages
ISBN: 9781450356992
DOI: 10.1145/3213846
General Chair: Frank Tip
Program Chair: Eric Bodden
Publisher: Association for Computing Machinery, New York, NY, United States
Copyright © 2018 ACM

Overall acceptance rate: 58 of 213 submissions (27%)
