An empirical study on TensorFlow program bugs

Authors:
Yuhao Zhang (Peking University, China)
Yifan Chen (Peking University, China)
Shing-Chi Cheung (Hong Kong University of Science and Technology, China)
Yingfei Xiong (Peking University, China)
Lu Zhang (Peking University, China)

ISSTA 2018: Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, July 2018, Pages 129–140. https://doi.org/10.1145/3213846.3213866

Published: 12 July 2018

ABSTRACT

Deep learning applications are becoming increasingly popular in important domains such as self-driving systems and facial identity systems. Defective deep learning applications may lead to catastrophic consequences. Although recent research efforts have been made on testing and debugging deep learning applications, the characteristics of deep learning defects have never been studied. To fill this gap, we studied deep learning applications built on top of TensorFlow and collected program bugs related to TensorFlow from StackOverflow QA pages and GitHub projects. We extracted information from QA pages, commit messages, pull request messages, and issue discussions to examine the root causes and symptoms of these bugs. We also studied the strategies deployed by TensorFlow users for bug detection and localization. These findings help researchers and TensorFlow users gain a better understanding of coding defects in TensorFlow programs and point out a new direction for future research.

Index Terms

An empirical study on TensorFlow program bugs
1. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis

Published in
ISSTA 2018: Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis
July 2018
379 pages
ISBN:9781450356992
DOI:10.1145/3213846
General Chair: Frank Tip (Northeastern University, USA)
Program Chair: Eric Bodden (University of Paderborn, Germany / Fraunhofer IEM, Germany)
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Deep Learning
Empirical Study
TensorFlow Program Bug
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 58 of 213 submissions, 27%