Exposing numerical bugs in deep learning via gradient back-propagation

Authors:
Ming Yan, Tianjin University, China
Junjie Chen, Tianjin University, China (ORCID: 0000-0003-3056-9962)
Xiangyu Zhang, Purdue University, USA
Lin Tan, Purdue University, USA
Gan Wang, Tianjin University, China
Zan Wang, Tianjin University, China

ESEC/FSE 2021: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
August 2021
Pages 627–638
https://doi.org/10.1145/3468264.3468612

Published: 18 August 2021

ABSTRACT

Numerical computation dominates deep learning (DL) programs. Consequently, numerical bugs are one of the most prominent kinds of defects in DL programs. Numerical bugs can lead to exceptional values such as NaN (Not-a-Number) and INF (infinity), which can propagate and eventually cause crashes or invalid outputs. They occur when special inputs cause invalid parameter values at internal mathematical operations such as log(). In this paper, we propose the first dynamic technique, called GRIST, which automatically generates a small input that can expose numerical bugs in DL programs. GRIST piggy-backs on the built-in gradient computation functionalities of DL infrastructures. Our evaluation on 63 real-world DL programs shows that GRIST detects 78 bugs, including 56 previously unknown bugs. After we submitted them to the corresponding issue repositories, eight bugs were confirmed and three were fixed. Moreover, GRIST exposes numerical bugs 8.79X faster than running the original programs with their provided inputs. The state-of-the-art technique DEBAR, which is static, produces 12 false positives and misses 31 true bugs (30 of which GRIST finds), while GRIST misses only one known bug in those programs and produces no false positives. The results demonstrate the effectiveness of GRIST.
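To make the idea in the abstract concrete, the following is a minimal sketch of gradient-guided input search on a toy program. It is not GRIST's actual algorithm or API. The weights `W`, the functions `operand`, `operand_grad`, and `search_bug_input`, and the sign-based update rule are all assumptions chosen for illustration, and the gradient is derived by hand where a DL framework would use its built-in back-propagation. The toy program computes log(1 - sigmoid(W . x)), so the search tries to drive the operand of log() to zero.

```python
import math

W = [0.5, -0.25]  # hypothetical fixed weights of a toy "model"


def sigmoid(z):
    # Numerically stable sigmoid that avoids overflow in exp().
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def operand(x):
    """Value passed to log() in the toy program: u(x) = 1 - sigmoid(W . x)."""
    z = sum(w * xi for w, xi in zip(W, x))
    return 1.0 - sigmoid(z)


def operand_grad(x):
    """Hand-derived du/dx_i = -p * (1 - p) * W_i (stands in for back-propagation)."""
    z = sum(w * xi for w, xi in zip(W, x))
    p = sigmoid(z)
    return [-p * (1.0 - p) * w for w in W]


def sign(v):
    return (v > 0) - (v < 0)


def search_bug_input(x0, step=1.0, max_iters=200):
    """Move x against the gradient sign until the log() operand reaches 0."""
    x = list(x0)
    for i in range(max_iters):
        if operand(x) <= 0.0:
            return x, i  # this input makes log() receive an invalid value
        g = operand_grad(x)
        x = [xi - step * sign(gi) for xi, gi in zip(x, g)]
    return None, max_iters  # no bug-triggering input found within the budget


x_bad, iters = search_bug_input([0.1, 0.2])
```

With the starting input `[0.1, 0.2]` the operand is positive and `math.log` succeeds. The search returns an input for which the sigmoid rounds to exactly 1.0 in double precision, so the operand becomes 0.0 and `math.log(operand(x_bad))` raises `ValueError`. A sign-based step is used here only because the raw gradient shrinks as the sigmoid saturates.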


Index Terms

1. Computing methodologies
  1. Machine learning
2. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        1. Software testing and debugging

Published in
ESEC/FSE 2021: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
August 2021
1690 pages
ISBN: 9781450385626
DOI: 10.1145/3468264
General Chairs: Diomidis Spinellis (Athens University of Economics and Business, Greece), Georgios Gousios (Facebook, Netherlands / Delft University of Technology, Netherlands)
Program Chairs: Marsha Chechik (University of Toronto, Canada), Massimiliano Di Penta (University of Sannio, Italy)
Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Deep Learning Testing
Gradient Back-propagation
Numerical Bug
Search-based Software Testing
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 112 of 543 submissions, 21%