research-article

Better test cases for better automated program repair

Authors:
Jinqiu Yang

University of Waterloo, Canada

University of Waterloo, Canada
View Profile

,
Alexey Zhikhartsev

University of Waterloo, Canada

University of Waterloo, Canada
View Profile

,
Yuefei Liu

University of Waterloo, Canada

University of Waterloo, Canada
View Profile

,
Lin Tan

University of Waterloo, Canada

University of Waterloo, Canada
View Profile

ESEC/FSE 2017: Proceedings of the 2017 11th Joint Meeting on Foundations of Software EngineeringAugust 2017Pages 831–841https://doi.org/10.1145/3106237.3106274

Published:21 August 2017Publication History

ESEC/FSE 2017: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering

Pages 831–841

ABSTRACT

Automated generate-and-validate program repair techniques (G&V techniques) suffer from generating many overfitted patches due to in-capabilities of test cases. Such overfitted patches are incor- rect patches, which only make all given test cases pass, but fail to fix the bugs. In this work, we propose an overfitted patch detec- tion framework named Opad (Overfitted PAtch Detection). Opad helps improve G&V techniques by enhancing existing test cases to filter out overfitted patches. To enhance test cases, Opad uses fuzz testing to generate new test cases, and employs two test or- acles (crash and memory-safety) to enhance validity checking of automatically-generated patches. Opad also uses a novel metric (named O-measure) for deciding whether automatically-generated patches overfit. Evaluated on 45 bugs from 7 large systems (the same benchmark used by GenProg and SPR), Opad filters out 75.2% (321/427) over- fitted patches generated by GenProg/AE, Kali, and SPR. In addition, Opad guides SPR to generate correct patches for one more bug (the original SPR generates correct patches for 11 bugs). Our analysis also shows that up to 40% of such automatically-generated test cases may further improve G&V techniques if empowered with better test oracles (in addition to crash and memory-safety oracles employed by Opad).

References

2016. American Fuzzy Lop. (2016). http://lcamtuf.coredump.cx/afl/. 2016.Google Scholar
gcov—a Test Coverage Program. (2016).Google Scholar
https://gcc.gnu.org/onlinedocs/gcc/Gcov.html. 2016. Valgrind. (2016). http://valgrind.org/.Google Scholar
Paul Ammann and Jeff Offutt. 2008.Google Scholar
Introduction to Software Testing (1 ed.). Cambridge University Press, New York, NY, USA.Google Scholar
Earl T. Barr, Yuriy Brun, Premkumar Devanbu, Mark Harman, and Federica Sarro. 2014. The Plastic Surgery Hypothesis. In Proceedings of the 22Nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014). ACM, New York, NY, USA, 306–317. Google ScholarDigital Library
Cristian Cadar, Daniel Dunbar, and Dawson Engler. 2008. KLEE: Unassisted and Automatic Generation of High-coverage Tests for Complex Systems Programs. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI 2008). USENIX Association, Berkeley, CA, USA, 209–224. Google ScholarDigital Library
Christoph Csallner and Yannis Smaragdakis. 2004.Google Scholar
JCrasher: An automatic robustness tester for Java. Software—Practice & Experience 34, 11 (Sept. 2004), 1025–1050. Google ScholarDigital Library
Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering (FSE 2011). ACM, 416–419. Google ScholarDigital Library
Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: Automatic Test Suite Generation for Object-oriented Software. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE 2011). ACM, New York, NY, USA, 416–419. Google ScholarDigital Library
Juan Pablo Galeotti, Gordon Fraser, and Andrea Arcuri. 2014.Google Scholar
Extending a Search-Based Test Generator with Adaptive Dynamic Symbolic Execution (Tool paper). In Proceedings of the 2014 International Symposium on Software Testing and Analysis (ISSTA 2014). ACM, ACM, New York, NY, USA, 421–424. Google ScholarDigital Library
S Hocevar. 2011. zzuf—multi-purpose fuzzer. (2011).Google Scholar
Laura Inozemtseva and Reid Holmes. 2014. Coverage is not strongly correlated with test suite effectiveness. In Proceedings of the 36th International Conference on Software Engineering (ICSE 2014). ACM, 435–445. Google ScholarDigital Library
René Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, and Gordon Fraser. 2014. Are Mutants a Valid Substitute for Real Faults in Software Testing?. In Proceedings of the 22Nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014). ACM, New York, NY, USA, 654–665. Google ScholarDigital Library
Dongsun Kim, Jaechang Nam, Jaewoo Song, and Sunghun Kim. 2013. Automatic Patch Generation Learned from Human-written Patches. In Proceedings of the 2013 International Conference on Software Engineering (ICSE 2013). IEEE Press, Piscataway, NJ, USA, 802–811. Google ScholarDigital Library
X. B. D. Le, D. Lo, and C. L. Goues. 2016.Google Scholar
History Driven Program Repair. In 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER 2016), Vol. 1. 213–224.Google Scholar
Claire Le Goues, Michael Dewey-Vogt, Stephanie Forrest, and Westley Weimer. 2012. A Systematic Study of Automated Program Repair: Fixing 55 out of 105 Bugs for $8 Each. In Proceedings of the 2012 International Conference on Software Engineering (ICSE 2012). IEEE Press, Piscataway, NJ, USA, 3–13. Google ScholarDigital Library
Xinyuan Liu, Muhan Zeng, Yingfei Xiong, Lu Zhang, and Gang Huang. 2017.Google Scholar
Identifying Patch Correctness in Test-Based Automatic Program Repair. CoRR abs/1706.09120 (2017).Google Scholar
Fan Long and Martin Rinard. 2015. Staged program repair with condition synthesis. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE (FSE 2015). 166–178. Google ScholarDigital Library
Fan Long and Martin Rinard. 2016. An analysis of the search spaces for generate and validate patch generation systems. In Proceedings of the 38th International Conference on Software Engineering (ICSE 2016). ACM, 702–713. Google ScholarDigital Library
Fan Long and Martin Rinard. 2016. Automatic Patch Generation by Learning Correct Code. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL 2016). ACM, New York, NY, USA, 298–312. Google ScholarDigital Library
Matias Martinez, Westley Weimer, and Martin Monperrus. 2014. Do the Fix Ingredients Already Exist? An Empirical Inquiry into the Redundancy Assumptions of Program Repair Approaches. In Companion Proceedings of the 36th International Conference on Software Engineering (ICSE Companion 2014). ACM, New York, NY, USA, 492–495. Google ScholarDigital Library
Richard McNally, Ken Yiu, Duncan Grove, and Damien Gerhardy. 2012.Google Scholar
Fuzzing: the state of the art. Technical Report. DTIC Document.Google Scholar
Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury. 2015. Directfix: Looking for simple program repairs. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering (ICSE 2015), Vol. 1. IEEE, 448–458. Google ScholarDigital Library
Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury. 2016. Angelix: Scalable multiline program patch synthesis via symbolic analysis. In Proceedings of the 38th International Conference on Software Engineering. ACM, 691–701. Google ScholarDigital Library
Barton P Miller, Louis Fredriksen, and Bryan So. 1990. An empirical study of the reliability of UNIX utilities. Commun. ACM 33, 12 (1990), 32–44. Google ScholarDigital Library
Carlos Pacheco and Michael D Ernst. 2007. Randoop: feedback-directed random testing for Java. In Companion to the 22nd ACM SIGPLAN conference on Objectoriented programming systems and applications companion. ACM, 815–816. Google ScholarDigital Library
Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and Thomas Ball. 2007.Google Scholar
Feedback-Directed Random Test Generation. In Proceedings of the 29th International Conference on Software Engineering (ICSE 2007). IEEE Computer Society, Washington, DC, USA, 75–84. Google ScholarDigital Library
Yuhua Qi, Xiaoguang Mao, Yan Lei, Ziying Dai, and Chengsong Wang. 2014. The strength of random search on automated program repair. In Proceedings of the 36th International Conference on Software Engineering (ICSE 2014). 254–265. Google ScholarDigital Library
Zichao Qi, Fan Long, Sara Anchor, and Martin Rinard. An Analysis of Patch Plausibility and Correctness for Generate-And-Validate Patch Generation Systems. In Proceedings of the 2015 International Symposium on Software Testing and Analysis (ISSTA 2015). 24–36. Google ScholarDigital Library
Koushik Sen, Darko Marinov, and Gul Agha. 2005.Google Scholar
CUTE: A Concolic Unit Testing Engine for C. In Proceedings of the 10th European Software Engineering Conference Held Jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE 2013). ACM, New York, NY, USA, 263–272. Google ScholarDigital Library
Tan Shin Weiand, H Yoshida, Prasad M, and A. Roychoudhury. Anti-patterns in Search-based Program Repair. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016). 727– 738. Google ScholarDigital Library
Edward K. Smith, Earl T. Barr, Claire Le Goues, and Yuriy Brun. 2015.Google Scholar
Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015). ACM, New York, NY, USA, 532–543. Google ScholarDigital Library
Michael Sutton. 2012. FileFuzz. (2012).Google Scholar
Shin Hwei Tan and Abhik Roychoudhury. 2015.Google Scholar
relifix: Automated Repair of Software Regressions. In Proceedings of the 2015 Internaltional Conference on Software Engineering (ICSE 2015). 471–482. Google ScholarDigital Library
Westley Weimer, Zachary P Fry, and Stephen Forrest. 2013. Leveraging program equivalence for adaptive program repair: Models and first results. In Automated Software Engineering (ASE), 2013 IEEE/ACM 28th International Conference on. IEEE, 356–366. Google ScholarDigital Library
Tao Xie. 2006.Google Scholar
Augmenting Automatically Generated Unit-test Suites with Regression Oracle Checking. In Proceedings of the 20th European Conference on Object-Oriented Programming (ECOOP 2006). Springer-Verlag, Berlin, Heidelberg, 380–403. Google ScholarDigital Library
Qi Xin and Steven P. Reiss. 2017.Google ScholarDigital Library
Identifying Test-Suite-Overfitted Patches through Test Case Generation. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA 2017). Google ScholarDigital Library
J. Xuan, M. Martinez, F. DeMarco, M. Clement, S. Lamelas Marcote, T. Durieux, D. Le Berre, and M. Monperrus. 2016. Nopol: Automatic Repair of Conditional Statement Bugs in Java Programs. IEEE Transactions on Software Engineering PP, 99 (2016), 1–1. Google ScholarDigital Library
Z. Yu, M. Martinez, B. Danglot, T. Durieux, and M. Monperrus. 2017. Test Case Generation for Program Repair: A Study of Feasibility and Effectiveness. ArXiv e-prints (March 2017). arXiv:cs.SE/1703.00198Google Scholar
Yucheng Zhang and Ali Mesbah. 2015.Google Scholar
Assertions Are Strongly Correlated with Test Suite Effectiveness. In Proceedings of the joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015). ACM, 214–224. Abstract 1 Introduction 2 Background on Automated G&V Program Repair 3 Approach 3.1 Generating New Test Cases Using Fuzz Testing 3.2 Generating Memory-Safety Oracles 3.3 Measuring the Overfitness of a Patch Using an Overfitness Metric (O-measure) 3.4 An Optimized Setting of Opad 4 Evaluation 5 Threats to Validity 6 Related Work 7 Conclusions References Google ScholarDigital Library

Index Terms

Better test cases for better automated program repair
1. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        Software testing and debugging

Recommendations

Identifying patch correctness in test-based program repair
ICSE '18: Proceedings of the 40th International Conference on Software Engineering

Test-based automatic program repair has attracted a lot of attention in recent years. However, the test suites in practice are often too weak to guarantee correctness and existing approaches often generate a large number of incorrect patches.

To reduce ...
Read More
Context-aware patch generation for better automated program repair
ICSE '18: Proceedings of the 40th International Conference on Software Engineering

The effectiveness of search-based automated program repair is limited in the number of correct patches that can be successfully generated. There are two causes of such limitation. First, the search space does not contain the correct patch. Second, the ...
Read More
Poracle: Testing Patches under Preservation Conditions to Combat the Overfitting Problem of Program Repair
To date, the users of test-driven program repair tools suffer from the overfitting problem; a generated patch may pass all available tests without being correct. In the existing work, users are treated as merely passive consumers of the tests. However, ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
ESEC/FSE 2017: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering
August 2017
1073 pages
ISBN:9781450351058
DOI:10.1145/3106237
General Chairs:
Eric Bodden
Paderborn University, Germany / Fraunhofer IEM, Germany
,
Wilhelm Schäfer
Paderborn University, Germany
,
Program Chairs:
Arie van Deursen
Delft University of Technology, Netherlands
,
Andrea Zisman
Open University, UK
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 21 August 2017
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Overfitting in automated program repair
Patch validation
Testing
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate112of543submissions,21%
Upcoming Conference
FSE '24

Sponsor:

sigsoft

32nd ACM International Conference on the Foundations of Software Engineering

July 15 - 19, 2024

Ipojuca (Pernambuco) , Brazil
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 76
  Total Citations
  View Citations
- 759
  Total Downloads
- Downloads (Last 12 months)102
- Downloads (Last 6 weeks)11
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Better test cases for better automated program repair

ESEC/FSE 2017: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering

ABSTRACT

References

Cited By

Index Terms

Recommendations

Identifying patch correctness in test-based program repair

Context-aware patch generation for better automated program repair

Poracle: Testing Patches under Preservation Conditions to Combat the Overfitting Problem of Program Repair