ABSTRACT
Automated generate-and-validate program repair techniques (G&V techniques) suffer from generating many overfitted patches due to in-capabilities of test cases. Such overfitted patches are incor- rect patches, which only make all given test cases pass, but fail to fix the bugs. In this work, we propose an overfitted patch detec- tion framework named Opad (Overfitted PAtch Detection). Opad helps improve G&V techniques by enhancing existing test cases to filter out overfitted patches. To enhance test cases, Opad uses fuzz testing to generate new test cases, and employs two test or- acles (crash and memory-safety) to enhance validity checking of automatically-generated patches. Opad also uses a novel metric (named O-measure) for deciding whether automatically-generated patches overfit. Evaluated on 45 bugs from 7 large systems (the same benchmark used by GenProg and SPR), Opad filters out 75.2% (321/427) over- fitted patches generated by GenProg/AE, Kali, and SPR. In addition, Opad guides SPR to generate correct patches for one more bug (the original SPR generates correct patches for 11 bugs). Our analysis also shows that up to 40% of such automatically-generated test cases may further improve G&V techniques if empowered with better test oracles (in addition to crash and memory-safety oracles employed by Opad).
- 2016. American Fuzzy Lop. (2016). http://lcamtuf.coredump.cx/afl/. 2016.Google Scholar
- gcov—a Test Coverage Program. (2016).Google Scholar
- https://gcc.gnu.org/onlinedocs/gcc/Gcov.html. 2016. Valgrind. (2016). http://valgrind.org/.Google Scholar
- Paul Ammann and Jeff Offutt. 2008.Google Scholar
- Introduction to Software Testing (1 ed.). Cambridge University Press, New York, NY, USA.Google Scholar
- Earl T. Barr, Yuriy Brun, Premkumar Devanbu, Mark Harman, and Federica Sarro. 2014. The Plastic Surgery Hypothesis. In Proceedings of the 22Nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014). ACM, New York, NY, USA, 306–317. Google ScholarDigital Library
- Cristian Cadar, Daniel Dunbar, and Dawson Engler. 2008. KLEE: Unassisted and Automatic Generation of High-coverage Tests for Complex Systems Programs. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI 2008). USENIX Association, Berkeley, CA, USA, 209–224. Google ScholarDigital Library
- Christoph Csallner and Yannis Smaragdakis. 2004.Google Scholar
- JCrasher: An automatic robustness tester for Java. Software—Practice & Experience 34, 11 (Sept. 2004), 1025–1050. Google ScholarDigital Library
- Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering (FSE 2011). ACM, 416–419. Google ScholarDigital Library
- Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: Automatic Test Suite Generation for Object-oriented Software. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE 2011). ACM, New York, NY, USA, 416–419. Google ScholarDigital Library
- Juan Pablo Galeotti, Gordon Fraser, and Andrea Arcuri. 2014.Google Scholar
- Extending a Search-Based Test Generator with Adaptive Dynamic Symbolic Execution (Tool paper). In Proceedings of the 2014 International Symposium on Software Testing and Analysis (ISSTA 2014). ACM, ACM, New York, NY, USA, 421–424. Google ScholarDigital Library
- S Hocevar. 2011. zzuf—multi-purpose fuzzer. (2011).Google Scholar
- Laura Inozemtseva and Reid Holmes. 2014. Coverage is not strongly correlated with test suite effectiveness. In Proceedings of the 36th International Conference on Software Engineering (ICSE 2014). ACM, 435–445. Google ScholarDigital Library
- René Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, and Gordon Fraser. 2014. Are Mutants a Valid Substitute for Real Faults in Software Testing?. In Proceedings of the 22Nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014). ACM, New York, NY, USA, 654–665. Google ScholarDigital Library
- Dongsun Kim, Jaechang Nam, Jaewoo Song, and Sunghun Kim. 2013. Automatic Patch Generation Learned from Human-written Patches. In Proceedings of the 2013 International Conference on Software Engineering (ICSE 2013). IEEE Press, Piscataway, NJ, USA, 802–811. Google ScholarDigital Library
- X. B. D. Le, D. Lo, and C. L. Goues. 2016.Google Scholar
- History Driven Program Repair. In 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER 2016), Vol. 1. 213–224.Google Scholar
- Claire Le Goues, Michael Dewey-Vogt, Stephanie Forrest, and Westley Weimer. 2012. A Systematic Study of Automated Program Repair: Fixing 55 out of 105 Bugs for $8 Each. In Proceedings of the 2012 International Conference on Software Engineering (ICSE 2012). IEEE Press, Piscataway, NJ, USA, 3–13. Google ScholarDigital Library
- Xinyuan Liu, Muhan Zeng, Yingfei Xiong, Lu Zhang, and Gang Huang. 2017.Google Scholar
- Identifying Patch Correctness in Test-Based Automatic Program Repair. CoRR abs/1706.09120 (2017).Google Scholar
- Fan Long and Martin Rinard. 2015. Staged program repair with condition synthesis. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE (FSE 2015). 166–178. Google ScholarDigital Library
- Fan Long and Martin Rinard. 2016. An analysis of the search spaces for generate and validate patch generation systems. In Proceedings of the 38th International Conference on Software Engineering (ICSE 2016). ACM, 702–713. Google ScholarDigital Library
- Fan Long and Martin Rinard. 2016. Automatic Patch Generation by Learning Correct Code. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL 2016). ACM, New York, NY, USA, 298–312. Google ScholarDigital Library
- Matias Martinez, Westley Weimer, and Martin Monperrus. 2014. Do the Fix Ingredients Already Exist? An Empirical Inquiry into the Redundancy Assumptions of Program Repair Approaches. In Companion Proceedings of the 36th International Conference on Software Engineering (ICSE Companion 2014). ACM, New York, NY, USA, 492–495. Google ScholarDigital Library
- Richard McNally, Ken Yiu, Duncan Grove, and Damien Gerhardy. 2012.Google Scholar
- Fuzzing: the state of the art. Technical Report. DTIC Document.Google Scholar
- Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury. 2015. Directfix: Looking for simple program repairs. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering (ICSE 2015), Vol. 1. IEEE, 448–458. Google ScholarDigital Library
- Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury. 2016. Angelix: Scalable multiline program patch synthesis via symbolic analysis. In Proceedings of the 38th International Conference on Software Engineering. ACM, 691–701. Google ScholarDigital Library
- Barton P Miller, Louis Fredriksen, and Bryan So. 1990. An empirical study of the reliability of UNIX utilities. Commun. ACM 33, 12 (1990), 32–44. Google ScholarDigital Library
- Carlos Pacheco and Michael D Ernst. 2007. Randoop: feedback-directed random testing for Java. In Companion to the 22nd ACM SIGPLAN conference on Objectoriented programming systems and applications companion. ACM, 815–816. Google ScholarDigital Library
- Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and Thomas Ball. 2007.Google Scholar
- Feedback-Directed Random Test Generation. In Proceedings of the 29th International Conference on Software Engineering (ICSE 2007). IEEE Computer Society, Washington, DC, USA, 75–84. Google ScholarDigital Library
- Yuhua Qi, Xiaoguang Mao, Yan Lei, Ziying Dai, and Chengsong Wang. 2014. The strength of random search on automated program repair. In Proceedings of the 36th International Conference on Software Engineering (ICSE 2014). 254–265. Google ScholarDigital Library
- Zichao Qi, Fan Long, Sara Anchor, and Martin Rinard. An Analysis of Patch Plausibility and Correctness for Generate-And-Validate Patch Generation Systems. In Proceedings of the 2015 International Symposium on Software Testing and Analysis (ISSTA 2015). 24–36. Google ScholarDigital Library
- Koushik Sen, Darko Marinov, and Gul Agha. 2005.Google Scholar
- CUTE: A Concolic Unit Testing Engine for C. In Proceedings of the 10th European Software Engineering Conference Held Jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE 2013). ACM, New York, NY, USA, 263–272. Google ScholarDigital Library
- Tan Shin Weiand, H Yoshida, Prasad M, and A. Roychoudhury. Anti-patterns in Search-based Program Repair. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016). 727– 738. Google ScholarDigital Library
- Edward K. Smith, Earl T. Barr, Claire Le Goues, and Yuriy Brun. 2015.Google Scholar
- Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015). ACM, New York, NY, USA, 532–543. Google ScholarDigital Library
- Michael Sutton. 2012. FileFuzz. (2012).Google Scholar
- Shin Hwei Tan and Abhik Roychoudhury. 2015.Google Scholar
- relifix: Automated Repair of Software Regressions. In Proceedings of the 2015 Internaltional Conference on Software Engineering (ICSE 2015). 471–482. Google ScholarDigital Library
- Westley Weimer, Zachary P Fry, and Stephen Forrest. 2013. Leveraging program equivalence for adaptive program repair: Models and first results. In Automated Software Engineering (ASE), 2013 IEEE/ACM 28th International Conference on. IEEE, 356–366. Google ScholarDigital Library
- Tao Xie. 2006.Google Scholar
- Augmenting Automatically Generated Unit-test Suites with Regression Oracle Checking. In Proceedings of the 20th European Conference on Object-Oriented Programming (ECOOP 2006). Springer-Verlag, Berlin, Heidelberg, 380–403. Google ScholarDigital Library
- Qi Xin and Steven P. Reiss. 2017.Google ScholarDigital Library
- Identifying Test-Suite-Overfitted Patches through Test Case Generation. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA 2017). Google ScholarDigital Library
- J. Xuan, M. Martinez, F. DeMarco, M. Clement, S. Lamelas Marcote, T. Durieux, D. Le Berre, and M. Monperrus. 2016. Nopol: Automatic Repair of Conditional Statement Bugs in Java Programs. IEEE Transactions on Software Engineering PP, 99 (2016), 1–1. Google ScholarDigital Library
- Z. Yu, M. Martinez, B. Danglot, T. Durieux, and M. Monperrus. 2017. Test Case Generation for Program Repair: A Study of Feasibility and Effectiveness. ArXiv e-prints (March 2017). arXiv:cs.SE/1703.00198Google Scholar
- Yucheng Zhang and Ali Mesbah. 2015.Google Scholar
- Assertions Are Strongly Correlated with Test Suite Effectiveness. In Proceedings of the joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015). ACM, 214–224. Abstract 1 Introduction 2 Background on Automated G&V Program Repair 3 Approach 3.1 Generating New Test Cases Using Fuzz Testing 3.2 Generating Memory-Safety Oracles 3.3 Measuring the Overfitness of a Patch Using an Overfitness Metric (O-measure) 3.4 An Optimized Setting of Opad 4 Evaluation 5 Threats to Validity 6 Related Work 7 Conclusions References Google ScholarDigital Library
Index Terms
- Better test cases for better automated program repair
Recommendations
Identifying patch correctness in test-based program repair
ICSE '18: Proceedings of the 40th International Conference on Software EngineeringTest-based automatic program repair has attracted a lot of attention in recent years. However, the test suites in practice are often too weak to guarantee correctness and existing approaches often generate a large number of incorrect patches.
To reduce ...
Context-aware patch generation for better automated program repair
ICSE '18: Proceedings of the 40th International Conference on Software EngineeringThe effectiveness of search-based automated program repair is limited in the number of correct patches that can be successfully generated. There are two causes of such limitation. First, the search space does not contain the correct patch. Second, the ...
Poracle: Testing Patches under Preservation Conditions to Combat the Overfitting Problem of Program Repair
To date, the users of test-driven program repair tools suffer from the overfitting problem; a generated patch may pass all available tests without being correct. In the existing work, users are treated as merely passive consumers of the tests. However, ...
Comments