ABSTRACT
In the past decade, research on test-suite-based automatic program repair has grown significantly. Each year, new approaches and implementations are featured in major software engineering venues. However, most of these approaches are evaluated on a single benchmark of bugs, and these evaluations are rarely reproduced by other researchers. In this paper, we present a large-scale experiment using 11 Java test-suite-based repair tools and 2,141 bugs from 5 benchmarks. Our goal is to better understand the current state of automatic program repair tools across a wide diversity of benchmarks. Our investigation is guided by the hypothesis that the repairability of repair tools might not generalize across different benchmarks. We found that the 11 tools 1) are able to generate patches for 21% of the bugs from the 5 benchmarks, and 2) perform better on Defects4J than on the other benchmarks, generating patches for 47% of the Defects4J bugs compared to 10-30% of the bugs from the other benchmarks. Our experiment comprises 23,551 repair attempts, which we used to find the causes of non-patch generation. We report these causes in this paper to help repair tool designers improve their approaches and tools.
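As a quick sanity check on the figures above, the following minimal sketch recomputes the number of repair attempts under the assumption that every tool is run once on every bug, and estimates how many bugs the 21% figure corresponds to. The function names are hypothetical, and the patched-bug count is only approximate because the percentage in the abstract is rounded.

```python
# Figures quoted in the abstract.
NUM_TOOLS = 11
NUM_BUGS = 2_141
REPORTED_ATTEMPTS = 23_551
PATCHED_SHARE = 0.21  # rounded percentage, so derived counts are approximate


def attempts_if_all_pairs(num_tools: int, num_bugs: int) -> int:
    """Attempts if each tool runs once on each bug (an assumption, not a stated fact)."""
    return num_tools * num_bugs


def approx_patched_bugs(num_bugs: int, share: float) -> int:
    """Approximate number of bugs covered by a rounded share of the benchmark."""
    return round(num_bugs * share)


if __name__ == "__main__":
    total = attempts_if_all_pairs(NUM_TOOLS, NUM_BUGS)
    print("attempts (tools x bugs):", total)
    print("matches reported figure:", total == REPORTED_ATTEMPTS)
    print("approx. patched bugs:", approx_patched_bugs(NUM_BUGS, PATCHED_SHARE))
```

The product of 11 tools and 2,141 bugs equals the reported 23,551 attempts, which is consistent with each tool being attempted on each bug.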
ESEC/FSE ’19, August 26–30, 2019, Tallinn, Estonia
Thomas Durieux, Fernanda Madeiral, Matias Martinez, and Rui Abreu
Index Terms
- Empirical review of Java program repair tools: a large-scale experiment on 2,141 bugs and 23,551 repair attempts