Published in: Empirical Software Engineering 2/2016

1 April 2016

The impact of tangled code changes on defect prediction models

Authors: Kim Herzig, Sascha Just, Andreas Zeller

Abstract

When interacting with source control management systems, developers often commit unrelated or loosely related code changes in a single transaction. When analyzing version histories, such tangled changes make all changes to all modules appear related, possibly compromising the resulting analyses through noise and bias. In an investigation of five open-source Java projects, we found that between 7 % and 20 % of all bug fixes consist of multiple tangled changes. Using a multi-predictor approach to untangle changes, we show that on average at least 16.6 % of all source files are incorrectly associated with bug reports. These incorrect bug-file associations do not seem to significantly impact models that classify source files as having at least one bug or no bugs. However, our experiments show that untangling tangled code changes can result in more accurate regression bug prediction models compared to models trained and tested on tangled bug datasets: in our experiments, the statistically significant accuracy improvements lie between 5 % and 200 %. We recommend better change organization to limit the impact of tangled changes.

Footnotes
1
These findings confirm the results of earlier research presented by Kawrykow and Robillard (2011) and Kawrykow (2011).
 
2
Since it is undecidable whether a program will ever terminate under arbitrary conditions, we are, in general, also unable to decide whether two code changes may influence each other during a possibly infinite program run.
 
3
This ConfVoter is slightly penalized by the artificial blob generation strategy pack, which creates blobs by combining changes to files based on directory distance (see Subsection 4.2). However, we favored a more realistic distribution of changes over total fairness across all ConfVoters.
 
4
Since we are analyzing artificially tangled change sets only, the file mapping error rate without untangling lies at 100 %. With an error rate of 19 % after untangling, the resulting reduction rate is 100 % − 19 % = 81 %.
 
References
Alam O, Adams B, Hassan AE (2009) Measuring the progress of projects using the time dependence of code changes. In: ICSM, pp 329–338
Anvik J, Hiew L, Murphy GC (2006) Who should fix this bug? In: Proceedings of the 28th international conference on Software engineering. ACM, pp 361–370
Bachmann A, Bird C, Rahman F, Devanbu P, Bernstein A (2010) The missing links: bugs and bug-fix commits. In: Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering. ACM, pp 97–106
Bhattacharya P (2011) Using software evolution history to facilitate development and maintenance. In: Proceeding of the 33rd international conference on Software engineering. ACM, pp 1122–1123
Bird C, Bachmann A, Aune E, Duffy J, Bernstein A, Filkov V, Devanbu P (2009) Fair and balanced? Bias in bug-fix datasets. In: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, ESEC/FSE ’09. ACM, pp 121–130
Bird C, Nagappan N, Gall H, Murphy B, Devanbu P (2009) Putting it all together: Using socio-technical networks to predict failures. In: Proceedings of the 2009 20th International Symposium on Software Reliability Engineering, ISSRE ’09. IEEE Computer Society, Washington, pp 109–119. doi:10.1109/ISSRE.2009.17
Bonacich P (1987) Power and centrality: a family of measures. American journal of sociology
Dagenais B, Hendren L (2008) Enabling static analysis for partial Java programs. In: Proceedings of the 23rd ACM SIGPLAN conference on Object-oriented programming systems languages and applications, OOPSLA ’08. ACM, pp 313–328
Dallmeier V (2010) Mining and checking object behavior. Ph.D. thesis, Universität des Saarlandes
Herzig K (2012) Mining and untangling change genealogies. Ph.D. thesis, Universität des Saarlandes
Herzig K, Just S, Rau A, Zeller A (2013) Predicting defects using change genealogies. In: Proceedings of the 2013 IEEE 24th international symposium on software reliability engineering, ISSRE ’13. IEEE Computer Society
Herzig K, Just S, Zeller A (2013) It’s not a bug, it’s a feature: how misclassification impacts bug prediction. In: Proceedings of the 2013 international conference on software engineering, ICSE ’13. IEEE Press, Piscataway, pp 392–401
Herzig K, Zeller A (2013) The Impact of Tangled Code Changes. IEEE Press, Piscataway, pp 121–130
Hindle A, German D, Godfrey M, Holt R (2009) Automatic classification of large changes into maintenance categories. In: Program comprehension, 2009. ICPC ’09. 17th International Conference on IEEE, pp 30–39
Hindle A, German DM, Holt R (2008) What do large commits tell us? A taxonomical study of large commits. In: Proceedings of the 2008 international working conference on Mining software repositories, MSR ’08. ACM, pp 99–108
Jaccard P (1901) Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles 37:547–579
Karypis G, Kumar V (1995) Analysis of multilevel graph partitioning. In: Proceedings of the 1995 ACM/IEEE conference on Supercomputing, Supercomputing 1995. ACM
Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20:359–392
Kawrykow D (2011) Enabling precise interpretations of software change data. Master’s thesis, McGill University
Kawrykow D, Robillard MP (2011) Non-essential changes in version histories. In: Proceeding of the 33rd international conference on Software engineering, ICSE ’11. ACM, pp 351–360
Kim S, Whitehead Jr. EJ, Zhang Y (2008) Classifying software changes: Clean or buggy. IEEE Trans Softw Eng 34:181–196
Kim S, Zhang H, Wu R, Gong L (2011) Dealing with noise in defect prediction. In: Proceeding of the 33rd international conference on Software engineering, ICSE ’11. ACM, pp 481–490
Kim S, Zimmermann T, Whitehead Jr. EJ, Zeller A (2007) Predicting faults from cached history. In: Proceedings of the 29th International Conference on Software Engineering, ICSE ’07. IEEE Computer Society, Washington, pp 489–498. doi:10.1109/ICSE.2007.66
Li PL, Kivett R, Zhan Z, Jeon SE, Nagappan N, Murphy B, Ko AJ (2011) Characterizing the differences between pre- and post-release versions of software. In: Proceeding of the 33rd international conference on Software engineering. ACM, pp 716–725
Menzies T, Milton Z, Turhan B, Cukic B, Jiang Y, Bener A (2010) Defect prediction from static code features: current results, limitations, new approaches. Automated Software Engg 17:375–407
Mockus A, Votta LG (2000) Identifying reasons for software changes using historic databases. In: Proceedings of the international conference on software maintenance (ICSM’00), ICSM ’00. IEEE Computer Society, pp 120–130
Murphy-Hill E, Black A (2008) Refactoring tools: fitness for purpose. IEEE Software 25(5):38–44
Murphy-Hill E, Parnin C, Black AP (2009) How we refactor, and how we know it. Int Conf Softw Eng 287–297
Nagappan N, Murphy B, Basili V (2008) The influence of organizational structure on software quality: an empirical case study. In: Proceedings of the 30th international conference on Software engineering, ICSE ’08. ACM, New York, pp 521–530. doi:10.1145/1368088.1368160
Nguyen TH, Adams B, Hassan AE (2010) A case study of bias in bug-fix datasets. In: 2010 17th Working Conference on Reverse Engineering. IEEE Computer Society, pp 259–268
Premraj R, Herzig K (2011) Network versus code metrics to predict defects: a replication study. In: Proceedings of the 2011 international symposium on empirical software engineering and measurement, ESEM ’11. IEEE Computer Society, Washington, pp 215–224. doi:10.1109/ESEM.2011.30
R Development Core Team (2010) R: a language and environment for statistical computing. R Foundation for Statistical Computing
Robbes R, Lanza M, Lungu M (2007) An approach to software evolution based on semantic change. In: Fundamental approaches to software engineering, Lecture notes in computer science, vol 4422. Springer, Berlin, pp 27–41
Stoerzer M, Ryder BG, Ren X, Tip F (2006) Finding failure-inducing changes in Java programs using change classification. In: Proceedings of the 14th ACM SIGSOFT international symposium on Foundations of software engineering. ACM, pp 57–68
Tosun A, Turhan B, Bener A (2009) Validation of network measures as indicators of defective modules in software systems. In: Proceedings of the 5th international conference on predictor models in software engineering, PROMISE ’09. ACM, New York, pp 5:1–5:9. doi:10.1145/1540438.1540446
Williams BJ, Carver JC (2010) Characterizing software architecture changes: a systematic review. Information and Software Technology 52(1):1–51
Wloka J, Ryder B, Tip F, Ren X (2009) Safe-commit analysis to facilitate team software development. In: Proceedings of the 31st International Conference on Software Engineering, ICSE ’09. IEEE Computer Society, pp 507–517
Zimmermann T, Nagappan N (2008) Predicting defects using network analysis on dependency graphs. In: Proceedings of the 30th international conference on Software engineering, ICSE ’08. ACM, New York, pp 531–540. doi:10.1145/1368088.1368161
Zimmermann T, Premraj R, Zeller A (2007) Predicting defects for Eclipse. In: Proceedings of the third international workshop on predictor models in software engineering, PROMISE ’07. IEEE Computer Society
Zimmermann T, Weißgerber P, Diehl S, Zeller A (2004) Mining version histories to guide software changes. In: Proceedings of the 26th international conference on software engineering. IEEE Computer Society, pp 563–572
Metadata
Title
The impact of tangled code changes on defect prediction models
Authors
Kim Herzig
Sascha Just
Andreas Zeller
Publication date
1 April 2016
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 2/2016
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-015-9376-6