Published in: Empirical Software Engineering 7/2022

01-12-2022

Pitfalls and guidelines for using time-based Git data

Authors: Samuel W. Flint, Jigyasa Chauhan, Robert Dyer


Abstract

Many software engineering research papers rely on time-based data (e.g., commit timestamps, issue report creation/update/close dates, release dates). Like most real-world data, however, time-based data is often dirty. To date, no studies have quantified how frequently the software engineering research community uses such data, or investigated the sources of dirty time-based data and how often they occur. Depending on the research task and method used, including such dirty data could affect the research results. This paper presents an extended survey of papers that utilize time-based data, published in the Mining Software Repositories (MSR) conference series. Out of the 754 technical track and data papers published in MSR 2004–2021, we found that at least 290 (38%) utilized time-based data. We also observed that most time-based data used in research papers comes in the form of Git commits, often from GitHub. Based on those results, we then used the Boa and Software Heritage infrastructures to help identify and quantify several sources of dirty Git timestamp data. Finally, we provide guidelines/best practices for researchers utilizing time-based data from Git repositories.
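
The abstract does not list the specific checks used to detect dirty timestamps, so the following is only a minimal sketch of what such a filter might look like. The `Commit` record, the `flag_suspicious` function, and the three heuristics (a timestamp at or before the Unix epoch, a timestamp in the future, and a commit dated earlier than one of its parents) are assumptions made for illustration. They are not the paper's criteria, and they are not part of the Boa or Software Heritage APIs.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Commit:
    """Hypothetical minimal commit record (not a Boa or Software Heritage type)."""
    sha: str
    timestamp: int  # seconds since the Unix epoch, UTC
    parents: tuple = ()  # SHAs of parent commits


def flag_suspicious(commits, now):
    """Return {sha: [reasons]} for commits whose timestamps look implausible.

    The three checks below are illustrative assumptions only.
    """
    by_sha = {c.sha: c for c in commits}
    flagged = {}
    for c in commits:
        reasons = []
        if c.timestamp <= 0:
            reasons.append("at or before Unix epoch")
        if c.timestamp > now:
            reasons.append("in the future")
        for p in c.parents:
            parent = by_sha.get(p)
            # A parent missing from the input is skipped rather than guessed at.
            if parent is not None and c.timestamp < parent.timestamp:
                reasons.append("earlier than parent " + p)
        if reasons:
            flagged[c.sha] = reasons
    return flagged


if __name__ == "__main__":
    # Fixed reference time so the "future" check is reproducible.
    reference = int(datetime(2022, 1, 1, tzinfo=timezone.utc).timestamp())
    history = [
        Commit("a1", 1_500_000_000),
        Commit("b2", 1_400_000_000, ("a1",)),
        Commit("c3", 0, ("b2",)),
        Commit("d4", 4_000_000_000, ("c3",)),
    ]
    for sha, reasons in flag_suspicious(history, reference).items():
        print(sha, reasons)
```

Passing the reference time explicitly, rather than reading the system clock inside the function, keeps the result reproducible when a dataset is re-analyzed later.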


Footnotes
1
All authors brainstormed potential keywords and helped create the final list.
 
3
Note that: “While Subversion automatically attaches properties (svn:date, svn:author, svn:log, and so on) to revisions, it does not presume thereafter the existence of those properties, and neither should you or the tools you use to interact with your repository.” https://svnbook.red-bean.com/en/1.7/svn.advanced.props.html
 
4
The Kotlin dataset contains some projects that may also exist in the Java dataset.
 
5
The SF.net dataset contained Subversion projects, which store commit IDs as integers; these IDs are not unique across projects, so the commits cannot be easily deduplicated.
 
Metadata
Title
Pitfalls and guidelines for using time-based Git data
Authors
Samuel W. Flint
Jigyasa Chauhan
Robert Dyer
Publication date
01-12-2022
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 7/2022
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-022-10200-y
