Published in: Empirical Software Engineering 5/2016

1 October 2016

An in-depth study of the promises and perils of mining GitHub

Authors: Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M. German, Daniela Damian

Abstract

With over 10 million Git repositories, GitHub is becoming one of the most important sources of software artifacts on the Internet. Researchers mine the information stored in GitHub’s event logs to understand how its users employ the site to collaborate on software, but so far no studies have described the quality and properties of the available GitHub data. We document the results of an empirical study aimed at understanding the characteristics of the repositories and users in GitHub; we examine how users take advantage of GitHub’s main features and how their activity is tracked on GitHub and related datasets, and we point out misalignments between the real and the mined data. Our results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration. For example, we show that the majority of the projects are personal and inactive, and that almost 40% of all pull requests do not appear as merged even though they were. Also, approximately half of GitHub’s registered users have no public activity, and the activity of GitHub users in repositories is not always easy to pinpoint. We use the identified perils to assess whether they can pose validity threats: we review selected papers from the MSR 2014 Mining Challenge and consider the potential impact on them. We provide a set of recommendations for software engineering researchers on how to approach the data in GitHub.

Footnotes
2
A collection of open source software data, formerly known as OssMole.
 
4
GHTorrent associates a commit with the repository where it first sees it (table commits) and also links it to every repository in which the commit has appeared (table repo_commits).
 
8
We currently track all sources of commits in the Linux kernel: hydraladder.turingmachine.org
 
12
The authors clarified this view in private communication.
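
The two-table arrangement described in footnote 4 can be illustrated with a small sketch. Only the table names commits and repo_commits come from the footnote; every column name, the sample data, and the helper functions below are assumptions made for illustration and do not reproduce the actual GHTorrent schema.

```python
import sqlite3

# Hypothetical, simplified schema. Only the table names `commits` and
# `repo_commits` appear in footnote 4; all column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE commits (
        id INTEGER PRIMARY KEY,
        sha TEXT NOT NULL,
        first_seen_repo_id INTEGER NOT NULL  -- repository where the commit was first seen
    );
    CREATE TABLE repo_commits (
        repo_id INTEGER NOT NULL,
        commit_id INTEGER NOT NULL REFERENCES commits(id)
    );
""")

# Made-up data: one commit first seen in repository 1 that also shows up
# in repositories 2 and 3 (for example, forks).
conn.execute("INSERT INTO commits VALUES (1, 'abc123', 1)")
conn.executemany("INSERT INTO repo_commits VALUES (?, ?)", [(1, 1), (2, 1), (3, 1)])


def first_seen_repo(conn, sha):
    """Return the single repository recorded on the commit row, or None."""
    row = conn.execute(
        "SELECT first_seen_repo_id FROM commits WHERE sha = ?", (sha,)
    ).fetchone()
    return row[0] if row else None


def repos_containing(conn, sha):
    """Return every repository id linked to the commit through repo_commits."""
    rows = conn.execute(
        "SELECT rc.repo_id FROM repo_commits rc "
        "JOIN commits c ON c.id = rc.commit_id "
        "WHERE c.sha = ? ORDER BY rc.repo_id",
        (sha,),
    ).fetchall()
    return [r[0] for r in rows]
```

Under these assumptions, an analysis that reads only the commits table would attribute the commit to a single repository, whereas joining through repo_commits shows every repository that contains it.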
 
Metadata
Title
An in-depth study of the promises and perils of mining GitHub
Authors
Eirini Kalliamvakou
Georgios Gousios
Kelly Blincoe
Leif Singer
Daniel M. German
Daniela Damian
Publication date
1 October 2016
Publisher
Springer US
Published in
Empirical Software Engineering, Issue 5/2016
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-015-9393-5
