Published in: Empirical Software Engineering 5/2016

01-10-2016

An in-depth study of the promises and perils of mining GitHub

Authors: Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M. German, Daniela Damian

Abstract

With over 10 million git repositories, GitHub is becoming one of the most important sources of software artifacts on the Internet. Researchers mine the information stored in GitHub’s event logs to understand how its users employ the site to collaborate on software, but so far there have been no studies describing the quality and properties of the available GitHub data. We document the results of an empirical study aimed at understanding the characteristics of the repositories and users in GitHub; we see how users take advantage of GitHub’s main features and how their activity is tracked on GitHub and related datasets to point out misalignment between the real and mined data. Our results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration. For example, we show that the majority of the projects are personal and inactive, and that almost 40 % of all pull requests do not appear as merged even though they were. Also, approximately half of GitHub’s registered users do not have public activity, while the activity of GitHub users in repositories is not always easy to pinpoint. We use our identified perils to see if they can pose validity threats; we review selected papers from the MSR 2014 Mining Challenge and see if there are potential impacts to consider. We provide a set of recommendations for software engineering researchers on how to approach the data in GitHub.

Footnotes
2. A collection of open source software data, formerly known as OssMole.
4. GHTorrent associates a commit with the repository where it first sees it (table commits) and also links it to all repositories in which this commit has appeared (table repo_commits).
8. We currently track all sources of commits in the Linux kernel: hydraladder.turingmachine.org
12. The authors clarified this view in private communication.
Metadata
Title
An in-depth study of the promises and perils of mining GitHub
Authors
Eirini Kalliamvakou
Georgios Gousios
Kelly Blincoe
Leif Singer
Daniel M. German
Daniela Damian
Publication date
01-10-2016
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 5/2016
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-015-9393-5
