Published in: Empirical Software Engineering, Issue 6/2017

18.04.2017

Curating GitHub for engineered software projects

Written by: Nuthan Munaiah, Steven Kroh, Craig Cabrey, Meiyappan Nagappan


Abstract

Software forges like GitHub host millions of repositories. Software engineering researchers have been able to take advantage of such large corpora of potential study subjects with the help of tools like GHTorrent and Boa. However, the simplicity of querying comes with a caveat: there are limited means of separating the signal (e.g., repositories containing engineered software projects) from the noise (e.g., repositories containing homework assignments). The proportion of noise in a random sample of repositories could skew a study and lead researchers to unrealistic, potentially inaccurate conclusions. We argue that it is imperative to be able to sieve out the noise in such large repository forges. We propose a framework, and present a reference implementation of the framework as a tool called reaper, to enable researchers to select GitHub repositories that contain evidence of an engineered software project. We identify software engineering practices (called dimensions) and propose means for validating their existence in a GitHub repository. We used reaper to measure the dimensions of 1,857,423 GitHub repositories. We then used manually classified data sets of repositories to train classifiers capable of predicting whether a given GitHub repository contains an engineered software project. The performance of the classifiers was evaluated using a set of 200 repositories with known ground truth classification. We also compared the performance of the classifiers to other approaches to classification (e.g., number of GitHub stargazers) and found that our classifiers outperform existing approaches. We found the stargazers-based classifier (with 10 as the threshold for the number of stargazers) to exhibit high precision (97%) but much lower recall (32%). On the other hand, our best classifier exhibited both high precision (82%) and high recall (86%). The stargazers-based criterion offers precision but fails to recall a significant portion of the population.
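The precision and recall trade-off of a stargazers-based criterion can be illustrated with a minimal sketch. The `Repo` record, the `predict_by_stars` and `precision_recall` helpers, and the six sample repositories are hypothetical and invented for illustration; they are not the reaper implementation or the study's data. The sketch also assumes that a repository with exactly 10 stargazers counts as meeting the threshold, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Repo:
    name: str         # hypothetical repository identifier
    stargazers: int   # number of GitHub stargazers
    engineered: bool  # manually assigned ground-truth label


def predict_by_stars(repo: Repo, threshold: int = 10) -> bool:
    """Predict 'engineered' when the stargazer count reaches the threshold.

    Treating the threshold as inclusive (>=) is an assumption.
    """
    return repo.stargazers >= threshold


def precision_recall(repos, predict):
    """Return (precision, recall) of `predict` against the ground truth."""
    tp = sum(1 for r in repos if predict(r) and r.engineered)
    fp = sum(1 for r in repos if predict(r) and not r.engineered)
    fn = sum(1 for r in repos if not predict(r) and r.engineered)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data, invented for illustration only; not taken from the study.
SAMPLE = [
    Repo("a", 250, True),
    Repo("b", 12, True),
    Repo("c", 3, True),   # engineered but unpopular: missed by the threshold
    Repo("d", 0, True),   # engineered but unpopular: missed by the threshold
    Repo("e", 1, False),
    Repo("f", 0, False),
]

precision, recall = precision_recall(SAMPLE, predict_by_stars)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On this toy sample every repository above the threshold is engineered, so precision is perfect, while the two unpopular engineered repositories are missed and recall drops to one half. This is the same qualitative pattern the abstract reports for the stargazers-based classifier.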


Footnotes
1. National Science Foundation (NSF) Grant CNS-1513263
4. http://doc.qt.io/qt-5/qtgraphicaleffects-index.html
5. http://merproject.org/
6. Citation counts retrieved from Google Scholar

Metadata
Title
Curating GitHub for engineered software projects
Written by
Nuthan Munaiah
Steven Kroh
Craig Cabrey
Meiyappan Nagappan
Publication date
18.04.2017
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 6/2017
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-017-9512-6
