Published in: Empirical Software Engineering 1/2021

01.01.2021

What makes a popular academic AI repository?

Authors: Yuanrui Fan, Xin Xia, David Lo, Ahmed E. Hassan, Shanping Li


Abstract

Many AI researchers publish the code, data, and other resources that accompany their papers in GitHub repositories. In this paper, we refer to these repositories as academic AI repositories. Our preliminary study shows that highly cited papers are more likely to have popular academic AI repositories (and vice versa). We therefore conduct an empirical study of academic AI repositories to highlight, for AI researchers, the good software engineering practices of popular ones. We collect 1,149 academic AI repositories, label the 20% of repositories with the most stars as popular, and label the bottom 70% as unpopular. The remaining 10% serve as a gap between the popular and unpopular groups. We propose 21 features to characterize the software engineering practices of academic AI repositories. Our experimental results show that popular and unpopular academic AI repositories differ statistically significantly in 11 of the studied features, indicating that the two groups follow significantly different software engineering practices. Furthermore, we find that the number of links to other GitHub repositories in the README file, the number of images in the README file, and the inclusion of a license are the most important features for differentiating the two groups. Our dataset and code are publicly available to share with the community.
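
The labelling scheme described in the abstract (top 20% by stars as popular, bottom 70% as unpopular, the remaining 10% as a gap) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the function name label_by_stars, the input format (a mapping from repository name to star count), the rounding of group sizes, and the handling of ties (ranks follow input order) are all assumptions.

```python
from typing import Dict


def label_by_stars(stars: Dict[str, int],
                   top: float = 0.20,
                   bottom: float = 0.70) -> Dict[str, str]:
    """Label each repository 'popular', 'unpopular', or 'gap' by star rank.

    Hypothetical helper: group sizes are rounded to the nearest integer,
    and tied star counts keep their input order (sorted() is stable).
    """
    # Rank repository names from most to fewest stars.
    ranked = sorted(stars, key=lambda name: stars[name], reverse=True)
    n = len(ranked)
    n_top = round(n * top)        # size of the popular group
    n_bottom = round(n * bottom)  # size of the unpopular group

    labels: Dict[str, str] = {}
    for rank, name in enumerate(ranked):
        if rank < n_top:
            labels[name] = "popular"
        elif rank >= n - n_bottom:
            labels[name] = "unpopular"
        else:
            labels[name] = "gap"  # excluded buffer between the two groups
    return labels


# Toy usage with made-up star counts (not data from the study).
example = {f"repo{i}": 100 - 10 * i for i in range(10)}
example_labels = label_by_stars(example)
```

Leaving a gap between the two groups keeps repositories with near-identical star counts from landing on opposite sides of a single cut-off.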

Metadata
Title
What makes a popular academic AI repository?
Authors
Yuanrui Fan
Xin Xia
David Lo
Ahmed E. Hassan
Shanping Li
Publication date
01.01.2021
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 1/2021
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-020-09916-6
