Published in: Empirical Software Engineering 2/2020

3 January 2020

ALFAA: Active Learning Fingerprint based Anti-Aliasing for correcting developer identity errors in version control systems

Written by: Sadika Amreen, Audris Mockus, Russell Zaretzki, Christopher Bogart, Yuxia Zhang


Abstract

Accurately determining developer identities is important for software engineering research and practice. Without it, even simple questions such as “how many developers does a project have?” cannot be answered. The commonly used version control data from Git is full of identity errors, and the existing approaches to correcting these errors are difficult to validate at large scale and cannot be easily improved. We therefore aim to develop a scalable, highly accurate, easy-to-use, and easy-to-improve approach to correcting software developer identity errors. We first amalgamate developer identities from version control systems in open source software repositories and investigate the nature and prevalence of these errors, design corrective algorithms, and estimate the impact of the errors on networks inferred from this data. We investigate these questions using a collection of over 1B Git commits with over 23M recorded author identities. By inspecting the author strings that occur most frequently, we group identity errors into categories. We then augment the author strings with three behavioral fingerprints: time-zone frequencies, the set of files modified, and a vector embedding of the commit messages. We create a manually validated set of identities for a subset of OpenStack developers using an active learning approach and use it to fit supervised learning models that predict the identities for the remaining author strings in OpenStack. We then compare these predictions with a competing commercially available effort and a leading research method. Finally, we compare network measures for file-induced author networks based on corrected and raw data. We find that commits made from different environments, misspellings, organizational IDs, default values, and anonymous IDs are the major sources of errors. We also find that supervised learning methods reduce errors severalfold in comparison to existing research and commercial methods, and that the active learning approach is an effective way to create validated datasets. Results also indicate that correcting developer identities has a large impact on the inferred social network. We believe that our proposed Active Learning Fingerprint Based Anti-Aliasing (ALFAA) approach will expedite research progress in the software engineering domain for applications that involve developer identities.
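To make the fingerprint idea concrete, the following is a minimal sketch of how two author records might be turned into pairwise similarity features that a supervised model could consume. It is an illustration under stated assumptions, not the authors' implementation: the record layout (`author`, `timezones`, `files`), the function names, and the choice of `difflib` ratio, cosine similarity, and Jaccard similarity are all hypothetical stand-ins. The commit-message embedding is left out because it requires a trained model.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt


def name_similarity(a: str, b: str) -> float:
    """Similarity of two author strings in [0, 1].

    Uses difflib's ratio as an assumed stand-in for whatever
    string metric a real pipeline would choose.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity of two sparse frequency vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(x: dict, y: dict) -> dict:
    """Build one feature row for a pair of author records (hypothetical layout)."""
    return {
        "name_sim": name_similarity(x["author"], y["author"]),
        # Time-zone fingerprint: how often each UTC offset appears in commits.
        "tz_sim": cosine(Counter(x["timezones"]), Counter(y["timezones"])),
        # File fingerprint: overlap of the sets of files each identity touched.
        "file_sim": jaccard(set(x["files"]), set(y["files"])),
    }


# Invented example records, for illustration only.
a = {"author": "Jane Doe <jane@example.org>",
     "timezones": ["+0100", "+0100", "+0200"],
     "files": ["nova/api.py", "nova/db.py"]}
b = {"author": "jane doe <jdoe@example.com>",
     "timezones": ["+0100", "+0200"],
     "files": ["nova/api.py", "README.rst"]}

features = pair_features(a, b)
```

Each feature row could then be labeled (same person or not) for a subset of pairs and passed to a classifier; which pairs to label first is where an active learning loop would come in.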


Footnotes
4
On large and diverse bodies of text, a larger vector size of 300 is recommended (Řehůřek and Sojka 2010).
 
5
We found that more accurate predictors can be obtained by training the learner only on the matched pairs, since the transitive closure typically results in some pairs that are extremely dissimilar, leading the learner to learn from such pairs and, subsequently, produce many more false positives
 
6
Assuming independence of observations and using binomial distribution.
 
9
The author got much better results than we could obtain using their published code without modifications.
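The effect described in footnote 5 can be illustrated with a small sketch: taking the transitive closure of matched pairs introduces pairs that were never matched directly and may look very different from each other. The code below is a hypothetical illustration using a union-find structure; the function names and the example identities are assumptions, not taken from the paper.

```python
from itertools import combinations


def transitive_closure(pairs):
    """Group identities into clusters connected by matched pairs (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())


def implied_pairs(pairs):
    """Pairs that exist only because of the closure, not because of a direct match."""
    matched = {frozenset(p) for p in pairs}
    implied = set()
    for cluster in transitive_closure(pairs):
        for a, b in combinations(sorted(cluster), 2):
            if frozenset((a, b)) not in matched:
                implied.add((a, b))
    return implied


# Invented identities: the first and third strings share little on the surface,
# yet the closure links them through the second one.
matched = [
    ("J. Smith <js@a.example>", "John Smith <js@a.example>"),
    ("John Smith <js@a.example>", "john <john@b.example>"),
]
extra = implied_pairs(matched)
```

Training only on `matched` keeps such closure-only pairs out of the positive examples, which is the choice the footnote reports as producing more accurate predictors.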
 
References
Badashian AS, Esteki A, Gholipour A, Hindle A, Stroulia E (2014) Involvement, contribution and influence in github and stack overflow. In: Proceedings of 24th annual international conference on computer science and software engineering, pp 19–33. IBM Corp
Baltes S, Diehl S (2018) Usage and attribution of stack overflow code snippets in github projects. Empir Softw Eng 24:1–37
Bird C, Rigby PC, Barr ET, Hamilton DJ, German DM, Devanbu P (2009) The promises and perils of mining git
Burt RS (1992) Structural holes. Harvard University Press, Harvard
Cataldo M, Wagstrom PA, Herbsleb JD, Carley KM (2006) Identification of coordination requirements: implications for the design of collaboration and awareness tools. In: Proceedings of the 2006 20th anniversary conference on Computer supported cooperative work, pp 353–362. ACM
Cataldo M, Herbsleb JD, Carley KM (2008) Socio-technical congruence: a framework for assessing the impact of technical and work dependencies on software development productivity. In: Proceedings of the Second ACM-IEEE international symposium on Empirical software engineering and measurement, pp 2–11. ACM
Cohen WW, Ravikumar P, Fienberg SE (2003) A comparison of string metrics for matching names and records. In: KDD Workshop on data cleaning and object consolidation
Czerwonka J, Nagappan N, Schulte W, Murphy B (2013) Codemine: Building a software development data analytics platform at microsoft. IEEE Softw 30(4):64–71
Edberg DT, Bowman BJ (1996) User-developed applications: An empirical study of application quality and developer productivity. J Manag Inf Syst 13(1):167–185
German DM (2004) Mining cvs repositories, the softchange experience. In: 1st international workshop on mining software repositories, pp 17–21. Citeseer
German D, Mockus A (2003) Automating the measurement of open source projects. In: Proceedings of the 3rd workshop on open source software engineering, pp 63–67. University College Cork, Cork, Ireland
Hallgren KA (2012) Computing inter-rater reliability for observational data: an overview and tutorial. Tutorials in quantitative methods for psychology 8(1):23
Jergensen C, Sarma A, Wagstrom P (2011) The onion patch: migration in open source ecosystems. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, pp 70–80. ACM
Kouters E, Vasilescu B, Serebrenik A, van den Brand MGJ (2012) Who’s who in gnome: using lsa to merge software repository identities. In: 28th IEEE international conference on software maintenance (ICSM). IEEE
Lawrence S, Giles CL, Bollacker K (1999) Digital libraries and autonomous citation indexing. Computer 32(6):67–71. 10.1109/2.769447
Ma Y, Bogart C, Amreen S, Zaretzki R, Mockus A (2019) World of code: An infrastructure for mining the universe of open source vcs data. In: Proceedings of the 2019 international conference on mining software repositories
Martinez-Romo J, Robles G, Gonzalez-Barahona JM, Ortuño-Perez M (2008) Using social network analysis techniques to study collaboration between a floss community and a company. In: Russo B, Damiani E, Hissam S, Lundell B, Succi G (eds) Open source development, communities and quality. Springer, Boston, pp 171–186
Mockus A (2009a) Amassing and indexing a large sample of version control systems: towards the census of public source code history. In: 6th IEEE working conference on mining software repositories. IEEE. papers/amassing.pdf
Mockus A (2009b) Succession: Measuring transfer of code and developer productivity. In: Proceedings of the 31st international conference on software engineering, pp 67–77. IEEE Computer Society
Mockus A (2009c) Succession: Measuring transfer of code and developer productivity. In: 2009 international conference on software engineering. papers/succession.pdf. ACM Press, Vancouver
Mockus A, Herbsleb JD (2002) Expertise browser: a quantitative approach to identifying expertise. In: Proceedings of the 24th international conference on software engineering, pp 503–512. ACM
Nagappan N, Murphy B, Basili V (2008) The influence of organizational structure on software quality. In: 2008 ACM/IEEE 30th international conference on software engineering, pp 521–530. IEEE
Petersen K, Wohlin C (2011) Measuring the flow in lean software development. Software: Practice and experience 41(9):975–996
Pinzger M, Nagappan N, Murphy B (2008) Can developer-module networks predict failures? In: Proceedings of the 16th ACM SIGSOFT international symposium on foundations of software engineering, SIGSOFT ’08/FSE-16. https://doi.org/10.1145/1453101.1453105. ACM, New York, pp 2–12
Řehůřek R, Sojka P (2010) Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks. ELRA, Valletta, pp 45–50
Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’02. https://doi.org/10.1145/775047.775087. ACM, New York, pp 269–278
Spencer D, Warfel T (2004) Card sorting: A definitive guide. Boxes and Arrows, pp 2
Thung F, Bissyande TF, Lo D, Jiang L (2013) Network structure of social coding in github. In: 2013 17th European conference on software maintenance and reengineering, pp 323–326. IEEE
Ventura SL, Nugent R, Fuchs ER (2015) Seeing the non-starts: (some) sources of bias in past disambiguation approaches and a new public tool leveraging labeled records. Elsevier
Wang DJ, Shi X, McFarland DA, Leskovec J (2012) Measurement error in network data: A re-classification. Soc Netw 34(4):396–409
Wiese IS, da Silva JT, Steinmacher I, Treude C, Gerosa MA (2016) Who is who in the mailing list? comparing six disambiguation heuristics to identify multiple addresses of a participant. In: 2016 IEEE international conference on software maintenance and evolution (ICSME), pp 345–355, DOI 10.1109/ICSME.2016.13, (to appear in print)
Winkler WE (2006) Overview of record linkage and current research directions. Tech. rep., Bureau of the Census
Wolf T, Schröter A, Damian D, Panjer LD, Nguyen THD (2009) Mining task-based social networks to explore collaboration in software teams. IEEE Softw 26(1):58–66. 10.1109/MS.2009.16
Zhou M, Mockus A, Ma X, Zhang L, Mei H (2016) Inflow and retention in oss communities with commercial involvement: A case study of three hybrid projects. ACM Transactions on Software Engineering and Methodology (TOSEM) 25(2):13
Zhu J, Wei J (2019) An empirical study of multiple names and email addresses in oss version control repositories. In: Proceedings of 16th international conference on mining software repositories (MSR). IEEE/ACM
Metadata
Title
ALFAA: Active Learning Fingerprint based Anti-Aliasing for correcting developer identity errors in version control systems
Written by
Sadika Amreen
Audris Mockus
Russell Zaretzki
Christopher Bogart
Yuxia Zhang
Publication date
3 January 2020
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 2/2020
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-019-09786-7
