Published in: Empirical Software Engineering 5/2023

01-10-2023

Using the uniqueness of global identifiers to determine the provenance of Python software source code

Authors: Yiming Sun, Daniel German, Stefano Zacchiroli

Abstract

We consider the problem of identifying the provenance of free/open source software (FOSS), and specifically the need to identify where reused source code has been copied from. We propose a lightweight approach to this problem based on software identifiers, such as the names of variables, classes, and functions chosen by programmers. The proposed approach efficiently narrows the search down to a small set of candidate origin products, which can then be analyzed with more expensive techniques to make a final provenance determination. By analyzing the PyPI (Python Package Index) open source ecosystem, we find that globally defined identifiers are very distinctive. Across PyPI’s 244 K packages we found 11.2 M different global identifiers (class names and method/function names, with only 0.6% of identifiers shared between the two types of entities); 76% of identifiers were used in only one package, and 93% in at most 3. Randomly selecting 3 non-frequent global identifiers from an input product is enough to narrow down its origins to at most 3 products in 89% of cases. We validate the proposed approach by mapping Debian source packages implemented in Python to the corresponding PyPI packages. This validation uses at most five trials, where each trial takes three randomly chosen global identifiers from a randomly chosen Python file of the subject software package, ranks the results using a popularity index, and requires inspecting only the top result. In our experiments, this method finds the true origin of a project with a recall of 0.9 and a precision of 0.77.
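The lookup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it extracts identifiers with Python's standard `ast` module rather than a tag-extraction tool, and it assumes that "non-frequent" means "defined in at most `max_packages` packages" and that the three sampled identifiers are combined by intersecting their package sets. The function names, the `corpus` layout (package name mapped to a list of source strings), and the `popularity` scores are all hypothetical.

```python
from __future__ import annotations

import ast
import random
from collections import defaultdict


def global_identifiers(source: str) -> set[str]:
    """Return the names of classes, functions, and methods defined in `source`.

    Assumption: ast.walk visits every definition, so nested functions are
    included as well.
    """
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))
    }


def build_index(corpus: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each identifier to the set of packages that define it."""
    index: dict[str, set[str]] = defaultdict(set)
    for package, sources in corpus.items():
        for source in sources:
            for name in global_identifiers(source):
                index[name].add(package)
    return index


def candidate_origins(
    source: str,
    index: dict[str, set[str]],
    popularity: dict[str, float],
    k: int = 3,             # identifiers sampled per trial
    max_packages: int = 3,  # assumed cut-off for a "non-frequent" identifier
    trials: int = 5,        # maximum number of trials
    rng: random.Random | None = None,
) -> list[str]:
    """Return candidate origin packages for one file, most popular first."""
    rng = rng or random.Random()
    # Keep only identifiers that the index knows and that are rare enough.
    rare = sorted(
        name
        for name in global_identifiers(source)
        if 0 < len(index.get(name, ())) <= max_packages
    )
    if not rare:
        return []
    for _ in range(trials):
        sample = rng.sample(rare, min(k, len(rare)))
        # Assumption: a candidate must define every sampled identifier.
        matches = set.intersection(*(index[name] for name in sample))
        if matches:
            return sorted(matches, key=lambda p: popularity.get(p, 0), reverse=True)
    return []
```

For example, calling `candidate_origins(file_text, index, popularity)` on one randomly chosen file of the subject package returns a ranked list, and only the first entry would be inspected. An empty list means that no trial produced a common package under these assumptions.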

Footnotes
1
https://pypi.org/, accessed 2021-11-15
3
Note that this copy might not have been done directly from the corpus; it is, however, a copy of the same entity that exists in the corpus.
4
Projects do not reside in PyPI, but PyPI links to their actual location.
7
https://ctags.io/, accessed 2021-10-25. Universal Ctags 0.0.0 (2015) derived from Exuberant Ctags 5.8.
8
At that time Debian Buster was already shipped as a “stable” release, so while it is possible that its content has changed since, modifications are expected to be minimal according to Debian release processes.
10
Titled: “Scraping tripadvisor review, len container change, no such element Unable to locate element”, https://stackoverflow.com/questions/68878857, accessed 2022-01-16
Metadata
Title
Using the uniqueness of global identifiers to determine the provenance of Python software source code
Authors
Yiming Sun
Daniel German
Stefano Zacchiroli
Publication date
01-10-2023
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 5/2023
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-023-10317-8
