Published in: Empirical Software Engineering 6/2013

1 December 2013

Software Bertillonage

Determining the provenance of software development artifacts

Authors: Julius Davies, Daniel M. German, Michael W. Godfrey, Abram Hindle


Abstract

Deployed software systems are typically composed of many pieces, not all of which may have been created by the main development team. Often, the provenance of included components, such as external libraries or cloned source code, is not clearly stated, and this uncertainty can introduce technical and ethical concerns that make it difficult for system owners and other stakeholders to manage their software assets. In this work, we motivate the need to recover the provenance of software entities using a broad set of techniques that could include signature matching, source code fact extraction, software clone detection, call flow graph matching, string matching, and historical analyses. We liken our provenance goals to those of Bertillonage, a simple and approximate forensic analysis technique based on biometrics that was developed in 19th century France before the advent of fingerprints. As an example, we have developed a fast, simple, and approximate technique called anchored signature matching for identifying the source origin of binary libraries within a given Java application. This technique involves a type of structured signature matching performed against a database of candidates drawn from the Maven2 repository, a 275 GB collection of open source Java libraries. To show that the approach is both valid and effective, we conducted an empirical study on 945 jars from the Debian GNU/Linux distribution, as well as an industrial case study on 81 jars from an e-commerce application.
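The matching idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the signature strings, the archive names, the use of SHA1 fingerprints as index keys, and the Jaccard ranking are all assumptions made for illustration. The sketch hashes each class signature, indexes a small hypothetical corpus by fingerprint, and ranks candidate archives by how much of their fingerprint set overlaps with the subject's.

```python
import hashlib
from collections import defaultdict


def fingerprint(signature: str) -> str:
    """Hash one class signature (using SHA1 here is an assumption)."""
    return hashlib.sha1(signature.encode("utf-8")).hexdigest()


def build_index(corpus):
    """Map each fingerprint to the set of archives that contain it."""
    index = defaultdict(set)
    for archive, signatures in corpus.items():
        for sig in signatures:
            index[fingerprint(sig)].add(archive)
    return index


def rank_candidates(subject_signatures, corpus, index):
    """Rank archives by Jaccard similarity of their fingerprint sets."""
    subject = {fingerprint(s) for s in subject_signatures}
    # Only archives sharing at least one fingerprint need to be scored.
    candidates = set()
    for fp in subject:
        candidates |= index.get(fp, set())
    scored = []
    for archive in candidates:
        other = {fingerprint(s) for s in corpus[archive]}
        score = len(subject & other) / len(subject | other)
        scored.append((score, archive))
    # Best score first; ties broken by archive name for stable output.
    return sorted(scored, key=lambda t: (-t[0], t[1]))


# Hypothetical corpus: archive name -> list of class signature strings.
corpus = {
    "lib-1.0.jar": ["class a.A { void f(); }", "class a.B { int g(int); }"],
    "lib-1.1.jar": ["class a.A { void f(); }", "class a.B { int g(int); }",
                    "class a.C { }"],
    "other-2.0.jar": ["class z.Z { }"],
}
index = build_index(corpus)
subject = ["class a.A { void f(); }", "class a.B { int g(int); }",
           "class a.C { }"]
ranked = rank_candidates(subject, corpus, index)  # lib-1.1.jar ranks first
```

Looking up candidates through the index first keeps the scoring step limited to archives that share at least one class with the subject, which matters when the corpus is large.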


Footnotes
1
The interdependence of the Bertillonage bio-metrics was recognized by Francis Galton, and it inspired him to devise the notion of statistical correlation.
 
2
The GPL Compliance Engineering Guide recommends the extraction of literal strings to determine potential licensing violations (Hemel 2010).
 
3
This is analogous to a policeman asking a suspect for her/his name and expecting a correct answer.
 
4
Identifying the class’s own fully qualified name is deterministic. Indeterminism arises only when we try to resolve internal references that point to other classes.
 
6
Debian pushes critical security updates out to its stable releases. These usually represent the smallest possible changes necessary to patch the discovered security holes.
 
7
We were unable to process the beta implementations of generics sometimes found in Java 1.4 class files produced by a few brave bleeding-edge developers of that time.
 
8
Our source code contains the full list of signature canonicalizations that we apply. The source code is available to download from our replication package: http://juliusdavies.ca/2013/j.emse/bertillonage/.
 
9
We suspect a file named servlet-api-2.5.jar is the true origin of this large equivalence class of perfect matches. JSP & Servlet technologies have been an important part of Java’s popularity on servers for over 10 years, and servlet-api-2.5.jar is a critical interface library, originally published by Sun Microsystems, which all Java web and application servers must implement, including Tomcat, JBoss, Glassfish, Jetty, and many others. The 6.1.12 in this case probably comes from a version of Jetty. The Jetty project tends to rename its own critical dependencies so that they contain Jetty’s own version number alongside the original dependency’s version number.
 
10
Values are rounded to nearest 10,000.
 
11
We only count outer classes. Class files containing a $ (dollar-sign) character in their name are assumed to be inner classes, and are not included in these tallies. For example, only 3 of the class files listed earlier in Table 2 would count: A.class, B.class, and C.class, since these do not contain $ in their names.
 
12
The chance of a birthday collision from SHA1 in our data set is less than 10^-18.
 
14
Unfortunately, we did not instrument our tools to collect unzip timings.
 
15
See email from Bob Lee to dev@hc.apache.org on 18 Mar 2010 23:47:14 GMT, subject “Re: HttpClient in Android”.
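The counting rule in footnote 11 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' tooling: the function name and the file list are hypothetical, and the `$` test is applied to the file's base name only, so that directory components do not affect the result.

```python
def count_outer_classes(class_files):
    """Count class files whose base name contains no '$' character."""
    count = 0
    for path in class_files:
        # Assumption: only the file name matters, not its directories.
        name = path.rsplit("/", 1)[-1]
        if name.endswith(".class") and "$" not in name:
            count += 1
    return count


# Hypothetical entry list; names with '$' are treated as inner classes.
files = [
    "a/A.class",
    "a/A$1.class",
    "a/B.class",
    "a/B$Inner.class",
    "a/C.class",
    "META-INF/MANIFEST.MF",
]
outer = count_outer_classes(files)
```

For a real archive, the entry names could be obtained with `zipfile.ZipFile(path).namelist()` from the Python standard library and passed to the same function.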
 
References
Cubranic D, Murphy GC, Singer J, Booth KS (2005) Hipikat: a project memory for software development. IEEE Trans Softw Eng 31(6):446–465
Davies J (2011) Measuring subversions: security and legal risk in reused software artifacts. In: Taylor RN, Gall H, Medvidovic N (eds) ICSE, ACM, pp 1149–1151
Davies J, Germán DM, Godfrey MW, Hindle A (2011) Software bertillonage: finding the provenance of an entity. In: van Deursen A, Xie T, Zimmermann T (eds) Proceedings of the 8th international working conference on mining software repositories, MSR 2011 (Co-located with ICSE), IEEE. Waikiki, Honolulu, HI, USA, May 21–28, pp 183–192
Di Penta M, Germán DM, Antoniol G (2010) Identifying licensing of jar archives using a code-search approach. In: MSR’10 Proc. of the intl. working conf. on mining software repositories, pp 151–160
Germán DM, Di Penta M, Guéhéneuc YG, Antoniol G (2009) Code siblings: technical and legal implications of copying code between applications. In: MSR ’09: Proc. of the Working Conf. on Mining Software Repositories, pp 81–90
Godfrey M, Zou L (2005) Using origin analysis to detect merging and splitting of source code entities. IEEE Trans Softw Eng 31(2):166–181
Hemel A, Kalleberg KT, Vermaas R, Dolstra E (2011) Finding software license violations through binary code clone detection. In: van Deursen A, Xie T, Zimmermann T (eds) Proceedings of the 8th international working conference on mining software repositories, MSR 2011 (Co-located with ICSE), IEEE. Waikiki, Honolulu, HI, USA, May 21–28, pp 63–72
Holmes R, Walker RJ (2010) Customized awareness: recommending relevant external change events. In: Kramer J, Bishop J, Devanbu PT, Uchitel S (eds) ICSE (1), ACM, pp 465–474
Holmes R, Walker RJ, Murphy GC (2006) Approximate structural context matching: an approach to recommend relevant examples. IEEE Trans Softw Eng 32(12):952–970
Houck MM, Siegel JA (2006) Fundamentals of forensic science. Academic Press
Kamiya T, Kusumoto S, Inoue K (2002) Ccfinder: A multilinguistic token-based code clone detection system for large scale source code. IEEE Trans Softw Eng 28(7):654–670
Kapser C, Godfrey MW (2008) ‘Cloning considered harmful’ considered harmful: patterns of cloning in software. Empir Software Eng 13(6):645–692
Kersten M, Murphy GC (2005) Mylar: a degree-of-interest model for ides. In: Mezini M, Tarr PL (eds) AOSD. ACM, pp 159–168
Kim M, Sazawal V, Notkin D, Murphy G (2005) An empirical study of code clone genealogies. ESEC/FSE 30(5):187–196
Krinke J (2008) Is cloned code more stable than non-cloned code? In: SCAM’08, pp 57–66
Livieri S, Higo Y, Matsushita M, Inoue K (2007) Very-large scale code clone analysis and visualization of open source programs using distributed ccfinder: D-ccfinder. In: ICSE, pp 106–115
Lozano A (2008) A methodology to assess the impact of source code flaws in changeability and its application to clones. In: ICSM 08: Proc. of the int. conf. of software maintenance, pp 424–427
Lozano A, Wermelinger M, Nuseibeh B (2007) Evaluating the harmfulness of cloning: a change based experiment. In: MSR ’07: proc. of the 4th int. workshop on mining soft. Repositories, p 18
Ossher J, Sajnani H, Lopes CV (2011) File cloning in open source java projects: the good, the bad, and the ugly. In: ICSM, IEEE, pp 283–292
Robillard MP, Walker RJ, Zimmermann T (2010) Recommendation systems for software engineering. IEEE Softw 27(4):80–86
Siegel J, Saukko P, Knupfer G (2000) Encyclopedia of forensic sciences. Academic Press
Thummalapenta S, Cerulo L, Aversano L, Di Penta M (2009) An empirical study on the maintenance of source code clones. Empir Software Eng 15(1):1–34
Metadata
Title
Software Bertillonage
Determining the provenance of software development artifacts
Authors
Julius Davies
Daniel M. German
Michael W. Godfrey
Abram Hindle
Publication date
1 December 2013
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 6/2013
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-012-9199-7
