Published in: Empirical Software Engineering 6/2023

1 November 2023

The software heritage license dataset (2022 edition)

Authors: Jesus M. Gonzalez-Barahona, Sergio Montes-Leon, Gregorio Robles, Stefano Zacchiroli

Abstract

Context:

When software is released publicly, it is common to include with it either the full text of the license or licenses under which it is published, or a detailed reference to them. Therefore, public licenses, including FOSS (free and open source software) licenses, are usually publicly available in source code repositories.

Objective:

To compile a dataset containing as many documents as possible that contain the text of software licenses or references to license terms, and then to characterize the dataset so that it can be used for further research or for practical purposes related to license analysis.

Method:

Retrieve from Software Heritage, the largest publicly available archive of FOSS source code, all versions of all files whose names are commonly used to convey licensing terms. All retrieved documents are then characterized in various ways, using automated and manual analyses.

Results:

The dataset consists of 6.9 million unique license files. Additional metadata about shipped license files is also provided, making the dataset ready to use in various contexts; the metadata includes file length measures, MIME type, SPDX license (detected using ScanCode), and oldest appearance. The results of a manual analysis of 8102 documents are also included, providing a ground truth for further analysis. The dataset is released as open data in the form of an archive file containing all deduplicated license files, plus several portable CSV files with metadata that reference files via cryptographic checksums.

Conclusions:

Thanks to the extensive coverage of Software Heritage, the dataset presented in this paper covers a very large fraction of all software licenses for public code. We have assembled a large body of software licenses, characterized it quantitatively and qualitatively, and validated that it is mostly composed of licensing information and includes almost all known license texts. The dataset can be used to conduct empirical studies on open source licensing, to train automated license classifiers, for natural language processing (NLP) analyses of legal texts, and for historical and phylogenetic studies on FOSS licensing. It can also be used in practice to improve tools that detect licenses in source code.
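
The Results paragraph says the metadata is shipped as several CSV files that refer to license files by cryptographic checksum. The following is a minimal sketch of how such tables could be combined on that shared key. The table contents, the column names (sha1, mime_type, line_count, spdx_id) and the shortened checksums are illustrative assumptions, not the dataset's actual schema.

```python
import csv
import io

# Hypothetical excerpts of two metadata tables. Column names, values and the
# shortened placeholder checksums are assumptions made for illustration only.
FILE_INFO_CSV = """sha1,mime_type,line_count
1111aaaa,text/plain,21
2222bbbb,text/html,340
"""

LICENSE_CSV = """sha1,spdx_id
1111aaaa,MIT
2222bbbb,Apache-2.0
"""


def index_by_checksum(csv_text, key="sha1"):
    """Parse CSV text and return a dict mapping checksum -> row dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def join_on_checksum(*tables):
    """Merge rows that share a checksum across several indexed tables."""
    merged = {}
    for table in tables:
        for checksum, row in table.items():
            merged.setdefault(checksum, {}).update(row)
    return merged


records = join_on_checksum(
    index_by_checksum(FILE_INFO_CSV),
    index_by_checksum(LICENSE_CSV),
)

# Example query: license identifiers of all files detected as plain text.
plain_text_licenses = sorted(
    r["spdx_id"] for r in records.values() if r.get("mime_type") == "text/plain"
)
print(plain_text_licenses)
```

Because every table is keyed by the same checksum, the join needs no knowledge of file names or repository paths; with the real files, the in-memory strings would be replaced by opened CSV files.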

Footnotes
2
The version of the dataset discussed in this paper is available at https://annex.softwareheritage.org/public/dataset/license-blobs/2022-04-25/; other versions of the dataset (both past and future) are available starting from https://annex.softwareheritage.org/public/dataset/license-blobs/
 
3
Software Heritage is an archival project established in 2015 with the stated goal to collect, preserve forever, and make publicly available the entire body of software, in the preferred form for making modifications to it. A detailed description of the project is out of scope for this paper; we therefore refer the interested reader to previous publications about the project (Di Cosmo and Zacchiroli 2017; Abramatic et al. 2018), its homepage at https://www.softwareheritage.org, and the archive status page at https://archive.softwareheritage.org (accessed 2022-10-20), where one can find an up-to-date view of the software origins that are periodically crawled to populate the archive.
 
4
OSI (Open Source Initiative): https://opensource.org
 
5
OSI Approved licenses: https://opensource.org/licenses-draft (accessed on 2022-10-30)
 
6
SPDX license list: https://spdx.org/licenses/ (accessed on 2022-10-30)
 
7
ScanCode LicenseDB:
 
12
See the “How to apply the Apache License to your work” part of the Apache 2.0 license for an example of a license reference: https://www.apache.org/licenses/LICENSE-2.0 (accessed 2022-11-10).
 
13
If the document was found under several different filenames, as can happen, it will appear in the index once for each different filename.
 
14
Version used: ScanCode 31.2.1.
 
16
The complete SQL query is available as part of the dataset replication package (Gonzalez-Barahona et al. 2023), in the replication-package.tar.gz file.
 
19
SWHID swh:1:cnt:36406a1eee032e80a284d3ed9f5176bba67be064
 
20
SWHID swh:1:cnt:cdc98c898b1d257ddb4752ee7a1c85ed3ddf5673
 
21
SWHID swh:1:cnt:2e26bf237427aaa56f99846acb1aeb94198119e9
 
22
SWHID swh:1:cnt:606a3bce98a4ade7d80c2761b8458d79438a3c6f
 
23
SWHID swh:1:cnt:78ec4db8002adeae4fcbfa5f56b3c1e51bfaf8c5
 
25
SWHID swh:1:cnt:c7f43dd49cbedb819fc247b3bfe5ae45841738dc
 
26
SWHID swh:1:cnt:9ea952f4a37478f17f2a2aafb45ced7a4df67de2
 
27
SWHID swh:1:cnt:aa3157cb23f7de5d062ab5d0bf0ffb44bb719df9
 
28
SWHID swh:1:cnt:509b6082ee6debe85c005d80f047668d70dd1cb8
 
29
SWHID swh:1:cnt:f961852cee6ee9e9a0b8a25af5d090ddb6abe6a8
 
30
SWHID swh:1:cnt:711ded4ae27c43ba18a71ad05e9466a268e4387a
 
31
SWHID swh:1:cnt:46ae7b2bee342168dc48d6ca7fa1753b98e525d8
 
32
SWHID swh:1:cnt:62319023a68b04f23ea30931bb1a7c1a3e741fba
 
33
SWHID swh:1:cnt:eb9ed7bfc458af9796b59426d54d0f97a199078f
 
34
SWHID swh:1:cnt:b864764d9fc4d55eb09e123e42ede11519556d18
 
35
SWHID swh:1:cnt:9bffa2d5a63151c8c9bf3d68e9f9445558273612
 
36
SWHID swh:1:cnt:c53a6c27009183d8304d26a213b1321bdfc0cb8d
 
37
SWHID swh:1:cnt:41a6fc531459dde48d1752f24eae007047361709
 
38
SWHID swh:1:cnt:4e5eebfdbebefe990e309ecbdd83842035d3852c
 
39
SWHID swh:1:cnt:105961e3702324fadaa808457338a984101d6028
 
40
SWHID swh:1:cnt:f3932de6d7f19b26afaa7bc8502c800476c2f0a5
 
41
SWHID swh:1:cnt:fed8329964dd68adcd3dc98dd405950e53614282
 
42
SWHID swh:1:cnt:60ff9a40c14915b25d265f2bdfb508274b6782fe
 
43
SWHID swh:1:cnt:ace0bbb7fe0a8677ef5ae001b5da076b2aa666a5
 
44
SWHID swh:1:cnt:9392142a987ee04c3f0d303a58b19df818df86b3
 
45
SWHID swh:1:cnt:eb531dc6990ca433ccde3100633780ad55aed22b
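
The identifiers in the footnotes above are Software Heritage persistent identifiers (SWHIDs) for file contents. As a sketch of how such an identifier relates to the bytes it names: a version 1 content SWHID carries the SHA-1 of the content prefixed with a Git-style blob header, so it differs from a plain SHA-1 of the same bytes. The function names below are illustrative and not part of any Software Heritage library.

```python
import hashlib
import re

# Core form of a version 1 content SWHID: swh:1:cnt:<40 lowercase hex digits>.
CONTENT_SWHID_RE = re.compile(r"^swh:1:cnt:([0-9a-f]{40})$")


def content_swhid(data: bytes) -> str:
    """Compute the content SWHID of a byte string.

    The digest is a SHA-1 over the header b"blob <length>\\0" followed by
    the content itself, which is how Git hashes blob objects.
    """
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def parse_content_swhid(swhid: str) -> str:
    """Return the hex digest embedded in a content SWHID, or raise ValueError."""
    match = CONTENT_SWHID_RE.match(swhid.strip())
    if match is None:
        raise ValueError(f"not a version 1 content SWHID: {swhid!r}")
    return match.group(1)


# The empty file has a well-known Git blob hash.
print(content_swhid(b""))  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# Extract the digest from one of the identifiers listed in the footnotes.
print(parse_content_swhid("swh:1:cnt:36406a1eee032e80a284d3ed9f5176bba67be064"))
```

A locally stored license file can thus be checked against a footnoted identifier by recomputing content_swhid over its bytes and comparing the result.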
 
46
licen and licens are Python modules for dealing with the Document Collection.
 
47
path_from_filename is a function returning the path of a document in the collection, given its name (SHA1)
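
To illustrate the idea of mapping a SHA1 name to a location in a large collection, here is a hypothetical helper. It is not the dataset's path_from_filename: the root directory name "blobs" and the two-level sharding by leading hex pairs are assumptions chosen for the example, and the real layout may differ.

```python
import hashlib
from pathlib import PurePosixPath

HEX_DIGITS = set("0123456789abcdef")


def sharded_path(sha1_hex: str, root: str = "blobs", levels: int = 2) -> PurePosixPath:
    """Map a SHA1 hex name to a path under a sharded directory tree.

    Hypothetical layout: root/<first 2 hex>/<next 2 hex>/<full sha1>.
    Sharding keeps any single directory from holding millions of files.
    """
    name = sha1_hex.lower()
    if len(name) != 40 or not set(name) <= HEX_DIGITS:
        raise ValueError(f"not a SHA1 hex digest: {sha1_hex!r}")
    shards = [name[2 * i : 2 * i + 2] for i in range(levels)]
    return PurePosixPath(root, *shards, name)


# Usage: derive a name from some bytes, then locate it in the assumed layout.
example_name = hashlib.sha1(b"example license text").hexdigest()
example_path = sharded_path(example_name)
print(example_path)
```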
 
48
For a full, ready-to-work program, check the file truth/random_forest.py in the dataset
 
49
Software Heritage archive changelog page:
 
References
Abramatic JF, Di Cosmo R, Zacchiroli S (2018) Building the universal archive of source code. Communications of the ACM 61(10):29–31
Allançon T, Pietri A, Zacchiroli S (2021) The software heritage filesystem (swhfs): Integrating source code archival with development. In 43rd IEEE/ACM International Conference on Software Engineering: Companion Proceedings, ICSE Companion 2021, Madrid, Spain, May 25-28, 2021, pages 45–48. IEEE
Bird S (2006) NLTK: the natural language toolkit. In Nicoletta Calzolari, Claire Cardie, and Pierre Isabelle, editors, ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006. The Association for Computer Linguistics
Boldi P, Pietri A, Vigna S, Zacchiroli S (2020) Ultra-large-scale repository analysis via graph compression. In SANER 2020: The 27th IEEE International Conference on Software Analysis, Evolution and Reengineering. IEEE, 2020
Caneill M, Germán DM, Zacchiroli S (2017) The debsources dataset: Two decades of free and open source software. Empirical Software Engineering 22:1405–1437
Collet Y (2022) RFC 8878 - Zstandard compression and the “application/zstd” media type, 2021. Accessed 2022-01-24
Di Cosmo R, Gruenpeter M, Zacchiroli S (2018) Identifiers for digital objects: the case of software source code preservation. In Proceedings of the 15th International Conference on Digital Preservation, iPRES 2018, Boston, USA
Di Cosmo R, Zacchiroli S (2017) Software Heritage: Why and how to preserve software source code. In Proceedings of the 14th International Conference on Digital Preservation, iPRES 2017
Di Penta M, German DM, Guéhéneuc YG, Antoniol G (2010) An exploratory study of the evolution of software licensing. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE ’10, pages 145–154, New York, NY, USA, 2010. Association for Computing Machinery
Dyer R, Nguyen HA, Rajan H, Nguyen TN (2015) Boa: Ultra-large-scale software repository and source-code mining. ACM Trans. Softw Eng Methodol 25(1):7:1–7:34
Flint SW, Chauhan J, Dyer R (2021) Escaping the time pit: Pitfalls and guidelines for using time-based git data. In 18th IEEE/ACM International Conference on Mining Software Repositories, MSR 2021, Madrid, Spain, May 17-19, 2021, pages 85–96. IEEE, 2021
Gandhi RA, Germonprez M, Link GJP (2018) Open data standards for open source software risk management routines: An examination of SPDX. In Forte A, Prilla M, Vivacqua AS, Müller C, and Lionel P. Robert Jr., editors, Proceedings of the 2018 ACM Conference on Supporting Groupwork, GROUP 2018, Sanibel Island, FL, USA, January 07 - 10, pages 219–229. ACM, 2018
German DM, Di Penta M, Davies J (2010) Understanding and auditing the licensing of open source software distributions. In 2010 IEEE 18th International Conference on Program Comprehension, pages 84–93
German DM, González-Barahona JM (2009) An empirical study of the reuse of software licensed under the GNU General Public License. In Boldyreff C, Crowston K, Lundell B, and Wasserman AI, editors, Open Source Ecosystems: Diverse Communities Interacting, pages 185–198, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg
German DM, Hassan AE (2009) License integration patterns: Addressing license mismatches in component-based development. In 2009 IEEE 31st International Conference on Software Engineering, pages 188–198
Germán DM, Manabe Y, Inoue K (2010) A sentence-matching method for automatic license identification of source code files. In Pecheur C, Andrews J, and Di Nitto E, editors, ASE 2010, 25th IEEE/ACM International Conference on Automated Software Engineering, Antwerp, Belgium, September 20-24, pages 437–446. ACM, 2010
Germán DM, Di Penta M (2012) A method for open source license compliance of java applications. IEEE Softw 29(3):58–63
Gobeille R (2008) The fossology project. In Hassan AE, Lanza M, and Godfrey MW, editors, Proceedings of the 2008 International Working Conference on Mining Software Repositories, MSR 2008 (Co-located with ICSE), Leipzig, Germany, May 10-11, 2008, Proceedings, pages 47–50. ACM
Gomulkiewicz RW (2009) Open source license proliferation: Helpful diversity or hopeless confusion. Wash. UJL & Pol’y 30:261
Gousios G, Spinellis D (2012) Ghtorrent: Github’s data from a firehose. In Lanza M, Di Penta M, and Xie T, editors, 9th IEEE Working Conference of Mining Software Repositories, MSR, pages 12–21. IEEE Computer Society, 2012
Harutyunyan N (2020) Managing your open source supply chain-why and how? Computer 53(6):77–81
Lindberg V (2008) Intellectual property and open source: a practical guide to protecting. O’Reilly Media, Inc., 2008
Ma Y, Dey T, Bogart C, Amreen S, Valiev M, Tutko A, Kennard D, Zaretzki R, Mockus A (2021) World of code: enabling a research workflow for mining and analyzing the universe of open source VCS data. Empir Softw Eng 26(2):22
Manabe Y, German DM, Inoue K (2014) Analyzing the relationship between the license of packages and their files in free and open source software. In Corral L, Sillitti A, Succi G, Vlasenko J, and Wasserman AI, editors, Open Source Software: Mobile Open Source Technologies, pages 51–60, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg
Manabe Y, Hayase Y, Inoue K (2010) Evolutional analysis of licenses in FOSS. In Andrea Capiluppi, Anthony Cleve, and Naouel Moha, editors, Proceedings of the Joint ERCIM Workshop on Software Evolution (EVOL) and International Workshop on Principles of Software Evolution (IWPSE), Antwerp, Belgium, September 20-21, 2010, pages 83–87. ACM, 2010
Maryka T, Germán DM, Poo-Caamaño G (2015) On the variability of the BSD and MIT licenses. In Ernesto Damiani, Fulvio Frati, Dirk Riehle, and Anthony I. Wasserman, editors, Open Source Systems: Adoption and Impact - 11th IFIP WG 2.13 International Conference, OSS 2015, Florence, Italy, May 16-17, 2015, Proceedings, volume 451 of IFIP Advances in Information and Communication Technology, pages 146–156. Springer, 2015
Maryka T, German DM, Poo-Caamaño G (2015) On the variability of the bsd and mit licenses. In: Damiani Ernesto, Frati Fulvio, Riehle Dirk, Wasserman Anthony I (eds) Open Source Systems: Adoption and Impact (OSS 2015). Springer International Publishing, Cham, pp 146–156
McKinney W et al (2011) Pandas: a foundational python library for data analysis and statistics. Python for high performance and scientific computing 14(9):1–9
Ombredanne P (2020) Free and open source software license compliance: Tools for software composition analysis. Computer 53(10):105–109
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825–2830
Phipps S, Zacchiroli S (2020) Continuous open source license compliance. Computer 53(12):115–119
Pietri A, Spinellis D, Zacchiroli S (2019) The Software Heritage graph dataset: public software development under one roof. In Storey MAD, Adams B, and Haiduc S, editors, Proceedings of the 16th International Conference on Mining Software Repositories, MSR 2019, 26-27 May 2019, Montreal, Canada, pages 138–142. IEEE / ACM
Rosen L (2005) Open source licensing, volume 692. Prentice Hall
Rousseau G, Di Cosmo R, Zacchiroli S (2020) Software provenance tracking at the scale of public source code. Empirical Software Engineering 25(4):2930–2959
Shafranovich Y (2005) RFC 4180 - common format and MIME type for comma-separated values (CSV) files, 2005. Accessed 2022-01-24
Srinivasa-Desikan B (2018) Natural Language Processing and Computational Linguistics: A practical guide to text analysis with Python, Gensim, spaCy, and Keras. Packt Publishing Ltd, 2018
Stewart K, Odence P, Rockett E (2010) Software package data exchange (SPDX) specification. IFOSS L Rev 2:191
Vendome C, Bavota G, Di Penta M, Vásquez ML, Germán DM, Poshyvanyk D (2017) License usage and changes: a large-scale study on GitHub. Empir Softw Eng 22(3):1537–1577
Vendome C, Linares-Vásquez M, Bavota G, Di Penta M, German DM, Poshyvanyk D (2015) When and why developers adopt and change software licenses. In 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 31–40
Vendome C, Vásquez ML, Bavota G, Di Penta M, Germán DM, Poshyvanyk D (2017) Machine learning-based detection of open source license exceptions. In Sebastián Uchitel, Alessandro Orso, and Martin P. Robillard, editors, Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017, pages 118–129. IEEE / ACM, 2017
Xu S, Gao Y, Fan L, Liu Z, Liu Y, Ji H (2023) Lidetector: License incompatibility detection for open source software. ACM Trans. Softw Eng Methodol 32(1)
Zacchiroli S (2022) A large-scale dataset of (open source) license text variants. In The 2022 Mining Software Repositories Conference (MSR 2022), pages 757–761. ACM, 2022
Zhang D, Luo P, Tang W, Zhou M (2021) Osldetector: Identifying open-source libraries through binary analysis. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, ASE ’20, pages 1312–1315, New York, NY, USA, 2021. Association for Computing Machinery
Metadata
Title
The software heritage license dataset (2022 edition)
Authors
Jesus M. Gonzalez-Barahona
Sergio Montes-Leon
Gregorio Robles
Stefano Zacchiroli
Publication date
1 November 2023
Publisher
Springer US
Published in
Empirical Software Engineering, Issue 6/2023
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-023-10377-w
