There is a newer version of the record available.

Published July 31, 2023 | Version 2022-04-25
Dataset Open

The Software Heritage License Dataset (2022 Edition)

  • 1. Universidad Rey Juan Carlos, Madrid, Spain
  • 2. LTCI, Télécom Paris, Institut Polytechnique de Paris, Paris, France

Description

This dataset contains all “license files” extracted from a snapshot of the Software Heritage archive taken on 2022-04-25 (other, possibly more recent, versions of the dataset can be found at https://annex.softwareheritage.org/public/dataset/license-blobs/).

In this context, a license file is a unique file content (or “blob”) that appeared in a software origin archived by Software Heritage as a file whose name is often used to ship licenses in software projects. Some example names are: COPYING, LICENSE, NOTICE, COPYRIGHT, etc. The exact file name pattern used to select the blobs contained in the dataset can be found in the SQL query file 01-select-blobs.sql. Note that the file was not required to be at the project root, because project subdirectories can contain licenses that differ from the top-level one, and we wanted to include those too.
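
Selection is therefore based on the file name alone, wherever the file sits in the tree. A minimal sketch of that idea follows. The pattern is an illustrative assumption built only from the four example names above (matched as a prefix of the base name); it is not the actual pattern, which is the one in 01-select-blobs.sql, and the function name is hypothetical.

```python
import posixpath
import re

# Hypothetical pattern derived only from the example names in the text.
# The real selection pattern lives in 01-select-blobs.sql and may differ.
EXAMPLE_NAME_RE = re.compile(r"^(COPYING|LICENSE|NOTICE|COPYRIGHT)")


def looks_like_license_file(path: str) -> bool:
    """Return True if the last path component starts with an example name."""
    name = posixpath.basename(path)
    return EXAMPLE_NAME_RE.match(name) is not None


# The directory part of the path is ignored, so nested files qualify too.
for p in ["LICENSE", "vendor/libfoo/COPYING.GPL3", "src/main.c"]:
    print(p, looks_like_license_file(p))
```

Only the base name is tested, which is why a license shipped in a vendored subdirectory is picked up just like a top-level one.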

Format

The dataset is organized as follows:

  • blobs.tar.zst: a Zst-compressed tarball containing deduplicated license blobs, one per file. The tarball contains 6’859’189 blobs, for a total uncompressed size on disk of 66 GiB.

    The blobs are organized in a sharded directory structure that contains files named like blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02, where:

    • blobs/ is the root directory containing all license blobs

    • 8624bcdae55baeef00cd11d5dfcfa60f68710a02 is the SHA1 checksum of a specific license blob, a copy of the GPL3 license in this case. Each license blob is ultimately named with its SHA1:

      $ head -n 3 blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
                          GNU GENERAL PUBLIC LICENSE
                             Version 3, 29 June 2007
      
      $ sha1sum blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
      8624bcdae55baeef00cd11d5dfcfa60f68710a02  blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
    • 86 and 24 are, respectively, the first and second group of two hex digits in the blob SHA1

    One blob is missing because its size (313 MB) prevented its inclusion (it was originally a tarball containing source code):

    swh:1:cnt:61bf63793c2ee178733b39f8456a796b72dc8bde,1340d4e2da173c92d432026ecdc54b4859fe9911,"AUTHORS"
  • blobs-sample20k.tar.zst: analogous to blobs.tar.zst, but containing “only” 20’000 randomly selected license blobs

  • license-blobs.csv.zst: a Zst-compressed CSV index of all the blobs in the dataset. Each line in the index (except the first one, which contains column headers) describes a license blob and is in the format SWHID,SHA1,NAME, for example:

      swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING"
      swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GPL3"
      swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GLP-3"

    where:

    • SWHID: the Software Heritage persistent identifier of the blob. It can be used to retrieve and cross-reference the license blob via the Software Heritage archive, e.g., at: https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2

    • SHA1: the blob SHA1, which can be used to cross-reference blobs in the blobs/ directory

    • NAME: a file name given to the license blob in a given software origin. As the same license blob can have different names in different contexts, the index contains multiple entries for the same blob with different names, as is the case in the example above (yes, one of those has a typo in it, but it’s an original typo from some repository!).

  • blobs-fileinfo.csv.zst: a Zst-compressed CSV mapping from blobs to basic file information in the format: SHA1,MIME_TYPE,ENCODING,LINE_COUNT,WORD_COUNT,SIZE, where:

    • SHA1: blob SHA1
    • MIME_TYPE: blob MIME type, as detected by libmagic
    • ENCODING: blob character encoding, as detected by libmagic
    • LINE_COUNT: number of lines in the blob (only for textual blobs with UTF8 encoding)
    • WORD_COUNT: number of words in the blob (only for textual blobs with UTF8 encoding)
    • SIZE: blob size in bytes
  • blobs-scancode.csv.zst: a Zst-compressed CSV mapping from blobs to the software licenses detected in them by ScanCode, in the format: SHA1,LICENSE,SCORE, where:

    • SHA1: blob SHA1
    • LICENSE: license detected in the blob, as an SPDX identifier (or ScanCode identifier for non-SPDX-indexed licenses)
    • SCORE: confidence score in the result, as a decimal number between 0 and 100

    There may be zero or arbitrarily many lines for each blob.

  • blobs-scancode.ndjson.zst: a Zst-compressed line-delimited JSON file, containing a superset of the information in blobs-scancode.csv.zst. Each line is a JSON dictionary with three keys:

    • sha1: blob SHA1
    • licenses: output of scancode.api.get_licenses(..., min_score=0)
    • copyrights: output of scancode.api.get_copyrights(...)

    There is exactly one line for each blob. The licenses and copyrights keys are omitted for files not detected as plain text.

  • blobs-origins.csv.zst: a Zst-compressed CSV mapping of where license blobs come from. Each line in the index associates a license blob with one of its origins in the format SWHID<TAB>URL, for example:

      swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2  https://github.com/pombreda/Artemis

    Note that a license blob can come from many different places; only an arbitrary (and somewhat random) one is listed in this mapping.

    If no origin URL is found in the Software Heritage archive, then a blank is used instead. This happens when the blob’s origins were either still being loaded when the dataset was generated, or the loader process crashed before completing their ingestion.

  • blobs-nb-origins.csv.zst: a Zst-compressed CSV mapping of how many origins of each blob are known to Software Heritage. Each line in the index associates a license blob with this count in the format SWHID<TAB>NUMBER, for example:

      swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2  2822260

    Two blobs are missing because the computation crashed:

      swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
      swh:1:cnt:8b137891791fe96927ad78e64b0aad7bded08bdc

    This issue will be fixed in a future version of the dataset.

  • blobs-earliest.csv.zst: a Zst-compressed CSV mapping from blobs to information about their (earliest) known occurrence(s) in the archive. Format: SWHID<TAB>EARLIEST_SWHID<TAB>EARLIEST_TS<TAB>OCCURRENCES, where:

    • SWHID: blob SWHID
    • EARLIEST_SWHID: SWHID of the earliest known commit containing the blob
    • EARLIEST_TS: timestamp of the earliest known commit containing the blob, as a Unix time integer
    • OCCURRENCES: number of known commits containing the blob
  • replication-package.tar.gz: code and scripts used to produce the dataset

  • licenses-annotated-sample.tar.gz: ground truth, i.e., manually annotated random sample of license blobs, with details about the kind of information they contain.
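
The two identifiers used throughout the files above (the plain SHA1 and the SWHID) can both be derived from a blob's bytes. The sketch below shows how the sharded path follows from the SHA1, and how a content SWHID can be computed, assuming the standard SWHID scheme for contents, in which the hash is a SHA1 taken over a Git-style blob header followed by the content. The function names are hypothetical and are not taken from the replication package.

```python
import hashlib


def blob_path(sha1_hex: str) -> str:
    """Map a blob SHA1 to its sharded path inside blobs.tar.zst."""
    return f"blobs/{sha1_hex[0:2]}/{sha1_hex[2:4]}/{sha1_hex}"


def sha1_of(content: bytes) -> str:
    """Plain SHA1 of the bytes, as used for file names and the SHA1 columns."""
    return hashlib.sha1(content).hexdigest()


def content_swhid(content: bytes) -> str:
    """SWHID of a content object: SHA1 over b'blob <length>\\0' + content."""
    header = b"blob %d\x00" % len(content)
    return "swh:1:cnt:" + hashlib.sha1(header + content).hexdigest()


print(blob_path("8624bcdae55baeef00cd11d5dfcfa60f68710a02"))
# blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02

print(content_swhid(b""))
# swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Under this scheme, the two SWHIDs listed as missing from blobs-nb-origins.csv.zst correspond to the empty file and to a file containing a single newline character.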

Changes since the 2021-03-23 dataset

  • More input data, due to the SWH archive growing: more origins in supported forges and package managers; and support for more forges and package managers. See the SWH Archive Changelog for details.

  • Values in the NAME column of license-blobs.csv.zst are quoted, as some file names now contain commas.

  • The replication package now contains all the steps needed to reproduce all artefacts, including the licenseblobs/fetch.py script.

  • blobs-nb-origins.csv.zst is added.

  • blobs-origins.csv.zst is now generated using the first origin returned by swh-graph’s leaves endpoint, instead of its randomwalk endpoint. This should have no impact on the result, other than a different distribution of “random” origins being picked.

  • blobs-origins.csv.zst was missing ~10% of its results in previous versions of the dataset, due to errors and/or timeouts in its generation; this is now down to 0.02% (1254 of the 6859445 unique blobs). Blobs with no known origins are now present, with a blank instead of a URL.

  • blobs-earliest.csv.zst was missing ~10% of its results in previous versions of the dataset. It is complete now.

  • blobs-scancode.csv.zst is generated with a newer scancode-toolkit version (31.2.1).

  • blobs-scancode.ndjson.zst is added.
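
Because NAME values are now quoted and may contain commas, splitting index lines on "," is no longer safe. A minimal sketch of reading the index with a CSV parser follows, assuming the file has already been decompressed and that its quoting follows the conventions Python's csv module understands. The header text and the last sample row (a name containing a comma) are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Inline stand-in for a decompressed license-blobs.csv; the header wording
# and the final row are hypothetical, the other rows come from the text.
SAMPLE = '''swhid,sha1,name
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING"
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GPL3"
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"LICENSE, old"
'''


def names_by_sha1(lines):
    """Group the file names seen for each blob SHA1."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    names = defaultdict(set)
    for _swhid, sha1, name in reader:
        names[sha1].add(name)
    return names


names = names_by_sha1(io.StringIO(SAMPLE))
print(sorted(names["8624bcdae55baeef00cd11d5dfcfa60f68710a02"]))
# ['COPYING', 'COPYING.GPL3', 'LICENSE, old']
```

The parser keeps "LICENSE, old" as a single field, whereas a naive split on commas would produce four columns for that row.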

Errata

A file named .tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 was present in the initial version of the dataset (published on 2022-11-07). It was removed on 2022-11-09 using these two commands:

pv blobs-fileinfo.csv.zst | zstdcat | grep -v "\.tmp" | zstd -19
pv blobs.tar.zst| zstdcat | tar --delete blobs/13/40/.tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 | zstd -19 -T12

The total uncompressed size was announced as 84 GiB based on the physical size on ext4, but it is actually 66 GiB.
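
One plausible contributor to such a gap is that file systems allocate space in whole blocks, so millions of small files occupy more disk space than the sum of their byte sizes. The sketch below illustrates the rounding, assuming a 4096-byte block size (a common ext4 default, not stated in the source) and ignoring metadata, inline data, and sparse files. The sample sizes are hypothetical.

```python
def allocated_size(size: int, block_size: int = 4096) -> int:
    """Size in bytes after rounding up to a whole number of blocks."""
    blocks = -(-size // block_size)  # ceiling division
    return blocks * block_size


# Hypothetical blob sizes in bytes, for illustration only.
sizes = [35_000, 1_100, 11_500]

apparent = sum(sizes)
physical = sum(allocated_size(s) for s in sizes)
print(apparent, physical)
```

The apparent size is the sum of the byte counts, which is the quantity that matters when comparing against the SIZE column of blobs-fileinfo.csv.zst.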

Citation

If you use this dataset for research purposes, please acknowledge its use by citing one or both of the following papers:

References

The dataset has been built using primarily the data sources described in the following papers:

Notes

Other, possibly more recent, versions of the dataset can be found at https://annex.softwareheritage.org/public/dataset/license-blobs/

Files

Files (16.1 GB)

MD5 checksum                            Size
md5:445a312cc71a742cc3142860d14fc35a    498.4 MB
md5:f970b227f8104513bd6e78c27faab69c    169.1 MB
md5:1b4ad082bf7a19fe61707d2f4c8461e6    150.7 MB
md5:a2c714cd195f8efaa71d74f9c0a176fb    242.0 MB
md5:ecf08711ecf9f07fa589980012756a42    28.0 MB
md5:c5a1479ab874fb689116ee2de3c8317e    122.0 MB
md5:4df5dee73cc943aa55c692641f1f2744    835.7 MB
md5:70530b9177937e513c76ea625bc49cc1    13.7 GB
md5:d40a4f53c9a51183844e5723aff4b301    1.9 kB
md5:5f85770931445f7fca06bd2f354e43ae    302.2 MB
md5:6c0a875197dcb2e13cb96f852bd274f2    827.1 kB
md5:ceee66e82d14ca6b59d517fd31daca64    13.4 kB