1 Introduction
Consider a file A and a file B that have the same program code, where the header of A includes GPL-2.0 and that of B includes Apache-2.0. We have applied our detection method to Debian 7.5, one of the Linux distributions, identified several reasons for license inconsistencies, and determined whether they are potentially illegal.

- RQ1. What are the evolution patterns of license inconsistencies and the underlying reasons? Analyzing the evolution patterns of license inconsistencies might give us insight into the reasons that caused them to appear and disappear. The findings are: license inconsistencies appear, persist and disappear for different reasons. They appear mostly because the original author updates the license while the reusers still use the old version of the files; they persist mostly because the downstream project is not yet synchronized with the upstream project; and they disappear when the downstream project is synchronized with the upstream project.
- RQ2. Is the issue of license inconsistencies properly handled by developers? The findings are: license inconsistencies are mainly caused by distribution latency, and they disappear when the developers synchronize their projects with the upstream projects. They persist because the reusers are still using an old version of the files and have not performed the synchronization. We do not consider this a legal issue.
2 Methodology
2.1 Obtain License Inconsistency Groups for Debian 7.5 and Debian 8.2
We use CCFinder [7] to extract the normalized token sequence of each file. A normalized token sequence is the token sequence of the source code after comments and whitespace are removed and all user-defined identifiers are replaced with a special token. Note that, although CCFinder itself is a clone detection tool, we do not use its full functionality; we only use it to generate the normalized token sequences of source files. By computing and categorizing the hash values of these token sequences, we then create a group for files that have the same normalized token sequence. We call them license inconsistency groups, or groups for short in the rest of this paper. Each group contains at least two different files; i.e., a unique file is not contained in any group.

Ninka [3] is used to identify the license(s) of each file. Ninka identifies the license sentences in the comments of each file and compares them with its license database. It can identify more than 110 different OSS licenses and their versions with 93% accuracy. It reports “NONE” if the file has no license and “UNKNOWN” if the license sentence of the file does not match the database. The result is a list of licenses for each file group.

2.2 Compare the Difference of Groups
2.3 Investigate the Groups Manually
Number of packages | Debian 7.5 | Debian 8.2 |
---|---|---|
Source packages | 17,160 | 20,577 |
Total files | 6,136,637 | 13,124,700 |
.c files | 472,861 | 767,006 |
.cpp files | 224,267 | 335,269 |
.java files | 365,213 | 477,154 |

Number of groups | Debian 7.5 | Debian 8.2 |
---|---|---|
Intersection | 4062 | 4062 |
Relative complement | 2701 | 2947 |
Total | 6763 | 7009 |
3 Results
We analyze .c, .cpp and .java files, which are the file types supported by our detection method. The detection result of license inconsistency groups in Debian 7.5 and 8.2 is shown in Table 2. In this table, intersection means that both versions of Debian contain that license inconsistency group; relative complement means that only that version of Debian contains it. Thus 4062 groups of license inconsistencies are detected in both versions, 2701 groups only in Debian 7.5, and 2947 groups only in Debian 8.2. By examining the groups that are only in Debian 7.5, we can find out how and why license inconsistencies disappeared in Debian 8.2; by examining the groups that are only in Debian 8.2, we can understand how and why license inconsistencies appeared. The intersection contains the groups of license inconsistencies that persisted in Debian 8.2.
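The comparison behind Table 2 can be sketched with plain set operations. The snippet below is a minimal illustration rather than the tooling used in this study: it assumes each group is identified by a hashable key (here a hypothetical hash string of its normalized token sequence), and the function name `compare_groups` and the sample keys are invented for the example.

```python
def compare_groups(groups_old, groups_new):
    """Split group keys into persisted, disappeared and appeared sets.

    groups_old / groups_new: iterables of hashable group keys. The keys are
    assumed to be hashes of normalized token sequences (hypothetical here).
    """
    old, new = set(groups_old), set(groups_new)
    return {
        "persisted": old & new,    # intersection: present in both releases
        "disappeared": old - new,  # relative complement: only in the older release
        "appeared": new - old,     # relative complement: only in the newer release
    }


# Toy keys, not taken from Debian.
debian_7_5 = {"h1", "h2", "h3"}
debian_8_2 = {"h2", "h3", "h4", "h5"}
result = compare_groups(debian_7_5, debian_8_2)
print({name: sorted(keys) for name, keys in result.items()})
```

The totals in Table 2 follow the same identity: 4062 + 2701 = 6763 groups for Debian 7.5 and 4062 + 2947 = 7009 groups for Debian 8.2.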
3.1 Why Do License Inconsistencies Appear?
- Internal copy-and-paste of source files with different licenses. If copies of the same file exist in one project (we call these internal copies), they should also exist in the final distribution (e.g. Debian 7.5). Thus our method reports a license inconsistency if the copies contain different licenses.

  For example, in a project named FreeMedForms, some source files in a plugins directory are under the BSD3 license. These files are copied to other directories with their licenses changed to GPL-3.0+.

  This type of license inconsistency may exist only for a short time if the license difference was a mistake that developers later fixed. On the other hand, if developers decided to distribute these source files under different licenses, it would exist permanently.

- Different versions of the same project are included in the same distribution. Similar to the previous reason, if the final distribution includes different versions of the same project and the licenses of their files differ, those license inconsistencies will be reported by our method.

  For example, a project named groovy contains files that had no license in version 1.7.2 and were given an Apache-2.0 license in version 1.8.6. Both versions of this project are included in Debian 7.5, thus this license inconsistency is reported.

- Upstream and downstream projects both exist in the same distribution. In this research, if project B reuses source files from project A by copy-and-paste, we call project B the downstream project and project A the upstream project. While the previous two reasons concern a single project, this reason involves multiple projects. Files from the upstream project are reused in the downstream project, and the license of these files was changed either by the original author or by the reuser. If both projects are included in the same distribution, the license inconsistency will be reported by our method.

  For example, the remake and kbuild projects were originally derived from the make project, where a license upgrade occurred in 2010. Debian 7.5 includes older versions of make and remake, where the license was still GPL-2.0, while the newer version of kbuild contains the GPL-3.0 license. Thus a license inconsistency is reported for this case.
3.2 Why Do License Inconsistencies Persist?
- Source files are not yet synchronized, but license inconsistencies will eventually disappear when they are. This type of license inconsistency occurs because the downstream project has not yet been synchronized with the upstream project, where the license of the source files was changed. However, since the developers of the downstream project still synchronize it with the upstream regularly, this type of license inconsistency will eventually be eliminated.

  For example, in the case of the projects JSON-lib and jenkins-json described earlier, license inconsistencies appeared in Debian 7.5 (where jenkins-json still uses source files under the Apache-2.0/MIT dual license) but disappeared in Debian 8.2, where the developers of jenkins-json synchronized with the upstream project and all licenses became Apache-2.0 only. Although this case of license inconsistency disappeared in Debian 8.2, we consider it a typical example of why license inconsistencies exist only for a period of time and disappear when the source files are synchronized.

- Downstream project no longer synchronizes from upstream, and license inconsistencies will likely exist permanently. In this case, the developers of the downstream project chose to no longer synchronize from the upstream project, so the license inconsistencies are likely to exist forever unless the synchronization resumes.

  As shown in Fig. 2, among the results we found a project named Mockito which copied and owned several source files from a project named EasyMock. The license of these files in the upstream project was changed from MIT to Apache-2.0 in 2009; however, the Mockito project still uses the original MIT license. In addition, the Mockito project made its own changes to the source code of these files and never again synchronized from EasyMock. After checking the history of these files in the Mockito project, we found that one of the commits in 2007 contains the following commit message: “umbilical cord between mockito package and easymock package is cut!”, which implies that they will never synchronize from the upstream project. Thus this case of license inconsistency is likely to exist permanently, unless the Mockito project decides to synchronize from EasyMock again.
3.3 Why Do License Inconsistencies Disappear?
- Downstream project synchronized from upstream project. When the downstream project synchronizes from the upstream project, the licenses of the source files become the same, and the license inconsistencies disappear.

  Again, from Fig. 1 we can see that Debian 8.2 updated all three of these projects to newer versions in which all of their licenses are upgraded to GPL-3.0, thus this license inconsistency disappears. The case of the projects JSON-lib and jenkins-json discussed earlier also applies here.

- The source code that contained the license inconsistency was removed or changed, and is thus no longer identical. In this research we only inspect identical files: only files that contain the same token sequences are considered to come from the same origin. If the source code of a file changed so much that its token sequence differs from that of the corresponding files, or if the relevant source file was removed in the new version, then the license inconsistency will no longer be reported by our method.

  For example, there is a file in the project icu which is under IBM copyrights in both versions of Debian. This file was reused in the project openjdk-7, but with a GPL-2.0 license, in Debian 7.5. Thus this case of license inconsistency was reported in Debian 7.5. However, the source code of the file in openjdk-7 was changed in Debian 8.2 while the license remained the same. Our method no longer considers them file clones since they have different token sequences, thus the license inconsistency disappears in the newer version of Debian.
4 Discussion
4.1 Revisiting the Research Questions
- Appear. (i) Internal copy-and-paste of source files in a project but their licenses are different; (ii) Different versions of the same project are included in the same distribution; (iii) Upstream and downstream projects both exist in the same distribution.
- Persist. (i) Source files are not yet synchronized, but license inconsistencies will eventually disappear when they are; (ii) Downstream project no longer synchronizes from upstream, and license inconsistencies are likely to persist permanently.
- Disappear. (i) Downstream project synchronized from upstream project; (ii) Source code which contained license inconsistency was removed or changed and we consider them as different files.
4.2 Effectiveness of This Approach
4.3 Threats to Validity
The accuracy of our results depends on the normalized token sequences generated by CCFinder and the license identification from Ninka. For file clone detection, we use normalized token sequences as the criterion for deciding clones. If source files are modified substantially (e.g. by adding or removing several statements), they might not be recognized as clones. Approaches that detect similar, rather than identical, source files could mitigate this problem. Regarding license identification, Ninka is a state-of-the-art license identification tool with an accuracy of 93% [3]. As shown in Sect. 3, our manual analysis also confirms its high accuracy and precision.
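The clone criterion and its limitation can be illustrated with a small sketch. It uses neither CCFinder nor Ninka: `normalize` is a toy stand-in with a tiny hard-coded keyword set, the license labels are supplied by hand, and all function names and sample files are hypothetical. It also assumes that a group counts as inconsistent when the license labels of its members differ.

```python
import hashlib
import re
from collections import defaultdict

# Tiny keyword set for the toy normalizer; a real tool knows the full grammar.
KEYWORDS = {"int", "return", "if", "else", "for", "while", "void", "char"}

COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(source):
    """Drop comments and whitespace; map user-defined identifiers to '$id'."""
    code = COMMENT_RE.sub(" ", source)
    tokens = []
    for tok in TOKEN_RE.findall(code):
        if re.match(r"[A-Za-z_]", tok) and tok not in KEYWORDS:
            tok = "$id"
        tokens.append(tok)
    return tokens


def inconsistency_groups(files):
    """files: {path: (source_text, license_label)}.

    Returns groups of two or more files with identical normalized token
    sequences whose license labels are not all the same.
    """
    by_hash = defaultdict(list)
    for path, (source, license_label) in files.items():
        digest = hashlib.sha1(" ".join(normalize(source)).encode()).hexdigest()
        by_hash[digest].append((path, license_label))
    return [
        sorted(members)
        for members in by_hash.values()
        if len(members) >= 2 and len({lic for _, lic in members}) > 1
    ]


# Hypothetical sample files with hand-assigned license labels.
files = {
    "a/add.c": ("/* GPL-2.0 header */\nint add(int x, int y) { return x + y; }", "GPL-2.0"),
    "b/sum.c": ("/* Apache-2.0 header */\nint sum(int a, int b) { return a + b; }", "Apache-2.0"),
    "c/add.c": ("int add(int x, int y) { int r = x + y; return r; }", "GPL-2.0"),
}
print(inconsistency_groups(files))
```

In this toy data the first two files fall into one group even though their identifiers and header comments differ, whereas the third file, which adds a statement, produces a different token sequence and is not grouped with them.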