1 Introduction
Android, JavaScript, and .NET. Android is one of the largest and most successful software ecosystems, with substantial software reuse (Mojica et al. 2014; Li et al. 2016; Sattler et al. 2018; Berger et al. 2014). The JavaScript ecosystem distributes its packages through npm, which is by far the largest package manager with over 1.82M package distributions.[1] The .NET ecosystem has a package management system, nuget, that is moderately large with over 261K packages.[1] As such, our three selected ecosystems vary in their nature (apps versus packages), their programming languages (Java, JavaScript, and C#), and their sizes (in terms of their distribution platforms).
git merge/rebase that is used in 33 % of Android mainline-fork pairs, 11 % of JavaScript pairs, and 18 % of .NET pairs. We find that cherry picking is less frequently used, with only 9 %, 0.9 %, and 2.5 % of Android, JavaScript, and .NET pairs using it, respectively. Among the three pull request integration mechanisms we studied (merge, rebase, and squash), the most used mechanism is the merge option in the direction fork → mainline, which 2.4 %, 7 %, and 11 % of the pairs in Android, JavaScript, and .NET use, respectively. We find that integrating commits using squashed or rebased pull requests is rare in all three ecosystems. Overall, we find that when code propagation occurs, fork developers seem to perform it directly through git and outside of GitHub’s built-in pull request mechanism. This observation implies that relying on pull requests alone is not enough to understand code propagation practices in divergent forks.
-
We propose leveraging the main distribution platforms of three ecosystems to precisely identify divergent forks. We devise a technique for identifying families in these ecosystems by using data from both GitHub and the respective distribution platform.
-
In contrast to previous studies on code propagation strategies that either focused only on pull requests or on directly comparing commit IDs, we are the first to study code propagation while considering pull requests with the options of squash / rebase as well as git rebased and cherry-picked commits.
-
We analyze the prevalence of code propagation within software families as well as the types of propagation strategies used.
-
We synthesize implications of our results for code reuse tools.
-
We provide an online appendix (2020) containing our datasets, intermediate results, and the scripts to trace code propagation between any mainline-fork pair.
variant ownership as well as more illustrative graph comparisons, we discuss the characteristics of the mainline–fork pairs across all three ecosystems.
2 Background on Code Propagation Strategies
Metadata changed | PR: Merge | PR: Squash | PR: Rebase | Git: Cherry-pick | Git: Merge | Git: Rebase |
---|---|---|---|---|---|---|
Commit ID | No | Yes | Yes | Yes | No | No |
Author Name | No | Yes | No | No | No | No |
Author Date | No | Yes | No | No | No | No |
Committer Name | No | Yes/No | Yes/No | Yes/No | No | No |
Committer Date | No | Yes | Yes | Yes | No | No |
Commit Message | No | Yes | No | No | No | No |
File details | No | No | No | No | No | No |
2.1 Propagation with GitHub Facilities
-
Merge pull request commits is the default. When the developer chooses this option, the commit history in the destination branch will be retained exactly as it is. As can be seen from Table 1, the metadata of the integrated commits from the source branch remain unchanged in the destination branch. However, a new merge commit will be created in the destination branch to “tie together” the histories of both branches (GitHub 2020).
-
Rebase and merge pull request commits: When the integrator selects the Rebase and merge option on a pull request on GitHub, all commits from the source branch are replayed onto the destination branch and integrated without a merge commit. From Table 1, we can see that using this integration technique, the commit metadata between source and destination preserves the author name, author date, and commit message but alters the commit ID, committer name, and committer date. The committer name becomes the name of the developer from the destination repository who rebased and merged the pull request. Note that if the developer who submitted the pull request is coincidentally the same as the developer who integrates it (e.g., because the developer works on both repositories), then the committer name will remain the same (GitHub 2020).
-
Squash and merge pull request commits: When the integrator selects the Squash and merge option on a pull request on GitHub, the pull request’s commits are squashed into a single commit. Instead of seeing all of a contributor’s commits from the source branch, the commits are squashed into one commit and included in the commit history of the destination branch. Apart from the file details, all other commit meta data changes. The committer name changes unless, similar to above, the original committer and the developer merging the pull request are the same (GitHub 2020).
2.2 Propagation with Git Facilities (Cherry Pick, Merge, and Rebase Commits)
-
Git cherry-pick commits: Cherry picking is the act of picking a commit from one branch and integrating it into another branch. Commit cherry picking can, for example, be useful if a mainline developer creates a commit to patch a pre-existing bug. If the fork developer cares only about this bug patch and not other changes in the mainline, then they can cherry pick this single commit and integrate it into their fork. As shown in Table 1, the author name, author date, commit message, and file details of the cherry picked commit remain the same in the destination branch. The commit ID, committer name, and committer date however do change. Note that the committer name may remain the same if the integrator is the same developer who performed the original commit in the source branch.
-
Git merge commits: Like in the pull request merge, git merge also preserves all the commit metadata and creates an extraneous new merge commit in the destination branch that ties together the histories of both branches.
-
Git rebase commits: Rebasing is the act of moving commits from their current location (following an older commit) to a new head (newest commit) of their branch (Chacon and Straub 2014b). Git rebase deviates slightly from rebasing pull requests on GitHub as it does not change the committer information. To better understand git rebase, let us explain it with an illustration based on the experiments we carried out. On the left-hand side of Fig. 1, we have a mainline repository and a fork repository where each repository made updates to the code through commits C3 and C4 in the mainline and commits F1 and F2 in the fork. The fork developer observes that the new updates in the mainline are interesting and decides to integrate them using rebasing. After rebasing, the commit history will look like the right-hand side of Fig. 1. Notice that the IDs and the order of the integrated commits C3 and C4 in the fork branch are unchanged. However, the IDs of commits F1 and F2 change to F1’ and F2’. In this case, Git rebase is like the fork developer saying “Hey, I know I started this branch last week, but other people made changes in the meantime. I don’t want to deal with their changes coming after mine and maybe conflicting, so can you pretend that I made [my changes] today?” (Vandehey 2019).
-
Other Git commands that rewrite commit history: Git has a number of other tools that rewrite commit history, including changing commit messages, commit order, or splitting commits (Chacon and Straub 2014a). These commands include git commit --amend, git rebase -i HEAD~N, and git --squash, etc. Most of these commands significantly change the history and the metadata of commits. If the integrator uses any of these commands in the destination repository, then there is no straightforward way to match the integrated commits across the two repositories (Chacon and Straub 2014a).
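To make the consequences of Table 1 concrete, the following is a minimal sketch that transcribes the table into a lookup structure and derives, per strategy, which metadata fields can still be relied on for matching commits across repositories. The field and strategy names are our own labels for illustration, not identifiers from the study's scripts.

```python
# Transcription of Table 1. True = metadata changed, False = unchanged,
# None = "Yes/No" (depends on whether integrator and original committer coincide).
# Field and strategy names are illustrative labels, not the study's identifiers.
FIELDS = [
    "commit_id", "author_name", "author_date",
    "committer_name", "committer_date", "commit_message", "file_details",
]

CHANGED = {
    "pr_merge":        [False, False, False, False, False, False, False],
    "pr_squash":       [True,  True,  True,  None,  True,  True,  False],
    "pr_rebase":       [True,  False, False, None,  True,  False, False],
    "git_cherry_pick": [True,  False, False, None,  True,  False, False],
    "git_merge":       [False, False, False, False, False, False, False],
    "git_rebase":      [False, False, False, False, False, False, False],
}


def stable_fields(strategy):
    """Return the fields that Table 1 marks as never changed by `strategy`."""
    return [f for f, changed in zip(FIELDS, CHANGED[strategy]) if changed is False]


if __name__ == "__main__":
    for name in CHANGED:
        print(name, "->", stable_fields(name))
```

Read this way, the table shows why ID comparison suffices for merges, why rebased pull requests and cherry-picks need author and message matching, and why squashed pull requests leave only the file details to compare.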
3 Methodology
3.1 Identifying Software Families
3.1.1 Identifying Android Families
AndroidManifest.xml). Such manifest files also declare the app’s components, necessary permissions, and required hardware and Android version. As such, each Android app in a software family must have a unique package name, which excludes any forked repositories where the package name was not modified. More specifically, we identify Android families using a relatively conservative filtering approach as follows.
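As an illustration of the package-name lookup that this filtering relies on, the sketch below reads the package attribute from the root element of a manifest and groups repositories that declare the same name. It assumes the manifest contents are already available as strings; the function names and the repository-to-manifest mapping are hypothetical and not taken from the study's scripts.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def package_name(manifest_xml):
    """Return the `package` attribute of the <manifest> root, or None."""
    try:
        root = ET.fromstring(manifest_xml)
    except ET.ParseError:
        return None  # unreadable manifest: treat as "no package name"
    if root.tag != "manifest":
        return None
    return root.get("package")


def group_by_package(manifests):
    """Map each package name to the repositories declaring it.

    `manifests` is a hypothetical dict: repository name -> manifest XML text.
    """
    groups = defaultdict(list)
    for repo, xml_text in manifests.items():
        name = package_name(xml_text)
        if name is not None:
            groups[name].append(repo)
    return dict(groups)
```

Groups with more than one repository correspond to the duplicate package names that still need the manual inspection described in the filtering steps.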
AndroidManifest.xml file; (6) has a description or readme.md file; and (7) has a number of forks ≥ 2 to reduce the chance of finding student assignments (Munaiah et al. 2017).
AndroidManifest.xml file, we extract the app’s package name and check its existence on Google Play. In total, we find 7,423 mainline repositories representing an actual Google Play app (Businge et al. 2017).
AndroidManifest.xml files with the same package name. Such duplicates easily arise when an app’s source code is copied without forking. Since package names are unique on Google Play, only one of these duplicate repositories can actually correspond to the Google Play app. We manually select one repository from these duplicates by considering repository popularity (number of forks and stars on GitHub), repository and app descriptions on both GitHub and Google Play, as well as the developer name on GitHub and Google Play. In some cases, the Google Play app description conveniently linked to the GitHub repository. As a result of this step, we discard 1,232 repositories and are left with 6,191 mainline repositories.
AndroidManifest.xml from another app without changing the package name. This practice results in the forked app’s package name pointing to an app that exists on Google Play, but that is not the one hosted in the GitHub repository. We inspect the Readme.md and unique commit messages in the GitHub repository and the respective Google Play description page. Eliminating all mismatched apps leaves a total of 38 app families comprising 54 forked apps, which form our final dataset to answer the research questions.
3.1.2 Identifying JavaScript and .NET Families
npm or nuget is similar. On both package managers, a package’s metadata include: source repository of the package (GitHub, GitLab, BitBucket), number of dependent projects/packages, number of dependencies, number of package releases, and the package contributors. Fortunately, most of the data of 37 package managers for different ecosystems can be found in one central location, https://libraries.io/, a platform that periodically collects data from different package managers. In addition to the metadata for a specific package on a given package manager, https://libraries.io/ also extends the package metadata with more information from GitHub. For example, it stores a Forkboolean field, which indicates whether the corresponding repository of a package is a fork. Such a field can help us identify forked repositories that have published their packages. Note that this differs from the Android ecosystem, where such explicit traceability does not exist, which is why we first mine repositories from GitHub and then keep only those that are published on Google Play. In contrast, with .NET and JavaScript, we mine the families directly from https://libraries.io/. We extract the families from the latest https://libraries.io/ data dump release 1.6.0 that was released on January 12, 2020. The meta-model for the data in the Libraries.io data dump can be found online.[7] We extract .NET and JavaScript families from https://libraries.io/ with the following steps:
Platform, we select the packages that are distributed on the nuget and npm package managers.
Forkboolean to identify repositories that are forks, and use the field Fork Source Name with Owner to identify the fork repository name as well as the parent repository (mainline). We extract all fork repositories that map to published packages on nuget and npm.
3.2 Identifying Family Characteristics (RQ1)
3.2.1 General Characteristics
3.2.2 Identifying Maintenance Activities (JavaScript & .NET only)
imaeses / k-9 [8] has releases. However, when we access the fork using the GitHub API for a list of releases [9], it returns an empty list. We therefore decided not to collect package releases for the variants in the Android ecosystem.
3.2.3 Identifying Variant Ownership Characteristics
bot) who merged a pull request in both repositories. This means that our ownership criterion relies on each variant having merged at least one pull request. Since we have very few variant pairs in the Android ecosystem, applying this criterion would further reduce an already very small dataset. We therefore apply the described method only to the variants of the .NET and JavaScript ecosystems, which have moderately large to very large datasets of variant pairs, and use a different criterion, explained later, to identify the owners of Android variants. Since all Android variants are published on Google Play, each variant has an owner. We identify only 89 of the 590 mainline–fork pairs in the .NET ecosystem where both the mainline and fork variant had any merged PR by a real developer. For the JavaScript ecosystem, we identify only 89 of the 10,357 mainline–fork pairs where both the mainline and fork variant had any merged PR by a real developer.
developer id or dev id, which is the name of the developer/company (owner) that uploads the variant and its updates on the marketplace.
3.2.4 Identifying Variant Popularity
-
Android variants: For the variants in the Android ecosystem, we define two popularity metrics for the number of downloads on Google Play, DownloadsMLV and DownloadsFV, for the mainline and divergent fork, respectively. We also define two popularity metrics for the number of reviews on Google Play, ReviewsMLV and ReviewsFV, for the mainline and divergent fork, respectively.
-
JavaScript and .NET variants: For variants in these two ecosystems, we record the number of other packages in the JavaScript and .NET ecosystems that depend on the mainline and the fork variants (DependentPackagesMLV and DependentPackagesFV, respectively). We also record the number of other projects on GitHub that depend on the mainline and fork variants (DependentProjectsMLV and DependentProjectsFV, respectively). All of a variant’s dependent packages / projects are extracted from https://libraries.io/. The package and project dependents are a good measure of popularity since they indicate which other packages / projects are interested in the functionality provided by the variant.
3.3 Identifying Code Propagation (RQ2)
(mainline variant, fork variant) pair in a family, we first identify common commits and then identify unique commits, as follows.
3.3.1 Identifying Common Commits
-
Inherited commits: The fork date is the point in time at which the fork variant is created. At that point, all commits in the fork are the same as those in the mainline, and we refer to them as InheritedCommits. In Fig. 3, the InheritedCommits are the purple commits 1, 2, and 3. To extract these commits for either variant, we collect all the commits from the first commit in the history until the fork date.
-
Pull-Request commits: We first collect the merged pull requests in each repository and identify the pull requests whose source and destination branches belong to the analyzed repository pair. The GitHub API :owner/:repo/pulls/:pull_number provides all the information of a given pull request. One can identify the source and destination branches using the pull request objects ['head']['repo']['full_name'] and ['base']['repo']['full_name'] from the returned json response, respectively. Based on the source and destination information, we can always identify the direction of the pull request as fork → mainline or mainline → fork, as shown in Fig. 3. For each pull request, we collect the pull request commits pr_commits using the GitHub API :owner/:repo/pulls/:pull_number/commits. Regardless of how a pull request gets integrated, the commit information in the source repository is always identical to that in pr_commits. Thus, we can always identify the pull request commits in the source repository by comparing the IDs of the commits in pr_commits to those in the history of the source repository. The tricky part is identifying the integrated commits in the destination repository. Based on the information discussed in Section 2 and summarized in Table 1, we can identify the pull request commits in the destination repository as follows:
-
Merged pull request commits: Based on Table 1, the commit IDs of pull request commits integrated using the default merge option do not change. Thus, to identify these commits, we simply compare the IDs of the pr_commits to those in the commit history of the destination repository.
-
Rebased pull request commits: Recall from Table 1 that integrated commits from a rebased pull request have different commit IDs on the destination branch. Thus, we identify the rebased commits in the destination branch by comparing the remaining unchanged commit metadata, such as author name, author date, commit message, and file details.
-
Squashed pull request commits: As part of a squashed pull request’s metadata, GitHub records the ID of the squashed commit on the destination branch in the merge_commit_sha attribute.[10] Using this ID, we can identify the exact squashed commit in the destination repository. For extra verification, we also compare the changed files of all commits in the pull request with the changed files in the identified squashed commit.
-
Git merged commits: After identifying all commits related to pull requests, we now analyze any remaining unmatched commits to identify if they might have been propagated directly through Git commands. Recall from Section 2 that this includes merged, rebased, and cherry-picked commits.
-
Git cherry-picked commits: We locate cherry-picked commits in the source and destination commit histories by comparing the following commit metadata: commit ID, author name, author date, commit message, filenames, and file changes. We can also identify the source and the destination branches of the cherry-picked commits by looking at the committer dates of the matched commits. We mark the commit with the earlier committer date as coming from the source branch and the one with the later date as being in the destination branch.
-
Git merged and Git rebased commits: At this point, we have already identified all integrated pull request commits as well as cherry-picked commits. Thus, any remaining commits that have the same ID in the histories of both variants must have been propagated through git merge or git rebase. As shown in Table 1 and Fig. 1, any commits integrated through git rebase have exactly the same ID and metadata in both the source and destination branch. Similarly, commits integrated through git merge also have exactly the same information. While we could differentiate git-merged and git-rebased commits by finding merge commits (those with two parents) and marking any commits between the merge commit and the common ancestor as integrated through git merge, this differentiation is not important for our purposes. We are only interested in marking both types of commits as propagated commits, so we identify commits integrated via Git rebase or Git merge but do not differentiate between them. Similar to pull requests, both types of commits may be pulled from either branch into the other. However, unlike pull requests, it is not possible to identify which variant the propagated commit originated from. This is because of the nature of distributed version-control systems, where commits can be in multiple repositories but there is no central record identifying the commits’ origin. Since it is common for commits to be pulled from the mainline and pushed into the fork repository as the fork tries to keep in sync with new changes in the mainline, we assume that all commits that we identify as integrated through git merge or git rebase are pulled from the mainline variant and pushed into the fork variant.
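The matching rules above can be sketched as follows. This is a simplified illustration rather than the study's scripts: it assumes commit histories are already loaded into memory, that inherited and pull-request commits have been set aside beforehand, and that timestamps are ISO 8601 strings in one common format so that string comparison orders them correctly. The Commit class and both function names are hypothetical; only the ['head']['repo']['full_name'] and ['base']['repo']['full_name'] lookups come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    sha: str
    author_name: str
    author_date: str      # ISO 8601, same format for all commits (assumption)
    committer_date: str   # ISO 8601, same format for all commits (assumption)
    message: str
    files: frozenset      # names of the changed files


def pr_direction(pr, mainline_name, fork_name):
    """Classify a pull request JSON object by its source and destination."""
    head = pr["head"]["repo"]["full_name"]  # source repository
    base = pr["base"]["repo"]["full_name"]  # destination repository
    if (head, base) == (fork_name, mainline_name):
        return "fork->mainline"
    if (head, base) == (mainline_name, fork_name):
        return "mainline->fork"
    return None  # the pull request does not connect the analyzed pair


def metadata_key(c):
    """Metadata that Table 1 reports as unchanged by cherry-pick and PR rebase."""
    return (c.author_name, c.author_date, c.message, c.files)


def match_commits(mainline, fork):
    """Return (same_id, rewritten) for two commit histories.

    same_id:   shas present in both histories (git merge / git rebase candidates).
    rewritten: (source_sha, destination_sha) pairs whose IDs differ but whose
               stable metadata match (cherry-pick candidates); the commit with
               the earlier committer date is taken as the source.
    """
    shared = {c.sha for c in mainline} & {c.sha for c in fork}
    same_id = sorted(shared)

    main_by_meta = {metadata_key(c): c for c in mainline if c.sha not in shared}
    rewritten = []
    for f in fork:
        if f.sha in shared:
            continue
        m = main_by_meta.get(metadata_key(f))
        if m is None:
            continue  # no counterpart: a candidate unique commit
        src, dst = (m, f) if m.committer_date <= f.committer_date else (f, m)
        rewritten.append((src.sha, dst.sha))
    return same_id, rewritten
```

Keying the lookup on the unchanged metadata keeps the matching linear in the number of commits; commits that fall into neither result are the candidates for the unique-commit analysis.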
3.3.2 Identifying Unique Commits
compare GitHub API.[11] The compare GitHub API compares the mainline branch and the fork branch and, as one of the returned items, reports the diverged commits: the number of commits a given branch (say, the mainline branch) is ahead of the other branch (the fork branch), as well as the number of commits it is behind the other. The commits by which the mainline branch is ahead of the fork branch are the commits unique to the mainline, while the commits by which the mainline is behind the fork are the commits unique to the fork.
3.3.3 Verifying our Commit Categorization Methods
-
(dashevo / dash-wallet, sambarboza / dash-wallet): The repository sambarboza / dash-wallet is a social fork. The mainline dashevo / dash-wallet has a total of 445 PRs. Our scripts identify that 74 of these 445 pull requests were integrated from the fork repository sambarboza / dash-wallet into the mainline repository dashevo / dash-wallet. We show the details of these 74 PRs in Table 2. Our technique identified that 3 of the 74 PRs were integrated using the PR merge option (with a total of 13 commits), 43 were integrated using the PR squash option (194 commits), 2 used the PR rebase option (6 commits), and the integration option of the remaining 26 PRs was unclassified (167 commits). We identified a total of 405 commits that were integrated using the git merge / rebase integration option, and no commit was integrated using the git cherry-pick option.
-
(flagbug / YoutubeExtractor, Kimmax / SYMMExtractor): The repository Kimmax / SYMMExtractor is a variant fork. The mainline flagbug / YoutubeExtractor has a total of 32 pull requests. Our scripts identify that 2 of the 32 PRs were integrated from the fork repository Kimmax / SYMMExtractor into the mainline repository flagbug / YoutubeExtractor (see details in Table 2). The two PRs were integrated using the merge PR option, with a total of two integrated commits. We also identified a total of three commits that were integrated using the git merge / rebase integration option and one commit that was integrated using the git cherry-pick option.
-
(TerriaJS / terriajs, bioretics / rer3d-terriajs): The repository bioretics / rer3d-terriajs is a variant fork. The fork bioretics / rer3d-terriajs has a total of 10 pull requests. Our scripts identify that 9 of the 10 pull requests were integrated from the mainline TerriaJS / terriajs into the fork bioretics / rer3d-terriajs. The 9 PRs had a total of 101 commits. No commits were integrated using the PR squash or PR rebase options. A total of 1,825 commits were integrated using the git merge / rebase integration option, and only 10 commits were integrated using the git cherry-pick option.
Ecosystem | Mainline–fork pair | Mechanism | Technique | # PRs | # Commits |
---|---|---|---|---|---|
Android | dashevo / dash-wallet (D), sambarboza / dash-wallet (S) | PR | Merged | 3 | 13 |
 | | PR | Squashed | 43 | 194 |
 | | PR | Rebased | 2 | 6 |
 | | PR | Unclassified | 26 | 167 |
 | | Git | Merge/rebase | | 405 |
 | | Git | Cherry-pick | | 0 |
 | | | Total | 74 | 785 |
.NET | flagbug / YoutubeExtractor (D), Kimmax / SYMMExtractor (S) | PR | Merged | 2 | 2 |
 | | PR | Squashed | 0 | 0 |
 | | PR | Rebased | 0 | 0 |
 | | PR | Unclassified | 0 | 0 |
 | | Git | Merge/rebase | | 3 |
 | | Git | Cherry-pick | | 1 |
 | | | Total | 2 | 6 |
JavaScript | TerriaJS / terriajs (S), bioretics / rer3d-terriajs (D) | PR | Merged | 9 | 101 |
 | | PR | Squashed | 0 | 0 |
 | | PR | Rebased | 0 | 0 |
 | | PR | Unclassified | 0 | 0 |
 | | Git | Merge/rebase | | 1,825 |
 | | Git | Cherry-pick | | 10 |
 | | | Total | 9 | 1,936 |
[getodk / collect (D), lognaturel / collect (S)] (lognaturel / collect is a social fork), our script reveals that the commits in the pull requests numbered 3531, 3462, and 3434 were integrated using merging, squashing, and rebasing, respectively. We manually verify that these pull requests have in fact been integrated using these techniques by looking at their commit metadata. Similarly, for the pair [dashevo / dash-wallet (D), sambarboza / dash-wallet (S)] (sambarboza / dash-wallet is a social fork), we verify that the commits in the pull requests numbered 421, 333, and 114 were integrated using merging, squashing, and rebasing, respectively. We also look at the results returned for integration outside GitHub (git merge/rebase and git cherry-pick). For example, our results indicate that the pair [FredJul/Flym (D), Etuldan/spaRSS (S)] (Etuldan/spaRSS is a variant fork) has no commits integrated using pull requests but has 34 and five commits integrated using git merge/rebase and git cherry-picking, respectively. We manually verify these five latter commits and confirm their correctness.
dashevo / dash-wallet, sambarboza / dash-wallet from Table 2 shows, there were some pull requests that our scripts were not able to classify. As part of our manual verification, we find that the GitHub API indicates that they are integrated into the destination repository since their merge_date is not null. On deeper investigation, we discover that all the unclassified pull request commits were integrated into a branch other than the master branch. For example, pull requests 514 and 512 from the fork sambarboza / dash-wallet were both integrated in the branch evonet-develop on the mainline repository. We also observed that both pull requests had an integration build test failure (Travis CI). This explains why the commits are missing in the history of the master branch and why our scripts could not classify those integrated commits.
3.3.4 Fork Variability Percentage
4 Variant Family Characteristics (RQ1)
Metric | Mean | Min | Median | Max | Description |
---|---|---|---|---|---|
FamilySize | | | | | |
Android apps | 2.4 | 2 | 2 | 7 | Number of variants in an Android family |
.NET apps | 2.1 | 2 | 2 | 7 | Number of variants in a .NET family |
JavaScript apps | 2.2 | 2 | 2 | 16 | Number of variants in a JavaScript family |
App Dependencies (.NET & JavaScript) | | | | | |
PackageDependenciesMLV | 40.4 | 0 | 26 | 140 | Number of package dependencies of the mainline variant on Android |
 | 2.3 | 0 | 1 | 49 | Number of package dependencies of the mainline variant on .NET |
 | 11.8 | 0 | 7 | 267 | Number of package dependencies of the mainline variant on JavaScript |
PackageDependenciesFV | 22 | 0 | 22 | 81 | Number of package dependencies of the fork variant on Android |
 | 2.0 | 0 | 1 | 25 | Number of package dependencies of the fork variant on .NET |
 | 9.8 | 0 | 6 | 605 | Number of package dependencies of the fork variant on JavaScript |
App Popularity (Android) | | | | | |
DownloadsMLV | 2,211K | 1 | 50K | 100M | Number of downloads of the mainline variant from Google Play |
DownloadsFV | 5,479K | 5 | 1K | 100K | Number of downloads of the fork variant from Google Play |
ReviewsMLV | 27K | 0 | 547 | 631K | Number of reviews of the mainline variant on Google Play |
ReviewsFV | 2.8K | 0 | 45 | 161K | Number of reviews of the fork variant on Google Play |
App Popularity (.NET & JavaScript) | | | | | |
DependentPackagesMLV | 106 | 0 | 0 | 27K | Number of packages that depend on the mainline app on .NET |
 | 80 | 0 | 2 | 26K | Number of packages that depend on the mainline app on JavaScript |
DependentPackagesFV | 0.4 | 0 | 0 | 19 | Number of .NET packages that depend on the fork app on .NET |
 | 1.7 | 0 | 0 | 2K | Number of JavaScript packages that depend on the fork app on JavaScript |
DependentProjectsMLV | 133 | 0 | 0 | 33K | Number of .NET projects that depend on the mainline app on GitHub |
 | 140 | 0 | 0 | 83K | Number of JavaScript projects that depend on the mainline app on GitHub |
DependentProjectsFV | 0.5 | 0 | 0 | 82 | Number of .NET projects that depend on the fork app on GitHub |
 | 2 | 0 | 0 | 5K | Number of JavaScript projects that depend on the fork app on GitHub |
App Maintenance (.NET & JavaScript) | | | | | |
PackageReleasesMLV | 14.6 | 1 | 2 | 188 | Number of package releases of the mainline variant on .NET |
 | 15 | 1 | 8 | 1117 | Number of package releases of the mainline variant on JavaScript |
PackageReleasesFV | 3.6 | 1 | 2 | 54 | Number of package releases of the fork variant on .NET |
 | 4 | 1 | 2 | 341 | Number of package releases of the fork variant on JavaScript |
4.1 General Variant Characteristics
-
Variant Family FamilySize. Figure 4 shows the number of variants (i.e., family size) in each of the variant families of the three ecosystems we studied. We can see that the distributions of family sizes for all three ecosystems are right-skewed, with most families having two members. Specifically, 28 (73%) of 38 software families in Android, 7,731 (87%) of 8,837 software families in JavaScript, and 475 (90%) of 526 software families in .NET have only two variants. The three distributions also show that larger families are rare in all three ecosystems, but that the largest family sizes we observe are part of the JavaScript ecosystem. When identifying variant families from the different ecosystems, we observe that although Android is considered one of the largest known ecosystems (Mojica et al. 2014; Li et al. 2016; Sattler et al. 2018), identifying its variant families is rather difficult compared to the software packaging ecosystems (JavaScript and .NET) we studied. In the Android ecosystem, it is not compulsory to record any source repository of an Android variant on Google Play. We therefore went through the lengthy process described in Section 3.1.1, applying a number of heuristics on GitHub repositories to identify families.
-
Variant Package Dependencies: In Fig. 5, we present scatter plots of mainline dependencies versus fork dependencies. Figures 5a to c show the number of fork variant package dependencies (y-axis) versus the number of mainline variant package dependencies (x-axis) for Android, .NET, and JavaScript variants, respectively. A point in any of the scatter plots represents the number of package dependencies of a given fork variant (y-axis) and the number of package dependencies of the counterpart mainline variant (x-axis). In all scatter plots, it is not surprising that the number of package dependencies for a fork and its corresponding mainline are correlated. This confirms that fork variants inherit the original dependencies of the mainline. However, we also observe points in all the scatter plots where one variant has more dependencies than the other. This means that the variant with more package dependencies has functionality that is not included in the counterpart variant. Although the observation is more prominent for the mainline variant, since we see many points below the diagonal lines (the forks do not keep in sync with the mainline), it is interesting that we also have some fork variants with more dependencies. Follow-up studies could investigate which new functionalities related to the used dependencies are being introduced in the variants, and why.
-
Android variant categories: Figure 6 shows the distribution of variants in the different categories on Google Play. We can see that 12 of the 54 forks (22%) are listed in a different category from the mainline, which suggests that these variants serve different purposes. However, the majority of pairs include variants in the same category.
4.2 Variant Maintenance Activity (JavaScript & .NET)
4.3 Variant Ownership Characteristics
4.4 Variant Popularity Characteristics
-
Android variants: Figure 9a shows the variant downloads distribution for both the mainline and fork variants, where each point on the x-axis represents a pair and we sort the pairs by the number of mainline downloads. We observe that the majority of the mainline variants are quite popular: 27 of the 38 mainline variants (71%) have ≥ 10K downloads. For fork variant popularity in terms of downloads, we observe that 10 of the 54 fork variants (19%) have ≥ 10K downloads. We believe it is natural that the mainline variants are more popular than their fork counterparts, since we assume they have been released first on Google Play.[12] Figure 9b shows the variant reviews distribution for both the mainline and fork variants, where each point on the x-axis represents a pair and we sort the pairs by the number of mainline reviews. We observe a distribution for the number of reviews similar to that observed for the number of downloads. This is not surprising since previous studies have found downloads and reviews to be correlated (Businge et al. 2019). Overall, the variant popularity we observe gives us confidence that our data set consists of real variants.
-
JavaScript and .NET variants: In Figs. 9c–f we present the popularity graphs for the variants in the two ecosystems of .NET and JavaScript. Figure 9c shows the distribution of dependent packages for both the mainline and fork variants, where each point on the x-axis represents a pair and the pairs are sorted by the number of mainline dependent packages. We observe that the majority of mainline variants are quite popular: 6,157 of the 10,357 mainline variants (59 %) have at least two dependent packages. For fork variants, we observe that 1,624 of the 10,357 fork variants (16 %) have at least two dependent packages. Figure 9d shows the distribution of dependent projects for both the mainline and fork variants in the JavaScript ecosystem. Each point on the x-axis represents a pair and the pairs are sorted by the number of mainline dependent projects. The distribution of the number of dependent projects is similar to that of the number of dependent packages. The remaining two graphs, Figs. 9e and f, show the same data for the .NET ecosystem, and both show similar trends to those observed for JavaScript.
mainline and fork we use the package_names of the variants since repository names on GitHub were too long. In both tables, we present two interesting examples of variant pairs that we randomly picked: (1) Abandoned mainlines: in the first variant pair in each of the ecosystems, the fork variant is more popular than the mainline. When we compared the last release dates of the variants in all the ecosystems, we observed that the mainlines seem to have been abandoned while the fork variants continued to evolve, which would explain why the fork variants are more popular. In Table 5 we can also see that these fork variants have more releases than the mainlines. (2) Co-evolution: the second pair in each of the ecosystems is another interesting case, in which both the mainline and the fork variant are continuously being maintained and both are popular. In these cases, it would be interesting to study the co-evolution of the variants in both technical and social aspects. Technical: for example, are the variants complementary or competing? Social: what can we learn about the variant communities?
Mainline | Fork | Mainline downloads | Fork downloads | Mainline reviews | Fork reviews |
---|---|---|---|---|---|
TobyRich / app-smartplane-android | TailorToys / app-powerup-android | 10K | 100K | 106 | 1,034 |
opendatakit / collect | kobotoolbox / collect | 1,000K | 100K | 3,049 | 1,527 |

Ecosystem | Mainline | Fork | Mainline dependent packages | Fork dependent packages | Mainline package releases | Fork package releases |
---|---|---|---|---|---|---|
.NET | Flurl.Signed | Flurl.Http.Signed | 3 | 10 | 6 | 10 |
.NET | Ninject | Portable.Ninject | 638 | 19 | 75 | 14 |
JS | selenium | selenium-server | 97 | 2,046 | 2 | 51 |
JS | gulp-istanbul | gulp-babel-istanbul | 5,867 | 11 | 24 | 14 |
5 Code Propagation in the Software Families (RQ2)
git merge and git rebase commits and that we assume that all integrated git merge and git rebase commits are in the direction mainline→ fork. This is why Tables 7 and 8 show only one metric, gitPullMLV-FV, to represent these two commit integration types. Tables 6–9 show the summary of the descriptive statistics of all the metrics we use to investigate code propagation at the commit level for the three ecosystems of Android, JavaScript, and .NET.
Metric | Mean | Min | Median | Max | Description |
---|---|---|---|---|---|
Android variants | |||||
mergedPRsMLV-FV | 0.31 | 0 | 0 | 15 | Number of merged PRs from the mainline to the fork variant. |
mergedPRsFV-MLV | 0.09 | 0 | 0 | 4 | Number of merged PRs from the fork to the mainline variant. |
prMergedCommitsMLV-FV | 8.33 | 0 | 0 | 427 | Number of merged PR commits from the mainline to the fork variant. |
prMergedCommitsFV-MLV | 0.57 | 0 | 0 | 28 | Number of merged PR commits from the fork to the mainline variant. |
prSquashedMLV-FV | 0 | 0 | 0 | 0 | Number of squashed PRs from the mainline to the fork variant. |
prSquashedFV-MLV | 0 | 0 | 0 | 0 | Number of squashed PRs from the fork to the mainline variant. |
prRebasedMLV-FV | 0 | 0 | 0 | 0 | Number of rebased PRs from the mainline to the fork variant. |
prRebasedFV-MLV | 0 | 0 | 0 | 0 | Number of rebased PRs from the fork to the mainline variant. |
.NET variants | |||||
mergedPRsMLV-FV | 0 | 0 | 0 | 3 | Number of merged PRs from the mainline to the fork variant. |
mergedPRsFV-MLV | 0.2 | 0 | 0 | 13 | Number of merged PRs from the fork to the mainline variant. |
prMergedCommitsMLV-FV | 0.2 | 0 | 0 | 30 | Number of merged PR commits from the mainline to the fork variant. |
prMergedCommitsFV-MLV | 1.2 | 0 | 0 | 207 | Number of merged PR commits from the fork to the mainline variant. |
prSquashedMLV-FV | 0 | 0 | 0 | 0 | Number of squashed PRs from the mainline to the fork variant. |
prSquashedFV-MLV | 0 | 0 | 0 | 5 | Number of squashed PRs from the fork to the mainline variant. |
prSquashedCommitsFV-MLV | 0.1 | 0 | 0 | 14 | Number of squashed PR commits from the fork to the mainline variant. |
prRebasedMLV-FV | 0 | 0 | 0 | 0 | Number of rebased PRs from the mainline to the fork variant. |
prRebasedFV-MLV | 0 | 0 | 0 | 0 | Number of rebased PRs from the fork to the mainline variant. |
JavaScript variants | |||||
mergedPRsMLV-FV | 0 | 0 | 0 | 26 | Number of merged PRs from the mainline to the fork variant. |
mergedPRsFV-MLV | 0.4 | 0 | 0 | 4 | Number of merged PRs from the fork to the mainline variant. |
prMergedCommitsMLV-FV | 0.1 | 0 | 0 | 399 | Number of merged PR commits from the mainline to the fork variant. |
prMergedCommitsFV-MLV | 0.57 | 0 | 0 | 28 | Number of merged PR commits from the fork to the mainline variant. |
prSquashedMLV-FV | 0 | 0 | 0 | 2 | Number of squashed PRs from the mainline to the fork variant. |
prSquashedFV-MLV | 0 | 0 | 0 | 21 | Number of squashed PRs from the fork to the mainline variant. |
prSquashedCommitsMLV-FV | 0.4 | 0 | 0 | 52 | Number of squashed PR commits from the mainline to the fork variant. |
prSquashedCommitsFV-MLV | 0 | 0 | 0 | 109 | Number of squashed PR commits from the fork to the mainline variant. |
prRebasedMLV-FV | 0 | 0 | 0 | 2 | Number of rebased PRs from the mainline to the fork variant. |
prRebasedFV-MLV | 0 | 0 | 0 | 3 | Number of rebased PRs from the fork to the mainline variant. |
prRebasedCommitsMLV-FV | 0.4 | 0 | 0 | 4 | Number of rebased PR commits from the mainline to the fork variant. |
prRebasedCommitsFV-MLV | 0 | 0 | 0 | 25 | Number of rebased PR commits from the fork to the mainline variant. |

Mechanism | Type | Pairs (mainline→ fork) | PRs (mainline→ fork) | Commits (mainline→ fork) | Pairs (fork→ mainline) | PRs (fork→ mainline) | Commits (fork→ mainline) |
---|---|---|---|---|---|---|---|
Android variants | |||||||
PR | Merged | 1 | 1 | 5 | 1 | 2 | 427 |
PR | Rebased | 0 | 0 | 0 | 0 | 0 | 0 |
PR | Squashed | 0 | 0 | 0 | 0 | 0 | 0 |
PR | Unclassified | 0 | 0 | 0 | 0 | 0 | 0 |
Git | Cherry-pick | 5 | n/a | 250 | 4 | n/a | 136 |
Git | gitPullMLV-FV | 18 | n/a | 13,198 | n/a | n/a | n/a |
.NET variants | |||||||
PR | Merged | 9 | 13 | 96 | 67 | 139 | 721 |
PR | Rebased | 0 | 0 | 0 | 0 | 0 | 0 |
PR | Squashed | 0 | 0 | 0 | 13 | 21 | 72 |
PR | Unclassified | 0 | 0 | 0 | 3 | 3 | 9 |
Git | Cherry-pick | 15 | n/a | 99 | 16 | n/a | 138 |
Git | gitPullMLV-FV | 106 | n/a | 5,601 | n/a | n/a | n/a |
JavaScript variants | |||||||
PR | Merged | 99 | 162 | 1,862 | 724 | 1,394 | 4,523 |
PR | Rebased | 1 | 1 | 4 | 11 | 13 | 67 |
PR | Squashed | 5 | 6 | 72 | 132 | 250 | 1,048 |
PR | Unclassified | 7 | 10 | 33 | 23 | 32 | 134 |
Git | Cherry-pick | 95 | n/a | 275 | 91 | n/a | 251 |
Git | gitPullMLV-FV | 1,180 | n/a | 40,001 | n/a | n/a | n/a |

Metric | Mean | Min | Median | Max | Description |
---|---|---|---|---|---|
Android variants | |||||
gitCherrypickedMLV-FV | 4.6 | 0 | 0 | 168 | Number of git cherry-picked commits from the mainline to the fork variant. |
gitCherrypickedFV-MLV | 2.5 | 0 | 0 | 75 | Number of git cherry-picked commits from the fork to the mainline variant. |
gitPullMLV-FV | 244 | 0 | 0 | 6,567 | Number of git merged/rebased commits from the mainline to the fork variant. |
.NET variants | |||||
gitCherrypickedMLV-FV | 1.5 | 0 | 0 | 42 | Number of git cherry-picked commits from the mainline to the fork variant. |
gitCherrypickedFV-MLV | 0.4 | 0 | 0 | 148 | Number of git cherry-picked commits from the fork to the mainline variant. |
gitPullMLV-FV | 9.5 | 0 | 0 | 2,317 | Number of git merged/rebased commits from the mainline to the fork variant. |
JavaScript variants | |||||
gitCherrypickedMLV-FV | 4.6 | 0 | 0 | 168 | Number of git cherry-picked commits from the mainline to the fork variant. |
gitCherrypickedFV-MLV | 0 | 0 | 0 | 70 | Number of git cherry-picked commits from the fork to the mainline variant. |
gitPullMLV-FV | 3.7 | 0 | 0 | 6,035 | Number of git merged/rebased commits from the mainline to the fork variant. |
Metric | Mean | Min | Median | Max | Description |
---|---|---|---|---|---|
Android variants | |||||
uniqueMLV | 1,122 | 0 | 228 | 18,961 | Number of unique commits in the mainline variant in a given mainline–fork pair. |
uniqueFV | 98.3 | 1 | 16 | 1,646 | Number of unique commits in the fork variant in a given mainline–fork pair. |
InheritedCommits | 1,884 | 10 | 755 | 29,110 | Number of common commits between a given fork and the mainline variant. |
VariabilityPercentage | 15 | 0 | 2.7 | 93.8 | Percentage of unique commits according to (1). |
.NET variants | |||||
uniqueMLV | 102.2 | 0 | 3 | 10,789 | Number of unique commits in the mainline variant in a given mainline–fork pair. |
uniqueFV | 16.2 | 0 | 5 | 605 | Number of unique commits in the fork variant in a given mainline–fork pair. |
InheritedCommits | 224.5 | 0 | 42.1 | 20,538 | Number of common commits between a given fork and the mainline variant. |
VariabilityPercentage | 20 | 0 | 11 | 99 | Percentage of unique commits according to (1). |
JavaScript variants | |||||
uniqueMLV | 33.5 | 0 | 3 | 10,223 | Number of unique commits in the mainline variant in a given mainline–fork pair. |
uniqueFV | 12.8 | 0 | 5 | 1,229 | Number of unique commits in the fork variant in a given mainline–fork pair. |
InheritedCommits | 111.5 | 14 | 32 | 66,861 | Number of common commits between a given fork and the mainline variant. |
VariabilityPercentage | 22.3 | 0 | 14 | 99 | Percentage of unique commits according to (1). |
5.1 Pull Request Propagation (Commit Integration Inside GitHub)
merge pull request option, in the direction of mainline→ fork. In the same row, in the direction of fork→ mainline, we observe 1 mainline–fork pair that integrated 2 PRs, with a total of 427 commits, using the merge pull request option.
merge pull request option. We observe broadly similar trends for the mainline–fork variant pairs in the other two ecosystems. For the JavaScript mainline–fork variant pairs, we observe 99 of the 10,357 mainline–fork variant pairs (1 %) integrating commits, using the merge pull request option, in the direction of mainline→ fork and 724 of the 10,357 mainline–fork pairs (7 %) in the direction of fork→ mainline. We observe very few mainline–fork variant pairs in the JavaScript software packaging ecosystem integrating commits using the pull request squash/rebase options in either integration direction. For the mainline–fork variant pairs in the .NET ecosystem, we observe 9 of the 590 mainline–fork pairs (1.5 %) and 67 of the 590 mainline–fork pairs (11.3 %) integrating commits, using the merge pull request option, in the direction of mainline→ fork and fork→ mainline, respectively. We did not observe any commits integrated using the rebase pull request option in either integration direction, while for the commits integrated using the squash pull request option, we only observed integration in the direction of fork→ mainline, accounting for 13 of the 590 mainline–fork pairs (2 %).
merge pull request option is clearly the most frequently used in all integration directions and in all three ecosystems. In all three software packaging ecosystems, the squash and rebase options are rarely used. However, comparing the two PR options, squash and rebase, we observe that the squash PR option is used more often.
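The pair percentages quoted in this subsection follow directly from the pair counts in Table 7 and the number of mainline–fork pairs per ecosystem. The sketch below reproduces that arithmetic for the merge pull request option; the helper name `share_of_pairs` and the dictionary layout are illustrative choices, while the counts are the ones reported in the text.

```python
def share_of_pairs(pairs_using: int, total_pairs: int) -> float:
    """Percentage of mainline-fork pairs that use a given integration mechanism."""
    if total_pairs <= 0:
        raise ValueError("total_pairs must be positive")
    return 100.0 * pairs_using / total_pairs


# Total mainline-fork pairs per ecosystem, as reported in the text.
TOTAL_PAIRS = {"JavaScript": 10_357, ".NET": 590}

# Pairs that integrated commits with the merge pull request option (Table 7).
MERGED_PR_PAIRS = {
    ("JavaScript", "mainline->fork"): 99,
    ("JavaScript", "fork->mainline"): 724,
    (".NET", "mainline->fork"): 9,
    (".NET", "fork->mainline"): 67,
}

for (ecosystem, direction), pairs in MERGED_PR_PAIRS.items():
    pct = share_of_pairs(pairs, TOTAL_PAIRS[ecosystem])
    print(f"{ecosystem:10s} {direction:15s} {pct:.1f} %")
```

The same helper applies to any other row of Table 7, for example the squash option in the direction of fork→ mainline.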
5.2 Git Propagation (Commit Integration Outside GitHub)
cherry-pick and git merge / rebase (gitPullMLV-FV). The summary statistics of these two commit integration techniques are presented in Table 8. In Table 7, the detailed results corresponding to the summary statistics in Table 8 are presented. We first present the results of git cherry-pick, and we follow with the results of git merge / rebase.
-
git cherry-pick commit integration: As stated in Section 3.3, commits can be cherry-picked in two directions: mainline→ fork or fork→ mainline. The two metrics gitCherrypickedMLV-FV and gitCherrypickedFV-MLV (in Table 8) correspond to these two commit integration directions, respectively, in the three ecosystems. In Fig. 11 we present boxplot distributions corresponding to the results in Table 8. All the distributions show only outliers, meaning that most pairs do not have cherry-picked commits. The detailed statistics in Table 7 reveal the same results. For example, in the upper part of Table 7, which presents the Android variants, only 5 of the 54 mainline–fork pairs (9 %) integrated commits in the direction of mainline→ fork, for a total of 250 commits. In the direction of fork→ mainline, 4 of the 54 mainline–fork pairs (7.4 %) integrated a total of 136 commits. As with the pull request integration presented earlier, we can clearly see that commit integration using git cherry-pick is rarely used in the mainline–fork variant pairs in all three ecosystems we have studied. Unlike pull request integration, where the developer syncs all new changes upstream or downstream, with git cherry-pick the developer has to search for specific commits to integrate. This requires first looking into the pool of new changes and identifying the ones of interest to cherry-pick. If the mainline and fork variant have diverged to solve different problems, then finding the interesting commits among the new changes might be laborious. We hypothesize that this could be one of the reasons why few cherry-picked commits are observed in mainline–fork variant pairs in the three ecosystems. A follow-up study to confirm or refute this hypothesis would add value to this study.
-
git merge/rebase commit integration: In Table 8 we can see the metric gitPullMLV-FV representing the git merge/rebase commit integration in the direction of mainline→ fork in the three ecosystems. Again, the medians of this metric are zero in all three ecosystems. Figure 11 shows three boxplots with the distributions of the gitPullMLV-FV metric for the mainline–fork variant pairs in the three ecosystems. The boxplots also show that the medians are all zero. In Table 7 we present the detailed statistics for the metric gitPullMLV-FV. For Android mainline–fork variant pairs, we observe 18 of the 54 mainline–fork pairs (33 %) with a total of 13,198 commits being integrated in the direction of mainline→ fork. For .NET mainline–fork variant pairs, we observe 106 of the 590 mainline–fork pairs (18 %) with a total of 5,601 commits being integrated in the direction of mainline→ fork. Finally, for JavaScript mainline–fork variant pairs, we observe 1,180 of the 10,357 mainline–fork pairs (11 %) with a total of 40,001 commits being integrated in the direction of mainline→ fork. We can see that although git merge/rebase is still rarely used in the mainline–fork variants in all three ecosystems, it is used more than the other two options of pull requests and git cherry-pick. We can conclude that git merge / rebase is the most used code integration mechanism between the variants in variant families. Again, we speculate that the lack of integration between mainline–fork variant pairs could be a result of the fork variants diverging to solve different problems from those being solved by their mainline counterparts.
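To illustrate how cherry-picked commits might be recognized outside of pull requests, the sketch below matches unique commits of the two variants by a content fingerprint of their diff (such as the one produced by `git patch-id`), since a cherry-picked commit keeps its patch but receives a new commit ID. This is a simplified illustration under stated assumptions and not the tooling used in this study: the `Commit` record, the function name, and the use of committer timestamps to guess the direction are all hypothetical choices.

```python
from typing import NamedTuple


class Commit(NamedTuple):
    sha: str           # commit ID
    patch_id: str      # fingerprint of the diff, e.g. from `git patch-id`
    committed_at: int  # committer timestamp (Unix seconds)


def find_cherry_pick_candidates(mainline, fork):
    """Return (mainline_sha, fork_sha, direction) for commits with equal patches.

    Commits whose SHA exists in both variants are inherited or merged history,
    so only commits unique to one variant are compared.
    """
    mainline_shas = {c.sha for c in mainline}
    fork_shas = {c.sha for c in fork}
    unique_mainline = [c for c in mainline if c.sha not in fork_shas]
    unique_fork = [c for c in fork if c.sha not in mainline_shas]

    by_patch = {}
    for commit in unique_mainline:
        by_patch.setdefault(commit.patch_id, []).append(commit)

    candidates = []
    for fork_commit in unique_fork:
        for mainline_commit in by_patch.get(fork_commit.patch_id, []):
            # Assumption: the copy with the earlier committer date is the source.
            if mainline_commit.committed_at <= fork_commit.committed_at:
                direction = "mainline->fork"
            else:
                direction = "fork->mainline"
            candidates.append((mainline_commit.sha, fork_commit.sha, direction))
    return candidates


# Hypothetical histories, for illustration only.
mainline_history = [Commit("a1", "p1", 100), Commit("a2", "p2", 200), Commit("a3", "p3", 300)]
fork_history = [Commit("a1", "p1", 100), Commit("f1", "p2", 250),
                Commit("f2", "p9", 260), Commit("f3", "p3", 280)]
print(find_cherry_pick_candidates(mainline_history, fork_history))
```

Rebased commits also keep their patch content while changing commit ID, so a matching fingerprint alone does not distinguish a cherry-pick from a rebase; additional rules would be needed to separate the two.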
5.2.1 Fork Variability Percentage
5.3 Summary
git merge/rebase, which is used in 33 % of Android mainline-fork pairs, 11 % of JavaScript pairs, and 18 % of .NET pairs. For integration using pull requests, developers integrate code more often in the direction of fork→ mainline than in the direction of mainline→ fork, in all the mainline–fork variants. The code integration in the direction of mainline→ fork is often done using the merge pull request option or git merge/rebase outside GitHub. Moreover, the squash and rebase pull request options are less frequently used in mainline–fork variant pairs, although the squash PR option is used more than the rebase pull request option. Finally, by comparing the fork variability percentage, we observed a high percentage difference between the fork variants and their mainline counterparts, indicated by the higher number of unique commits. These results are consistent across all the variants of the three ecosystems (i.e., Android, JavaScript, and .NET) that we studied. Our findings potentially indicate that the fork variants are being created with the intention of diverging away from the mainline to solve a different problem (i.e., with no intention to sync in any way with the original mainline). Future studies could investigate the motivation behind fork variants’ creation and why there is limited collaboration between mainline and fork variants.
6 Discussion and Implications
Java, PHP, .NET, Python, and many more have their own package managers available that host hundreds of thousands of packages. More details on the package managers can be found on https://libraries.io/, which is a platform we have used to identify and extract details about variant families from the JavaScript and .NET ecosystems. https://libraries.io/ references packages from over 37 package managers, from which one can obtain software families in the different ecosystems.
git rebase is common, as per Observation 2–RQ2. Rebasing complicates the git history, and empirical studies that do not consider rebasing may report skewed, biased, and inaccurate observations (Paixão and Maia 2019). Thus, in addition to looking beyond pull requests when studying code propagation, studies must also consider rebased commits. In this paper, we contribute reusable tooling for identifying these rebased commits.
git merge / rebase may not be the best when integrating changes in variant forks since they involve syncing upstream / downstream all the changes missing in the current branch. Alternatively, cherry picking is probably more suitable for bug fixes since the developer can choose the exact commits they want to integrate. However, GitHub’s current setup does not make it easy to identify commits to cherry-pick without digging through the branch’s history to identify relevant changes since the last code integration. As a result of the difficulty of finding commits to cherry-pick, developers may end up fixing the same bugs, which would result in duplicated effort and wasted time. To check if a possible duplication of effort occurs in our data set, we looked at the unique commits of the variants and indeed found that developers independently update files shared by the variants. For example, in the mainline–fork variant pair (k9mail / k-9, imaeses / k-9) the shared file ImapStore.java[13] has been touched by 15 different developers in 142 commits in the mainline variant, while in the fork variant it has been touched by one developer in 9 different commits. It is possible that these developers could be fixing similar bugs existing in these shared artifacts. Moreover, the study of Jang et al. (2012) reports that during the parallel maintenance of cloned code, a bug found in one clone can exist in other clones and thus needs to be fixed multiple times. Furthermore, as a result of different developers changing shared files, it is possible that these developers do not integrate code because of “fear of merge conflict.” In relation to this conjecture, several studies have reported that merging diverged code between repositories is very laborious as a result of merge conflicts (Stanciulescu et al. 2015; Brun et al. 2011; de Souza et al. 2003; Perry et al. 2001; Sousa et al. 2018; Mahmood et al. 2020; Silva et al. 2020). To this end, it would be interesting for future research to interview the developers of our forks (and further forks) to determine whether the lack of support for cherry picking bug fixes or specific functionality does indeed contribute to the lack of code propagation. In that case, a patch recommendation tool that informs developers of potentially interesting changes as soon as they are introduced in one variant, and recommends them to other variants in a family, could help save developers’ effort. The recent work by Ren et al. (2018), which focused on providing the mainline with facilities to explore non-integrated changes in forks to find opportunities for reuse, is one step in this direction.
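The check for independently updated shared files described above can be sketched as follows. This is a minimal illustration, assuming the unique commits of each variant are already available as (author, changed files) records; the function names and the example data are hypothetical and do not reproduce the analysis scripts of this study.

```python
from collections import defaultdict


def touch_stats(unique_commits):
    """Per file: number of commits touching it and the set of developers involved."""
    stats = defaultdict(lambda: {"commits": 0, "developers": set()})
    for author, files in unique_commits:
        for path in set(files):  # count a file once per commit
            stats[path]["commits"] += 1
            stats[path]["developers"].add(author)
    return stats


def independently_modified(mainline_unique, fork_unique):
    """Files changed by unique commits in both variants.

    Returns {path: {"mainline": (commits, developers), "fork": (commits, developers)}}.
    """
    mainline_stats = touch_stats(mainline_unique)
    fork_stats = touch_stats(fork_unique)
    shared = sorted(set(mainline_stats) & set(fork_stats))
    return {
        path: {
            "mainline": (mainline_stats[path]["commits"],
                         len(mainline_stats[path]["developers"])),
            "fork": (fork_stats[path]["commits"],
                     len(fork_stats[path]["developers"])),
        }
        for path in shared
    }


# Hypothetical unique commits as (author, changed files), for illustration only.
mainline_unique = [
    ("alice", ["src/ImapStore.java", "README.md"]),
    ("bob", ["src/ImapStore.java"]),
    ("alice", ["src/Pop3Store.java"]),
]
fork_unique = [
    ("carol", ["src/ImapStore.java"]),
    ("carol", ["src/ImapStore.java", "build.gradle"]),
]
print(independently_modified(mainline_unique, fork_unique))
```

Files that appear in the result are candidates for duplicated effort or future merge conflicts; whether the changes actually address the same bug would still require manual inspection.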
Our work opens up more opportunities for applying such tools since, as mentioned above with respect to identifying divergent forks, we provide a technique for identifying such forks by combining information from GitHub and the ecosystem’s main delivery platform, and we mention various other ecosystems where a similar strategy can be adopted. Finally, the limited sharing of changes can give rise to quality issues. We did not specifically investigate the propagation of test cases, which might not be propagated either. Developing techniques for propagating test cases within families could significantly enhance the quality of variants within families. The potential of test-case propagation has recently been pointed out in a preliminary study by Mukelabai et al. (2021).
7 Related Work
7.1 Variant Forking
SourceForge, before the advent of social coding environments (Nyman et al. 2012; Robles and González-Barahona 2012; Viseur 2012; Nyman and Lindman 2013; Laurent 2008; Nyman and Mikkonen 2011). These studies reported controversial perceptions around variant forks in the pre-GitHub days (Chua 2017; Dixion 2009; Ernst et al. 2010; Nyman and Mikkonen 2011; Nyman 2014; Raymond 2001). However, Zhou et al. (2020) recently reported that these perceptions have changed with the advent of GitHub. In the pre-GitHub days, variant forks were frequently considered risky to projects, since they could fragment a community and lead to confusion among developers and users. Jiang et al. (2017) state that, although forking is controversial in the traditional open source software (OSS) community, it is encouraged and is a built-in feature in GitHub. The authors further report that developers carry out social forking to submit pull requests, fix bugs, add new features, and keep copies. Zhou et al. (2020) also report that most variant forks start as social forks. Robles and González-Barahona (2012) comprehensively study a carefully filtered list of 220 potential forks of different projects that were referenced on Wikipedia. The authors assume that a fork is significant if a reference to it appears in the English Wikipedia. They found that technical reasons and discontinuation of the original project were the most common reasons for creating variant forks, accounting for 27.3% and 20% respectively. More recently, Zhou et al. (2020) interviewed 18 developers of variant forks on GitHub to understand reasons for forking in more modern social coding environments that explicitly support forking. The authors report that the motivations they observed align with the above prior studies.
7.2 Code Propagation Practices
7.3 Other Studies About Forking
LibreOffice project, which is a fork from the OpenOffice project. They wanted to understand how Open Source software communities were affected by a fork. The authors undertook an analysis of the LibreOffice project and the related OpenOffice and Apache OpenOffice projects by reviewing documented project information, performing a quantitative analysis of project repository data, and collecting first-hand experiences from contributors in the LibreOffice community. Their results strongly suggested a long-term sustainable LibreOffice community that had no signs of stagnation in the LibreOffice project 33 months after the fork. They also reported that good practice with respect to governance of Open Source software projects is perceived by community members as a fundamental challenge for establishing sustainable communities. Nyman (2014) interviewed developers to understand their views on forking. His findings from the interviews differentiate good forks, which are those that (i) revive abandoned programs, (ii) experiment with and customize existing programs, or (iii) minimize tyranny and resolve disputes by allowing involved parties to develop their own versions of the program, from bad forks, which are those that (i) create confusion among users or (ii) add extra work among developers (including both duplication of efforts and increased work if attempting to maintain compatibility).
8 Threats to Validity
anzhi, apkmirror, appsapk do not implement this strategy, which means we cannot easily identify the correct app for a given GitHub repository. Therefore, we intentionally focus only on Android apps that are distributed on the Google Play store, which limits the number of Android families we are able to identify.
9 Conclusion
Android, JavaScript, and .NET. As part of our study, we designed a systematic method to identify real variant forks as well.
SourceForge. In the future, it would be interesting to investigate a middle ground between the variant forks and social forks. For example, one could investigate if the practices observed in the variant forks are different from those of social forks.