1 Introduction
- (RQ1) What kinds of repositories does a Newcomer OSS-Candidate target? Kalliamvakou et al. (2014) showed that most repositories hosted on GitHub are non-software. However, since Newcomer OSS-Candidates intend to later onboard a software project, we test the assumption that (H1) Newcomer OSS-Candidates are more likely to target software repositories. Since GitHub users can either create their own upstream repositories or fork existing repositories, we compare these two kinds of repositories. We observe that 66% of Newcomer OSS-Candidates target software-based repositories. The statistical test indicates that hypothesis H1 is established. Furthermore, Experimental and Documentation are the most frequently targeted software repository kinds for fork and upstream repositories, i.e., 24% and 21%, respectively.
- (RQ2) What are the kinds of first contributions that come from Newcomer OSS-Candidates? Hattori and Lanza (2008) showed that OSS projects add new content to software (i.e., development) more frequently than they maintain existing code. For this RQ, our motivation is to understand whether Newcomer OSS-Candidates are more likely to add new content or to maintain the repository. By studying these two types of contributions, we test the hypothesis that (H2) contributions to GitHub repositories from Newcomer OSS-Candidates are more likely to be development activities. We analyze two kinds of GitHub contributions: a direct contribution through a commit, or a submitted Pull Request (PR). For first commit contributions, we find that 74% of contributions from Newcomer OSS-Candidates are related to development activities. For first PR contributions, our results show that 60% of contributions are associated with management activities. The statistical tests confirm that hypothesis H2 is established for first commit contributions, but not for first PR contributions.
- (RQ3) To what extent do Newcomer OSS-Candidates practice social coding with their first contributions? Since GitHub is a social coding platform, we explore the extent to which a Newcomer OSS-Candidate is likely to make a social contribution as their first contribution. Specifically, we analyze whether a Newcomer OSS-Candidate shares code, which is measured by single or multiple authorship on a file. Similar to RQ2, we explore the commit and PR contributions to test the hypothesis that (H3) Newcomer OSS-Candidates are more likely to contribute to a file with multiple authorship. Our results show that after joining GitHub, a majority of Newcomer OSS-Candidates (i.e., 73% of first commits and 59% of PRs) do not share code with other authors. Moreover, the statistical tests show that hypothesis H3 is not established for either first commit or first PR contributions.
- (RQ4) What is the proportion of Newcomer OSS-Candidates that eventually onboard an OSS project? In accordance with our definition, we explore the proportion of Newcomer OSS-Candidates who eventually onboard an OSS project. Additionally, we examine what kinds of barriers Newcomer OSS-Candidates face when onboarding OSS repositories. Our quantitative analysis shows that 30% of Newcomer OSS-Candidates eventually onboarded engineered OSS repositories. Complementarily, a follow-up user survey shows that 70% of studied participants ended up making contributions to an OSS repository. Newcomer OSS-Candidates strongly agreed that they face the barrier of finding a way to start, while social interaction received the most mixed responses as a barrier.
2 Identifying newcomer OSS-candidates
"git log --pretty=format:%ae" on the Contributors.md file provided by the community and were able to get 17,507 respondent candidates. We sent our online survey invitation to reach up to 4,000 respondent candidates through email and a Slack channel. Our survey was open from March 3, 2020 to March 31, 2020 (around a four-week period). We received 208 responses; respondents provided their GitHub IDs, which allowed us to mine their repositories and contributions. In the survey, we validate our definition of a Newcomer OSS-Candidate by asking two questions, which are presented in Table 1. Respondents were also asked about their interests and how they rank their own programming skills.
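The e-mail extraction step at the start of this section can be sketched as follows. This is a minimal illustration, not the authors' script: the function names `unique_emails` and `author_emails`, the case-insensitive de-duplication, and the optional `path` argument are assumptions made for the example. Only the `git log --pretty=format:%ae` command itself comes from the text.

```python
import subprocess
from typing import List, Optional


def unique_emails(log_output: str) -> List[str]:
    """De-duplicate author e-mails case-insensitively, keeping first-seen order."""
    seen = set()
    emails = []
    for line in log_output.splitlines():
        email = line.strip()
        if not email:
            continue  # skip blank lines in the log output
        key = email.lower()
        if key not in seen:
            seen.add(key)
            emails.append(email)
    return emails


def author_emails(repo_path: str, path: Optional[str] = None) -> List[str]:
    """Run `git log --pretty=format:%ae` in repo_path, optionally limited to one file."""
    cmd = ["git", "-C", repo_path, "log", "--pretty=format:%ae"]
    if path is not None:
        cmd += ["--", path]  # hypothetical use: restrict the log to a single file
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return unique_emails(result.stdout)
```

Keeping the parsing in `unique_emails` separate from the `git` call makes the de-duplication easy to check without a repository on disk.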
Survey Questions for Newcomer OSS-Candidate |
---|
Q1) What is your motivation to make a contribution to GitHub? |
(a) Learning to Code. |
(b) Assignment or Experiment Project. |
(c) Intend to contribute to an Open Source. |
(d) Use to showcase my programming skills. |
(e) Others. |
Q2) Did you have prior experience contributing to an OSS before GitHub? |
(Yes/No) |
Have you had any prior OSS experience? | Percent |
---|---|
No | 85% |
Yes | 15% |

(a) Answers to Q2 of the survey

What is the motivation to contribute? | Percent |
---|---|
(a) Learning to Code. | 58% |
(b) Assignment or Experiment Project. | 21% |
(c) Intend to contribute to an Open Source. | 82% |
(d) Use to showcase my programming skills. | 42% |
(e) Others | 5% |

(b) Answers to Q1 of the survey
3 Findings
3.1 Target repositories (RQ1)
- (Software) Application Software: systems that provide functionalities to end-users, like browsers and text editors.
- (Software) System Software: systems that provide services and infrastructure to other systems, like operating systems, middleware, servers, and databases.
- (Software) Web libraries and frameworks.
- (Software) Non-web libraries and frameworks.
- (Software) Software tools: systems that support software development tasks, like IDEs, package managers, and compilers.
- (Software) Documentation: repositories with documentation, tutorials, and source code examples.
- (Software) Experimental: repositories that include demos, samples, test code, and tutorial examples.
- (Non-Software) Storage: repositories that contain documents and files for personal use, such as presentation slides, resumes, e-books, and music files.
- (Non-Software) Academic: class and university research projects.
- (Non-Software) Web: websites and blogs.
- (Others) No longer accessible/Empty: repositories that returned a 404 error, or that contained only a license file, a gitignore file, a README file, or no files at all.
Category | Percent (%) | Fork & Upstream (%) |
---|---|---|
Software | 66 | Upstream (52), Fork (48) |
Non-Software | 24 | Upstream (55), Fork (45) |
Others | 10 | - |
3.2 Kinds of contributions (RQ2)
- Development (forward engineering and non-software): based on the forward-engineering type proposed by Hattori and Lanza (2008), development activities relate to the incorporation of new features and the implementation of new requirements for both software and non-software. Examples of development for non-software repositories include adding new content for websites or documentation.
- Repository Initializing (sub-category of development): derived from the forward-engineering category, we identify any first commits as the initializing commits to a new repository.
- Re-engineering: maintenance activities related to refactoring, redesign, and other actions that enhance the quality of the code without adding new features.
- Corrective Engineering: maintenance activities that handle defects, errors, and bugs in the software.
- Management: maintenance activities unrelated to codification, such as formatting code, cleaning up, and updating documentation.
First Contributions | Kinds | Percent (%) | Code (%) | Doc (%) |
---|---|---|---|---|
First Commit | Development | 31 | 98 | 2 |
First Commit | Repository Initializing | 43 | 77 | 23 |
First Commit | Re-engineering | 7 | 100 | 0 |
First Commit | Corrective Engineering | 2 | 100 | 0 |
First Commit | Management | 13 | 5 | 95 |
First Commit | Others | 4 | 100 | 0 |
First Commit | Sum | 100 | | |
Pull Request | Development | 9 | 89 | 11 |
Pull Request | Repository Initializing | 3 | 33 | 67 |
Pull Request | Re-engineering | 17 | 76 | 24 |
Pull Request | Corrective Engineering | 6 | 100 | 0 |
Pull Request | Management | 60 | 45 | 55 |
Pull Request | Others | 4 | 100 | 0 |
Pull Request | Sum | 100 | | |
3.3 Social coding in terms of multiple authorship (RQ3)
git-blame command on each contained file in the commit to check whether the files receive changes from more than one author (lines 3–4 in Algorithm 1). Considering that one PR may include multiple commits, we analyze all commits inside each PR with Algorithm 1. Specifically, we found that 21 out of 97 PRs (22%) have multiple commits.
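The authorship check described above can be approximated with a short script. Algorithm 1 is not reproduced in this section, so the following is a sketch under assumptions rather than the authors' procedure: it reads the `author-mail` headers from `git blame --line-porcelain` and treats a file as multi-authored when more than one distinct e-mail appears. The function names `blame_authors` and `has_multiple_authors` are hypothetical.

```python
import subprocess
from typing import Set


def blame_authors(porcelain: str) -> Set[str]:
    """Collect distinct author e-mails from `git blame --line-porcelain` output."""
    prefix = "author-mail "
    authors = set()
    for line in porcelain.splitlines():
        # Header lines start at column 0; file content lines start with a tab,
        # so source text that happens to contain "author-mail" is ignored.
        if line.startswith(prefix):
            authors.add(line[len(prefix):].strip("<> ").lower())
    return authors


def has_multiple_authors(repo_path: str, commit: str, file_path: str) -> bool:
    """Return True if file_path, as of commit, has lines from more than one author."""
    cmd = ["git", "-C", repo_path, "blame", "--line-porcelain", commit, "--", file_path]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return len(blame_authors(result.stdout)) > 1
```

For a PR with several commits, one possible approach is to call `has_multiple_authors` for every file touched by every commit in the PR. How the results are aggregated per PR is a design choice that the text does not specify.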
Social coding practice (First Commit) | Percent (%) |
---|---|
multiple | 27 |
single | 73 |

Social coding practice (Pull Request) | Percent (%) |
---|---|
multiple | 41 |
single | 59 |
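To illustrate how a hypothesis such as H3 could be checked against these proportions, the sketch below runs a one-sided exact binomial test against a 50% baseline. The choice of test is an assumption, since this section does not name the statistical test that was used. The count of 40 multi-author PRs is a rounding-based approximation of 41% of the 97 PRs mentioned above, not a figure reported in the source.

```python
from math import comb


def binomial_upper_tail(successes: int, trials: int, p: float = 0.5) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, i) * p**i * (1 - p) ** (trials - i)
        for i in range(successes, trials + 1)
    )


# Assumed counts: about 41% of 97 PRs, i.e. roughly 40 multi-author PRs.
p_value = binomial_upper_tail(40, 97)
print(f"one-sided p-value: {p_value:.3f}")
```

Under these assumed counts the observed share is below one half, so the upper-tail p-value is large. That is consistent with H3 not being established for PRs.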
3.4 Onboarding of newcomer OSS-candidates (RQ4)
- Fork an OSS repository. The first step for any Newcomer OSS-Candidate is to fork an OSS repository. Hence, we extracted 936 fork repositories out of a total of 2,392 repositories from the D1 dataset. Then, to identify whether each repository is an engineered software project, we matched each fork repository against a curated dataset by Munaiah et al. (2016).
- Identify contributions. During step one, we found that many participants only fork the repository, without contributing back to either the fork or the upstream repository. Hence, we performed an in-depth analysis of two particular ways of onboarding, i.e., contributing to either the fork or the upstream repository.
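The matching in the first step can be sketched as a set lookup. The layout of the curated dataset is not described here, so this example assumes that both the fork parents and the curated projects are available as `owner/repo` strings. The function name and the sample identifiers are hypothetical.

```python
from typing import Iterable, List


def match_engineered(fork_parents: Iterable[str], engineered: Iterable[str]) -> List[str]:
    """Return the fork parents that appear in the curated list (case-insensitive)."""
    curated = {name.lower() for name in engineered}
    return [name for name in fork_parents if name.lower() in curated]


# Hypothetical identifiers, for illustration only.
forks = ["octo/Widget", "someone/notes", "acme/lib"]
curated_projects = ["octo/widget", "acme/lib", "other/tool"]

matched = match_engineered(forks, curated_projects)
share = 100 * len(matched) / len(forks)
print(f"{len(matched)} of {len(forks)} forks matched ({share:.0f}%)")
```

Lower-casing both sides is a design choice for this sketch, on the assumption that the two sources may spell the same repository name with different capitalization.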
Match to the Munaiah et al. (2016) dataset | Onboarding Steps | Count (#) | Percent (%) |
---|---|---|---|
Started Onboarding Process: | Fork an OSS repository (51%) | 81 | 49 |
Started Onboarding Process: | Contribute to fork OSS repository (22%) | | |
Eventually Onboarded: | Contribute to original OSS repository (30%) | | |
Not Onboard: | | 85 | 51 |
Sum | | 166 | 100 |