After discussing our target project, we now proceed to detail our study. We first state our goal and research questions, then describe the collected metrics, and finally present the procedure of the two parts of our study, namely repository mining and a survey.
4.1 Goal and research questions
To design our study, we followed the goal-question-metric (GQM) paradigm (Basili et al. 1986). Therefore, we first specify our goal using the GQM template and then derive research questions from it. Our goal is detailed next.
To understand the factors that influence code review in distributed software development, we characterize and evaluate the relationship between different influence factors and code review effectiveness, from the perspective of the researcher, in the context of code review performed by software developers in a single-project study.
Based on this goal, we derived a set of research questions, each associated with one of the influence factors investigated in our study, which include both technical and non-technical factors. As discussed, although some of these factors have been investigated in the past, our goal is to analyze them in the context of DSD. For brevity, we refer to our investigated scenario as distributed code review. Our research questions are as follows.
RQ-1: Does the number of lines of code to be reviewed influence the effectiveness of distributed code review?
RQ-2: Does the number of involved teams influence the effectiveness of distributed code review?
RQ-3: Does the number of involved development locations influence the effectiveness of distributed code review?
RQ-4: Does the number of active reviewers influence the effectiveness of distributed code review?
We investigate the number of teams and locations separately because the former captures distribution among teams, allowing us to analyze the impact of involving reviewers that have limited interaction and different (possibly conflicting) project priorities and goals, while the latter additionally captures the impact of geographic distribution.
4.2 Influence factors and outcomes
Each research question is associated with an influence factor whose impact on the effectiveness of distributed code review is to be investigated. However, there is no single metric to measure review effectiveness. Therefore, we measure a set of code review outcomes that can be used as indicators of review effectiveness. Before detailing these outcomes, we further specify our influence factors—listed in the order of our research questions—and detail how they are measured.
Patch Size (LOC) The patch size (LOC) refers to the number of lines of code added or modified in a commit, which thus need to be reviewed. The lines of code considered are those present in the final version of the code, after going through the reviewing process.
Teams Teams refer to the number of distinct teams associated with the author and invited reviewers. If the author and all reviewers belong to the same team, the value associated with this influence factor is 1.
Locations Locations refer to the number of distinct geographically distributed development sites associated with the author and invited reviewers. If the author and all reviewers work at the same development site, the value associated with this influence factor is 1.
Active Reviewers Active reviewers are those, among the invited reviewers, that actually participate in the reviewing process—with comments or votes. Although this can be seen as an outcome of the review, given that there is no control over how many of the invited reviewers will actually participate, we aim to explore whether the number of active reviewers influences other outcomes, such as duration. Therefore, active reviewers are investigated as an influence factor, consisting of the number of reviewers that contributed to the review.
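To make these measurements concrete, the following minimal Python sketch shows how the four influence factors could be derived from a single review record; the Person and Review structures and their field names are illustrative assumptions, not the schema actually used in the study.

```python
# Minimal sketch of how the influence factors could be derived from a single
# review record. The Person/Review structures and field names are illustrative
# assumptions, not the schema actually used in the study.
from dataclasses import dataclass
from typing import List

@dataclass
class Person:
    name: str
    team: str       # team the developer belongs to
    location: str   # development site where the developer works

@dataclass
class Review:
    author: Person
    invited_reviewers: List[Person]   # human reviewers invited to the review
    active_reviewers: List[Person]    # invited reviewers that commented or voted
    patch_size_loc: int               # LOC added or modified in the final, approved version

def influence_factors(review: Review) -> dict:
    people = [review.author] + review.invited_reviewers
    return {
        "patch_size_loc": review.patch_size_loc,
        "teams": len({p.team for p in people}),
        "locations": len({p.location for p in people}),
        "active_reviewers": len(review.active_reviewers),
    }
```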
We now describe the analyzed code review outcomes that indicate review effectiveness. Code review is effective when it achieves its goals, which can be to identify defects in the code in a timely manner, to identify issues related to code maintainability and legibility, or even to disseminate knowledge. However, these goals might include constraints regarding the impact on the development process and the invested effort.
It is not trivial to evaluate whether these goals are achieved. For example, Bosu et al. (2015) created a model to evaluate whether the comments of a code review are useful based on the text of the given comments. This measurement, however, may not be precise. In our work, we focus on measurements that are more objective.
We thus selected four objective outcomes, described as follows. The first is related to project time constraints, while the remaining three are related to the input from other developers (reviewers), possibly leading to fewer failures, code quality improvements, and knowledge dissemination.
Duration (DUR) Duration counts how many days the code review process lasted, from the day the source code is made available to be reviewed to the day it received the last approval of a reviewer.
Participation (PART) Participation consists of the fraction of invited reviewers that are active, ranging from 0% (no invited reviewer participates) to 100% (all invited reviewers participate). Automated reviewers are not taken into account.
Comment Density (CD_G) Instead of simply counting the number of review comments, we take into account the amount of code to be reviewed. Therefore, comment density refers to the number of review comments divided by the number of groups of 100 LOC under review, thus giving the average number of review comments for each 100 LOC. Review comments can be any form of interaction, e.g. approval, rejection, question, idea, or other types of comments made by any reviewer—votes count as comments because they are a form of input and have a particular meaning. The multiplying factor of 100 is used to avoid small fractional numbers, which are harder to compare and less intuitive. Comments from automated reviewers are ignored, as this type of feedback is constant, regardless of human interactions.
Comment Density by Reviewer (CD_R) It is expected that the higher the number of reviewers or teams, the higher the number of comments. Therefore, CD_G alone can lead to the wrong conclusion that discussions were productive when many reviewers are involved. We thus also analyze comment density by reviewer, given by dividing the comment density by the number of active reviewers (without taking automated reviewers into account).
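As an illustration of how these outcomes relate to the raw review data, the sketch below computes DUR, PART, CD_G, and CD_R for a single review; the function signature and the example values are assumptions made only for illustration.

```python
# Illustrative computation of the four outcomes for a single review. The
# function signature and the example values are assumptions for illustration;
# automated reviewers are assumed to be excluded from all counts beforehand.
from datetime import date

def outcomes(available: date, last_approval: date,
             invited: int, active: int,
             comments: int, patch_loc: int) -> dict:
    dur = (last_approval - available).days                      # DUR: duration in days
    part = active / invited if invited else 0.0                 # PART: fraction of invited reviewers that are active
    cd_g = comments / (patch_loc / 100) if patch_loc else 0.0   # CD_G: comments per 100 LOC
    cd_r = cd_g / active if active else 0.0                     # CD_R: comment density per active reviewer
    return {"DUR": dur, "PART": part, "CD_G": cd_g, "CD_R": cd_r}

# Example: a 120-LOC patch reviewed over 3 days, 4 of 5 invited reviewers active,
# 6 comments (votes included) -> PART = 0.8, CD_G = 5.0, CD_R = 1.25.
print(outcomes(date(2017, 3, 1), date(2017, 3, 4),
               invited=5, active=4, comments=6, patch_loc=120))
```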
As we are interested in the effectiveness of code review, we next describe what we consider an effective code review, based on the outcomes described above.
Too short or too long code review. There are studies (Kemerer and Paulk 2009; Ferreira et al. 2010) that suggest time constraints for code review activities, such as limits on the number of lines reviewed per hour and on the total number of hours spent doing code review in a single day. Such limits are imposed because, due to tiredness, code review may become error-prone or consume more time and resources to be finished. Moreover, if the review takes too long to be completed (i.e. has a high duration), developers may be prevented from continuing their work, and work does not get done. Therefore, shorter code reviews are preferred. However, if a review is too short, it may also mean that reviewers have not properly analyzed the change.
Low reviewer participation. When reviewers are invited to participate in the review, it is expected that they contribute. However, not all of them participate. Therefore, the higher the participation of reviewers, the better. Nevertheless, we do not expect participation to be 100%, given that some developers are invited automatically and may no longer be relevant reviewers.
Few contributions from reviewers. Reviewers may contribute in different ways, ranging from a simple vote to long discussions. We assume that the higher the number of comments made by reviewers, the more fruitful the discussion and, consequently, the more effective the review. However, as explained, we do not consider the absolute number of comments, but their density with respect to the amount of code to be reviewed. Moreover, we consider the amount of contribution both generally (CD_G) and by reviewer (CD_R). For both, the higher, the better.
Although in some situations a low number of comments (either generally or by reviewer) is enough—for example, when a low number of comments helped to improve the code, or the change to be reviewed is minor—note that these are not the usual case. Because we analyze a high number of code reviews, these exceptional cases do not significantly impact the results. Moreover, votes count as comments; consequently, even if there is no need for long discussions, it is important to have at least the acknowledgement of the reviewer in the form of a vote, i.e. a comment. Finally, we also analyze duration and participation, which complement the analysis of the effectiveness of code review.
4.3 Procedure: repository mining
In short, the procedure of the first part of our study consists of extracting information from a code review database and analyzing the relationship between influence factors and outcomes. In this section, we detail how we operationalized this.
Our data is extracted from Gerrit, a tool that provides the management of Git repositories with fine-grained control over the permissions of users and groups. It also provides a mechanism to implement code review, with its associated approvals, allowing votes, comments, and editing of the source code. Every interaction among authors and reviewers is recorded, including the comments and votes of review bots, which are automated reviewers. Gerrit also provides a sophisticated query mechanism to get information about all open and closed reviews. We next describe the steps taken to obtain, process, and filter the data for this study using Gerrit.
Getting raw data from the code review database First, we fixed a time frame in the past so that we could obtain data on completed code reviews. Gerrit provides a query mechanism (Google 2017b) that can be used to get structured information about code reviews in JSON format. One query for each week had to be made due to the limitation of obtaining at most 500 results per query.
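As an illustration of this step, the sketch below issues one query per week; the host URL, the REST /changes/ endpoint, the query operators, and the dates are assumptions made for the example, since the study relied on Gerrit's query mechanism (Google 2017b) rather than on this specific code.

```python
# Hedged sketch of the week-by-week extraction. The host URL, the REST
# /changes/ endpoint, and the query operators below are assumptions made for
# this example; the dates are placeholders.
import json
from datetime import date, timedelta

import requests  # third-party HTTP client, assumed available

GERRIT_URL = "https://gerrit.example.com"  # placeholder host

def reviews_for_week(start: date) -> list:
    end = start + timedelta(days=7)
    query = f'status:merged after:"{start}" before:"{end}"'
    resp = requests.get(f"{GERRIT_URL}/changes/", params={"q": query})
    resp.raise_for_status()
    # Gerrit prefixes its JSON responses with ")]}'" to prevent XSSI; strip it.
    return json.loads(resp.text.removeprefix(")]}'"))

# One query per week keeps each result set below the 500-result cap.
all_reviews = []
week = date(2017, 1, 2)
while week < date(2017, 7, 3):
    all_reviews.extend(reviews_for_week(week))
    week += timedelta(days=7)
```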
Parsing and filtering code review information The retrieved JSON files provided part of our required data. The remaining data had to be computed from the raw data obtained from the internal Gerrit database model. The resulting data was filtered, discarding reviews of the types of modules described next (a small filtering sketch follows the list).
1. Documentation. Some repositories are used exclusively for internal documentation of the project (processes and products), and have a different workflow and time constraints.
2. Third-party software. Some repositories are maintained by open source communities or component vendors. Local internal copies of these repositories exist for traceability and to avoid downloading them multiple times. Reviewers do not review code in these repositories.
3. Binary artifacts. Some repositories contain binary files, e.g. images and libraries. These files are not reviewed and, if considered, would (incorrectly) increase the patch size.
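The following minimal filtering sketch illustrates how these three module types could be discarded; the repository names and file extensions are placeholders, not the project's actual naming scheme.

```python
# Illustrative filter for the three discarded module types. The repository
# names and file extensions are placeholders, not the project's actual naming.
DOC_REPOS = {"internal-docs", "process-docs"}        # documentation-only repositories
THIRD_PARTY_REPOS = {"vendor-mirror", "oss-mirror"}  # local copies of external code
BINARY_EXTENSIONS = (".png", ".jpg", ".so", ".jar")

def keep_review(repository: str, changed_files: list) -> bool:
    if repository in DOC_REPOS or repository in THIRD_PARTY_REPOS:
        return False
    # Keep the review only if it touches at least one non-binary file; binary
    # artifacts are never reviewed and would inflate the patch size.
    return any(not f.endswith(BINARY_EXTENSIONS) for f in changed_files)
```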
Representing and analyzing data Given that we have four research questions, with four associated influence factors, as well as four outcomes, there is a large amount of data to be analyzed. Our data consists essentially of continuous or discrete positive numbers, with different scales and ranges. For example, there are only four involved locations, while the patch size can be up to approximately 4 KLOC. To deal with these discrepancies, we adopted an approach similar to that of Baysal et al. (2016). We clustered the data into groups, representing the variance of the outcomes in each group using box plots. Additionally, we performed statistical tests to identify groups that are significantly different from each other.
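The sketch below illustrates this kind of analysis for one influence factor and one outcome; the sample values are fabricated, and the Kruskal-Wallis test is shown only as an example of a test for group differences, not necessarily the test applied in the study.

```python
# Hedged sketch of the analysis step: group reviews by one influence factor,
# draw box plots of one outcome per group, and test whether the groups differ.
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

df = pd.DataFrame({
    "teams": [1, 1, 2, 2, 3, 3],                 # influence factor (illustrative values)
    "duration": [1.0, 2.0, 3.5, 4.0, 6.0, 5.5],  # outcome in days (illustrative values)
})

# Box plot of the outcome for each group of the influence factor.
df.boxplot(column="duration", by="teams")
plt.savefig("duration_by_teams.png")

# Non-parametric test for differences between the groups (example choice).
groups = [g["duration"].values for _, g in df.groupby("teams")]
print(stats.kruskal(*groups))
```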
4.4 Procedure: survey
Our survey collected anonymous data from participants. The questionnaire they were given included four main parts: (i) presentation of the study and consent to participate; (ii) demographic data of the participants; (iii) participant background and experience; and (iv) questions about how the outcomes of code review are affected by the influence factors.
In the first part of the questionnaire, we briefly introduced modern code review and stated the goal of our survey. We explicitly declared that participation was voluntary and informed participants that any information that could reveal their identity would be kept confidential. Next, we presented the adopted terminology to ensure that the terms used in the questions would be correctly understood. The introduced terms are: teams, locations, authors, reviewers, and active reviewers. Next, participants were asked about their demographic information as well as their background and expertise, as shown in Table 2.
Finally, we asked participants to assess the relationship between influence factors and review outcomes. The considered influence factors are the same as above. However, as outcomes, we considered duration, participation, and number of comments. The reason for not distinguishing total comments from comments by reviewer is our assumption that it is hard for a participant, without data, to assess these separately. Questions associated with each outcome followed the prototype presented in Table 3, replacing X by the name of the outcome.
Table 3 Prototype of the main questions of the survey

Influence factor | (1) Much worse | (2) Worse | (3) No influence | (4) Better | (5) Much better | I am unable to inform
Patch size (LOC) | | | | | |
Teams | | | | | |
Locations | | | | | |
Active reviewers | | | | | |
For each influence factor, we used a 5-point Likert scale plus an opt-out option, so participants could provide one of six possible answers: (1) much worse, (2) worse, (3) no influence, (4) better, (5) much better, or "I am unable to inform." Our questions assumed that there is a directly or inversely proportional relationship between influence factors and outcomes. If participants believed this was not the case, they were asked to state that they were unable to inform and to describe their opinion in an open-ended question. Our questionnaire was validated with a pilot study.