4.1 Motivation for conducting the replication
RE2’s motivation was twofold: to overcome the potential limitations posed by BE’s and RE1’s threats to validity, and to learn the effect of systematic changes to the experimental design on the results. To this end, we went over the replications’ designs and reflected on their shortcomings and points for improvement.
For example, BE’s and RE1’s research question (i.e., do programmers produce higher quality programs with TDD than with ITL?) tests the hypothesis that TDD is superior to ITL. However, to the best of our knowledge, there is no theory stating so. Besides, the evidence in the literature is conflicting (i.e., positive, negative, and neutral results have been obtained). Thus, to avoid running into a threat to conclusion validity due to the directionality of the effect (as the difference in performance between TDD and ITL is more likely to be statistically significant with a one-tailed test than with a two-tailed test if TDD outperforms ITL (Cumming 2013)), we removed the directionality of the effect from the research question.
In addition, BE’s subjects were divided into three groups (low, medium, or high) based on their skills before being randomized to either ITL or TDD. This was done to balance the distribution of the subjects’ skills across the treatments. Unfortunately, BE’s authors did not report which skills they measured, or how they measured them. Besides, there were no clear cutoff points between the categories (low, medium, or high), nor a clear definition of how skill levels were combined to classify a subject into a specific category. RE1 overcame this threat to internal validity by assigning the subjects to the treatments without considering a preliminary set of skills (i.e., by means of full randomization). However, RE1 incurred another threat to internal validity (i.e., confounding), as some participants were grouped into pairs due to a lack of computer stations. We avoided both threats to validity by having each participant code alone and apply both treatments twice.
Nuisance factors outside the researchers’ control could have affected BE’s results—as the participants were allowed to work outside the laboratory. RE1 overcame this threat to internal validity by only allowing the participants to work inside the laboratory. Our replication follows such improvement.
Another difference across the replications is the maximum time the participants were allowed to implement the task: 8 h or more in BE, 3 h in RE1, and 2 h and 15 min in RE2. Reducing the length of the experimental session makes it possible to avoid fatigue’s potential effect on the results. Such a time reduction was possible due to the provision of stubs in RE1 and RE2.
BE’s and RE1’s thresholds for considering a user story as delivered (i.e., 50% of the user story’s assert statements) are artificial and may have impacted quality results. We address this threat to construct validity in RE2 as we set no threshold for measuring quality (see below).
BE’s weighting of user stories is another threat to construct validity, as such weights are subjective and depend upon the test suite implementer. User story weighting was removed in RE1 and RE2.
BE’s and RE1’s participants coded only one task. This results in low external validity. In RE2, we improve the external validity of the results by using four tasks, which allows studying the effectiveness of TDD across different tasks.
BE’s and RE1’s participants applied only one treatment. A threat to internal validity known as compensatory rivalry might have materialized (i.e., loss of motivation due to the application of the less desirable treatment, in this case ITL). RE2’s subjects applied both treatments (ITL and TDD) instead of only one, ruling out the possible influence of this threat.
RE1’s subjects were trained in both TDD and ITL before the experimental session. This might pose a threat to internal validity: leakage from one development approach to another may materialize if subjects apply a mixed development approach in either the TDD or the ITL group. This issue was addressed in BE (as subjects were trained only in their assigned treatment before the experimental session). We also addressed this threat in RE2 by training the subjects only in the treatment to be applied immediately afterwards.
BE’s participants were undergraduate students. This poses a threat to the generalization of the results to other types of developers. RE1’s participants were a mixture of undergraduate and graduate students. Although this affords greater external validity, it poses a threat to internal validity due to confounding with subject type. All of RE2’s participants were graduate students: this may increase the external validity of the results without incurring such confounding.
Table 2 shows the threats to validity of the experiments that were overcome in further replications. Rows with a positive sign (+) represent improvements upon the experimental settings marked with a negative sign (−).
Table 2 Baseline experiment and replications’ threats to validity

Validity | Threat | BE | RE1 | RE2
Internal | Compensatory rivalry | TDD/ITL (−) | TDD/ITL (−) | TDD and ITL (+)
Internal | Confounding | No (+) | Pair programming (−) | No (+)
Internal | Confounding | No (+) | Graduate/undergraduate (−) | No (+)
Internal | Task execution | Lab and remote work (−) | Lab (+) | Lab (+)
Internal | Stubs | No (−) | Yes (+) | Yes (+)
Internal | Task duration | +8 h (−) | 3 h (+) | 2.25 h (+)
Internal | Allocation | Stratified (i.e., skill) (−) | Full randomization (+) | Full randomization (+)
Internal | Leakage | No (+) | Yes (−) | No (+)
Construct | Operationalization | 50% threshold (−) | 50% threshold (−) | None (+)
Construct | Operationalization | Weighting (−) | None (+) | None (+)
External | Mono-operation bias | BSK (−) | BSK (−) | BSK, MR, MF, SDK (+)
External | Subject type | Undergraduate (−) | Graduate/undergraduate (−) | Graduate (+)
As Table 2 shows, RE1 still suffers from threats to validity present in BE. Besides, RE1 also falls into a construct and an internal validity threat that were not present in BE. RE2 overcame validity issues from both experiments (Gómez et al. 2014).
4.3 Changes to the previous replications
According to the classification suggested by Gómez et al. (2014), RE2 modified all BE’s and RE1’s dimensions:
Operationalization: because the response variable’s (i.e., quality) operationalization was changed.
Population: because the population changed from undergraduate to graduate students.
Protocol: because the tasks and session lengths were modified.
Experimenter: because the replications were run and analyzed by different researchers.
In the following, we provide greater detail on the changes made.
4.3.1 Research question and response variable
We are not aware of any theory indicating that TDD produces higher quality software than ITL. Thus, we removed the directionality of BE’s and RE1’s RQ. In particular, we restate RE2’s RQ as: do programmers produce programs of different quality with TDD than with ITL?
RE2’s response variable is quality. We measured quality with acceptance test suites that we (i.e., the experimenters) developed. We used a standardized metric to measure quality: functional correctness. Functional correctness is one of the sub-characteristics of quality defined in ISO 25010 and is described as “the degree to which a system provides the correct results with the needed degree of precision” (ISO/IEC 25010:2011 2011). We measure functional correctness as the proportion of passing assert statements (#assert(pass)) over the total number of assert statements (#assert(all)). Specifically:
$$ \mathrm{QLTY} = \frac{\#\mathrm{assert}(\mathit{pass})}{\#\mathrm{assert}(\mathit{all})} \times 100 $$
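For concreteness, a minimal Python sketch of this computation (the helper name is ours for illustration; in RE2 the assert counts come from running our acceptance test suites):

```python
def qlty(asserts_passed: int, asserts_all: int) -> float:
    """Functional correctness: percentage of passing assert statements."""
    if asserts_all == 0:
        raise ValueError("the task's test suite must contain at least one assert")
    return asserts_passed / asserts_all * 100

# e.g., a subject passing 34 of MR's 89 asserts (cf. Table 4) scores ~38.2
print(qlty(34, 89))
```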
We regard this metric as more straightforward than that used in BE and RE1. First, it does not require any subjective threshold (e.g., 50% of assert statements passing to consider a functionality as delivered). Second, QLTY is no longer bounded between 50 and 100% and can instead vary across the whole percentage interval (0–100%). Third, our metric measures overall quality rather than only the quality of the delivered functionality. Thus, subjects delivering smaller amounts of high quality functionality are “penalized” compared with those delivering larger amounts of high quality functionality.

A 1-week seminar on TDD was held at the Universidad Politécnica de Madrid (UPM) in March 2014. A total of 18 graduate students took part in the seminar. They all had varying degrees of experience in software development and unit testing. All subjects were studying for an MSc in Computer Science or Software Engineering at UPM. Master’s students were free to join the seminar to earn extra credits for their degree program. The seminar was not graded.
Participants were informed that they were taking part in an experiment, that their data were totally confidential, and that they were free to drop out of the experiment at any time.
Before the experiment was run, the participants filled in a questionnaire. The questionnaire asked the participants about their previous experience with programming, Java, unit testing, JUnit, and TDD. Specifically, subjects were allowed to select one of four experience values: no experience (< 2 years); novice (≥ 2 and < 5 years); intermediate (≥ 5 and < 10 years); expert (≥ 10 years). We code such experience levels with numbers between 1 and 4 (no experience, ..., expert). Table 3 shows RE2’s participants’ experience levels.
Table 3 RE2 subjects’ experience (coded 1 = no experience to 4 = expert)

Experience | Median | Mode | Min | Max
Programming experience | 2 | 3 | 1 | 3
Unit testing experience | 1 | 1 | 1 | 3
Java experience | 2 | 2 | 1 | 3
JUnit experience | 1 | 1 | 1 | 2
TDD experience | 1 | 1 | 1 | 2
As Table 3 shows, most of the subjects had 5 to 10 years of experience with programming (mode = 3) and 2 to 5 years of experience with Java (mode = 2). Besides, the participants had little experience with unit testing or JUnit: fewer than 2 years (mode = 1). Their experience with TDD was also limited (fewer than 2 years).
4.3.2 Design
RE2’s experiment was structured as a four-session within-subjects design. Within-subjects designs offer advantages over between-subjects designs (Brooks 1980): (1) reduced variance, and thus greater statistical power, because within-subject rather than between-subject differences are studied; (2) an increased number of data points, and thus greater statistical power, as each subject provides as many measurements as there are experimental sessions; (3) subject abilities, whether above or below the norm, have the same impact on all the treatments (as all subjects are exposed to all the treatments).
RE2’s 18 subjects applied ITL and TDD twice each, in non-consecutive sessions (ITL was applied on the first and third days, TDD on the second and fourth). Thus, up to 72 data points (18 subjects times four sessions) could potentially be used to compare ITL and TDD. Subjects were trained according to the order of application of the treatments, and they worked only in the laboratory.
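A minimal sketch of the resulting long-format data layout (subject and session identifiers are illustrative; the task coded in each session is omitted):

```python
import itertools

SESSIONS = {1: "ITL", 2: "TDD", 3: "ITL", 4: "TDD"}  # order of application
subjects = range(1, 19)                              # 18 participants

# One row per subject per session: up to 72 potential data points.
rows = [
    {"subject": s, "session": day, "approach": approach}
    for s, (day, approach) in itertools.product(subjects, SESSIONS.items())
]
print(len(rows))  # 72
```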
4.3.3 Artifacts
The participants coded four different tasks (i.e., BSK, SDK, MR, and MF).
BSK’s specifications were reused from Fucci and Turhan (2013). Appendix A shows BSK’s specification.
SDK is a greenfield programming exercise that requires the development of various checking rules against a proposed solution for a Sudoku game. Specifically, subjects must deal with string and matrix operations and with embedding such functionalities inside a single API method. We provide SDK’s specifications in Appendix B.
MR is a greenfield programming exercise that requires the development of a public interface for controlling the movement of a fictitious vehicle on a grid with obstacles. MR is a popular exercise used by the agile community to teach and practice unit testing. Appendix C contains MR’s specifications.
MF is an application intended to run on a GPS-enabled, MP3-capable mobile phone. It resembles a real-world system with a three-tier architecture (graphical user interface, business logic, and data access). The system consists of three main components that are created and accessed using the singleton pattern. Event handling is implemented using the observer pattern. Subjects were given a description of the legacy code, including existing classes, their APIs, and a diagram of the system architecture (see Appendix D).
Table 4 shows the number of user stories, test cases, and asserts for each task’s test suite.
Table 4 Number of user stories, test cases, and asserts per task

Task | User stories | Test cases | Asserts
SDK | 6 | 11 | 13
BSK | 13 | 48 | 56
MR | 11 | 52 | 89
MF | 11 | 45 | 123
4.3.4 Context variables
The experiment was run in a laboratory with computers running a virtual machine (VM) (Oracle 2015) with the Eclipse IDE (2016), JUnit (Massol and Husted 2003), and a web browser. Due to time restrictions, subjects received a Java stub to help them jump-start the implementation.
4.4 Analysis approach
We ran the data analysis with IBM SPSS Statistics Version 24. First, we provide descriptive statistics and box plots for QLTY. Then, we analyze the data with a linear marginal model (LMM). LMMs are linear models in which the residuals are not assumed to be independent of each other or to have constant variance (West et al. 2014). Instead, LMMs can accommodate different variances and covariances across time points (i.e., each of the experimental sessions). LMMs require normally distributed residuals. In the absence of normality, data transformations can be used (e.g., Box-Cox transformations (Vegas et al. 2016)).
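As an illustration, a Box-Cox transformation can be obtained as follows (a sketch with placeholder data; the analysis itself was run in SPSS):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
qlty_scores = rng.uniform(0, 100, size=72)  # placeholder QLTY measurements

# Box-Cox requires strictly positive inputs; QLTY may be 0, so shift by 1.
transformed, lam = stats.boxcox(qlty_scores + 1)
print(f"estimated lambda = {lam:.2f}")
```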
In particular, we fitted an LMM with the following factors: development approach, task, and development approach by task. We included the task factor and its interaction with the development approach to reduce the unexplained variance of the model. After fitting various LMMs with different variance-covariance matrix structures, we selected the unstructured matrix as the best fit to the data, according to the criterion of lowest −2 log likelihood and to West et al.’s suggestion (West et al. 2014).
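We fitted the models in SPSS. For readers without SPSS, a rough open-source analogue is sketched below, assuming a hypothetical long-format file re2_long.csv with columns subject, session, approach, task, and qlty; statsmodels’ GEE fits marginal models and supports an unstructured working covariance:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.genmod.cov_struct import Unstructured

df = pd.read_csv("re2_long.csv")  # hypothetical long-format data file

model = smf.gee(
    "qlty ~ C(approach) * C(task)",     # approach, task, and their interaction
    groups="subject",                   # repeated measures within subjects
    time=df["session"].to_numpy() - 1,  # session indexes covariance positions
    cov_struct=Unstructured(),          # counterpart of SPSS's UN structure
    family=sm.families.Gaussian(),
    data=df,
)
print(model.fit().summary())
```

Note that GEE is estimated via quasi-likelihood, so the −2 log likelihood comparison we performed in SPSS has no direct counterpart in this sketch.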
We report the differences in quality across development approaches, tasks, and development approaches within tasks. Afterwards, we check the normality assumption of the residuals with the Kolmogorov-Smirnov test and the skewness and kurtosis z-scores (Field 2009). Finally, we use QQ plots to visually assess the normality of the residuals.
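A sketch of these checks (placeholder residuals; SciPy’s skewtest and kurtosistest return the z-scores directly):

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
resid = rng.normal(size=72)  # placeholder for the fitted model's residuals

# Kolmogorov-Smirnov test against a normal with the residuals' own moments
ks = stats.kstest(resid, "norm", args=(resid.mean(), resid.std(ddof=1)))

# z-scores for skewness and kurtosis (cf. Field 2009)
skew_z, skew_p = stats.skewtest(resid)
kurt_z, kurt_p = stats.kurtosistest(resid)
print(ks.pvalue, skew_z, kurt_z)

sm.qqplot(resid, line="s")  # QQ plot of residuals against the normal
plt.show()
```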
We complement the statistical results with Hedges’ g effect sizes (Cook et al. 1979; Hedges and Olkin 2014) (i.e., Cohen’s d with a small sample size correction (Cohen 1977)) and their respective 95% confidence intervals (95% CIs). This may facilitate the incorporation of the results into further meta-analyses (Borenstein et al. 2011). We report Hedges’ g due to its widespread use in SE (Kampenes et al. 2007) and its intuitive interpretation: the number of standard deviations by which one experimental group’s mean differs from another’s.
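The computation is standard; a self-contained sketch using the usual formulas (pooled standard deviation, correction factor J, and the variance approximation from Borenstein et al. 2011; the example data are invented):

```python
import numpy as np

def hedges_g(x, y):
    """Hedges' g (Cohen's d with small-sample correction) and approx. 95% CI."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    dof = n1 + n2 - 2
    s_pooled = np.sqrt(((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / dof)
    d = (x.mean() - y.mean()) / s_pooled  # Cohen's d
    j = 1 - 3 / (4 * dof - 1)             # small-sample correction factor
    g = j * d
    var_g = j**2 * ((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    half = 1.96 * np.sqrt(var_g)
    return g, (g - half, g + half)

# e.g., comparing TDD vs. ITL QLTY scores (illustrative data):
tdd = [60.0, 72.5, 55.0, 81.0]
itl = [50.0, 66.0, 48.5, 70.0]
print(hedges_g(tdd, itl))
```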
4.6 Threats to validity
In this section, we report RE2’s threats to validity following Wohlin et al.’s conventions (Wohlin et al. 2012). We prioritize the threats to validity according to Cook and Campbell’s guidelines (Cook et al. 1979).
Conclusion validity concerns the statistical analysis of results (Wohlin et al. 2012).
We provide visual and numerical evidence with regard to the validity of the required statistical assumptions. We performed data transformations (i.e., Box-Cox transformations) to assess the robustness of the findings. As the results were consistent across statistical analyses, for simplicity’s sake, we interpreted the statistical analysis of the untransformed data. Readers interested in the analysis of the transformed data may request it by contacting the authors.
The random heterogeneity of the sample threat might have materialized, since the software development experience of the participants ranged from a few months to 10 years. This might have biased the results towards the average performance of both populations, thus resulting in non-significant results.
Internal validity is the extent to which the results are caused by the treatments and not by other variables beyond researchers’ control (Wohlin et al. 2012).
A threat to internal validity results from the participants’ use of an unfamiliar programming environment (e.g., OS and IDE). However, we tried to mitigate this threat by making all the participants use the same environment during the experiment. We thus expect the environment to have an equal impact on both treatments and, therefore, not to affect the results.
There is a potential maturation threat: the course was a 5-day intensive course on TDD and contained multiple exercises and laboratories. As a result, factors such as tiredness or inattention might be at work. To minimize this threat, we let the students choose the schedule that best suited their needs before starting the experiment. We also ensured that subjects were given enough breaks. However, this threat might have materialized, given the drop in quality observed with TDD in the last session (Friday).
A training leakage effect may have distorted the results. Even though training leakage was out of the question in the first session (as the subjects were trained in each development approach only when necessary), it was a possibility in the second, third, and fourth sessions. In particular, subjects might have applied a mixed development approach once they had knowledge of both development approaches. They may also have applied their preferred technique. To mitigate this threat to validity, we encouraged subjects to adhere to the development approaches as closely as possible in every experimental session.
There was also the possibility of a diffusion threat: since subjects performed different development tasks in each experimental session, they could compare notes at the end of the sessions. This would give them advance knowledge of the tasks to code in the following days, which could lead to an improvement in their performance. To mitigate this threat, we encouraged subjects not to share any information on the tasks until the end of the 5-day training course. Furthermore, we informed the subjects that their performance would not have an impact on their grades. Thus, we believe that the participants did not share any information, as requested. Since quality dropped in the second application of TDD, we are confident that this threat did not materialize.
Additionally, our experiment was exposed to the attrition threat (loss of two participants).
Construct validity refers to the correctness of the mapping between the theoretical constructs to be investigated and the operationalizations of the variables in the study.
The study suffers from the mono-method bias threat since only one metric was used to measure the response variable (i.e., quality). This issue was mitigated by interpreting the results jointly with BE and RE1 (see below).
The concepts underlying the cause construct used in the experiment appear to be clear enough not to constitute a threat. The TDD cycle was explained according to the literature (Beck 2003). However, some articles point out that TDD is a complex process and might not be consistently applied by developers (Aniche and Gerosa 2010). Conformance to the development approaches is one of the big threats to construct validity that might have materialized in this, and most (if not all) other, experiments on TDD. However, we tried to minimize this threat to validity by supervising the students while they coded and encouraging them to adhere as closely as possible to the development approaches taught during the laboratory.
There are no significant social threats, such as evaluation apprehension: all subjects participated on a voluntary basis in the experiment and were free to drop out of the sessions if they so wished.
External validity relates to the possibility of generalizing the study results beyond the study’s objects and subjects (Wohlin et al. 2012).
The experiment was exposed to the selection threat since we could not randomly select participants from a population; instead, we had to rely on convenience sampling. Convenience sampling is an endemic threat in SE experiments. This issue was taken into account when reporting the results, acknowledging that the findings are only valid for developers with no previous experience in TDD.
Java was used as the programming language both for the experimental sessions and for the acceptance test suites measuring the response variable. This way, we addressed possible threats regarding the use of different programming languages to measure the response variable. However, this limits the validity of our results to this language only.
Three out of the four tasks (MR, BSK, and SDK) used in the experiment were toy greenfield tasks. This affects the generalizability of the results and their applicability in industrial settings. The task domain might not be representative of real-life applications, and the duration of the experiment (2 h and 15 min to perform each task) might have had an impact on the results. We acknowledge that this might be an obstacle to the generalizability of the results outside the artificial setting of a laboratory. We take this into account when reporting our findings, as they are only valid for toy tasks.
We acknowledge the use of students as subjects as a threat to validity; however, this threat was minimized, as they were graduate students close to the end of their educational programs. Even so, this still limits the generalization of our results beyond novice developers.