1 Introduction
2 Background
Sequence Group | Participant ID | Period 1 | Period 2 |
---|---|---|---|
\(SG_{1}\) | j | \(y_{1,1,j} = \mu_{j} + \tau_{A}\) (technique A) | \(y_{2,2,j} = \mu_{j} + \pi + \tau_{B} + \lambda_{A}\) (technique B) |
\(SG_{2}\) | k | \(y_{2,1,k} = \mu_{k} + \tau_{B}\) (technique B) | \(y_{1,2,k} = \mu_{k} + \tau_{A} + \pi + \lambda_{B}\) (technique A) |
3 Goals and Methodology
4 The Non-Standardized Effect Sizes for Crossover Studies and their Variances
4.1 Non-Standardized Effect Sizes of the AB/BA Crossover Model
- \(\tau_{A}\) is the effect of technique A.
- \(\tau_{B}\) is the effect of technique B.
- \(\tau_{AB} = \tau_{A} - \tau_{B}\) is the difference between the effects of technique A and technique B. It is the non-standardized mean technique effect size.
- \(\tau_{BA}\) is the difference between the effects of technique B and technique A, where \(\tau_{BA} = -\tau_{AB}\).
- \(\pi\) is the period effect size, which is the difference between the outcome of using a technique in the first time period and in the second time period.
- \(\lambda_{A}\) is the period by technique interaction due to using technique B after technique A.
- \(\lambda_{B}\) is the period by technique interaction due to using technique A after technique B.
- \(\lambda_{AB} = \lambda_{A} - \lambda_{B} = -\lambda_{BA}\) is the mean period by technique interaction effect size.
- \(\mu_{i}\) is the average outcome for participant i.
- The group of participants that uses technique A first is called sequence group \(SG_{1}\); the group of participants that uses technique B first is called sequence group \(SG_{2}\).
Sequence Group | Participant | Cross-over Difference | Period Difference | Participant Total |
---|---|---|---|---|
\(SG_{1}\) | j | \(\tau_{AB} - \pi - \lambda_{A}\) | \(\pi + \lambda_{A} - \tau_{AB}\) | \(2\mu_{j} + \pi + \tau_{A} + \tau_{B} + \lambda_{A}\) |
\(SG_{2}\) | k | \(\tau_{AB} + \pi + \lambda_{B}\) | \(\tau_{AB} + \pi + \lambda_{B}\) | \(2\mu_{k} + \pi + \tau_{A} + \tau_{B} + \lambda_{B}\) |
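As a worked step (derived here from the table entries, not quoted from the original equations), combining the two sequence groups' crossover differences shows which effects each combination isolates. Half the sum removes the period effect but retains half of the interaction, while half the difference removes the technique effect:

\[
\tfrac{1}{2}\left[(\tau_{AB} - \pi - \lambda_{A}) + (\tau_{AB} + \pi + \lambda_{B})\right] = \tau_{AB} - \tfrac{1}{2}\lambda_{AB}
\]

\[
\tfrac{1}{2}\left[(\tau_{AB} + \pi + \lambda_{B}) - (\tau_{AB} - \pi - \lambda_{A})\right] = \pi + \tfrac{1}{2}\left(\lambda_{A} + \lambda_{B}\right)
\]

\[
\left(2\mu_{j} + \pi + \tau_{A} + \tau_{B} + \lambda_{A}\right) - \left(2\mu_{k} + \pi + \tau_{A} + \tau_{B} + \lambda_{B}\right) = 2(\mu_{j} - \mu_{k}) + \lambda_{AB}
\]

So the technique effect is recovered without bias only when \(\lambda_{AB} = 0\), and the difference in participant totals estimates \(\lambda_{AB}\) only to the extent that the participant averages are balanced across the two sequence groups, as randomization is intended to ensure.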
Sequence Group | Mean crossover Difference | Mean period Difference | Mean participant Total |
---|---|---|---|
\(SG_{1}\) | \(\hat{\tau}_{AB} - \hat{\pi} - \hat{\lambda}_{A}\) | \(\hat{\pi} + \hat{\lambda}_{A} - \hat{\tau}_{AB}\) | \(2\hat{\mu} + \hat{\tau}_{AB} + \hat{\pi} + \hat{\lambda}_{A}\) |
\(SG_{2}\) | \(\hat{\tau}_{AB} + \hat{\pi} + \hat{\lambda}_{B}\) | \(\hat{\tau}_{AB} + \hat{\pi} + \hat{\lambda}_{B}\) | \(2\hat{\mu} + \hat{\tau}_{AB} + \hat{\pi} + \hat{\lambda}_{B}\) |
4.2 Non-Standardized Effect Size Variances and t-tests
4.2.1 The technique effect size variance
- \(\beta_{s,i}\), which is the effect due to participant i in sequence group s, where \(s = SG_{1}\) or \(s = SG_{2}\).
- \(\zeta_{i,s,t}\), which is the within-participant error.
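The relationships below are a sketch under the usual random-intercept reading of these two terms, not a quotation of the original equations. The symbol \(\sigma^{2}_{\beta}\) for the between-participant variance is introduced here for illustration only; \(\sigma^{2}_{w}\) denotes the variance of the within-participant error.

\[
\sigma^{2}_{IG} = \sigma^{2}_{\beta} + \sigma^{2}_{w}, \qquad
\rho = \frac{\sigma^{2}_{\beta}}{\sigma^{2}_{IG}}, \qquad
\sigma^{2}_{w} = (1 - \rho)\,\sigma^{2}_{IG}, \qquad
\sigma^{2}_{diff} = 2\,\sigma^{2}_{w}
\]

These identities agree with the theoretical values used for the simulated data in Section 6.2: with \(\sigma^{2}_{IG} = 25\) and \(\rho = 0.75\) they give \(\sigma^{2}_{w} = 6.25\) and \(\sigma^{2}_{diff} = 12.5\).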
4.2.2 The period by technique interaction effect
4.2.3 Handling non-stable variances and non-normal data
- The least useful option is to base the estimate of \(s^{2}_{IG}\) solely on the \(n_{1}\) participants in the first-period control condition. For crossover designs \(n_{1}\) is likely to be relatively small, so the estimate is likely to be inaccurate.
- Estimate \(s^{2}_{IG}\) and \(s^{2}_{diff}\) allowing the within-cell variances to differ. However, the implications of this approach, such as the relationship between \(s^{2}_{diff}\), \(s^{2}_{IG}\) and \(\hat{\rho}\), are not clear.
- Use a robust, rank-based analysis. This is the most straightforward option and also protects against non-normal data, such as skewed data and/or data with outliers.
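The list above does not say which rank-based procedure is meant. One common choice for AB/BA designs, shown here as an assumption rather than as the authors' method, is to compare the period differences (period 2 minus period 1) of the two sequence groups with a Mann-Whitney (Wilcoxon rank-sum) statistic. The data values and function names below are hypothetical.

```python
from itertools import product


def mann_whitney_u(x, y):
    """U statistic for x versus y: each pair with x_i > y_j counts 1, each tie 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a, b in product(x, y))


def probability_of_superiority(x, y):
    """Share of (x_i, y_j) pairs in which x_i exceeds y_j (ties count one half)."""
    return mann_whitney_u(x, y) / (len(x) * len(y))


# Hypothetical period differences (period 2 - period 1), for illustration only.
period_diff_sg1 = [-3.0, -1.5, -4.0, 0.5]  # technique A first, then B
period_diff_sg2 = [2.0, 3.5, 1.0, -0.5]    # technique B first, then A

u = mann_whitney_u(period_diff_sg2, period_diff_sg1)
ps = probability_of_superiority(period_diff_sg2, period_diff_sg1)
print(u, ps)
```

With no technique effect and equal interaction terms, the probability of superiority should be close to 0.5; values near 0 or 1 point to a technique effect. A p-value can be obtained from the exact permutation distribution of U or from a library routine such as `scipy.stats.mannwhitneyu`.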
5 Standardized Effect Sizes for Crossover Studies and their Variances
5.1 Formulas for the Standardized Effect Sizes
5.2 Choosing the Appropriate Standardized Effect Size
5.3 Standardized Effect Size Variances
5.3.1 The basic principle
5.3.2 Formulas to estimate the medium sample size variance of standardized effect sizes
5.3.3 The approximate variance for large sample sizes
6 Calculating Effect Sizes and their Variances
6.1 Example 1: Scanniello’s Data
ID | Comp_Level.AM | Comp_Level.SC | Comp_Diff | Comp_Sum | SequenceGroup |
---|---|---|---|---|---|
P3 | 0.82 | 0.77 | 0.05 | 1.59 | SG1 |
P4 | 0.60 | 0.70 | − 0.10 | 1.30 | SG2 |
P7 | 0.80 | 0.93 | − 0.13 | 1.73 | SG1 |
P8 | 0.93 | 0.90 | 0.03 | 1.83 | SG2 |
P11 | 0.70 | 0.83 | − 0.13 | 1.53 | SG1 |
P12 | 0.90 | 0.96 | − 0.06 | 1.86 | SG2 |
P15 | 0.67 | 0.83 | − 0.16 | 1.50 | SG1 |
P16 | 0.77 | 0.66 | 0.11 | 1.43 | SG2 |
P19 | 0.80 | 0.70 | 0.10 | 1.50 | SG1 |
P20 | 1.00 | 0.85 | 0.15 | 1.85 | SG2 |
P23 | 0.76 | 0.57 | 0.19 | 1.33 | SG1 |
P24 | 0.87 | 0.66 | 0.21 | 1.53 | SG2 |
Sequence Group | Statistic | AM | SC | CODiff | Participant Total |
---|---|---|---|---|---|
\(SG_{1}\) | Mean | 0.7583 | 0.7717 | − 0.0133 | 1.53 |
\(SG_{1}\) | Variance | 0.0037 | 0.0155 | 0.0214 | 0.0171 |
\(SG_{1}\) | Num Obs | 6 | 6 | 6 | 6 |
\(SG_{2}\) | Mean | 0.845 | 0.7883 | 0.0567 | 1.6333 |
\(SG_{2}\) | Variance | 0.0201 | 0.0173 | 0.0148 | 0.06 |
\(SG_{2}\) | Num Obs | 6 | 6 | 6 | 6 |
Statistic | Equation Number | Value |
---|---|---|
\(\hat{\tau}\) | 18 | 0.0217 |
\(\hat{\pi}\) | 20 | 0.035 |
\(\hat{\lambda}_{AB}\) | 23 | − 0.1033 |
\(s^{2}_{IG}\) | 25 | 0.0142 |
\(s^{2}_{diff}\) | 26 | 0.0181 |
\(s^{2}_{w}\) | 27 | 0.009 |
\(\hat{\rho}\) | 28 | 0.3613 |
\(var(\hat{\tau})\) | 31 | 0.001508 |
\(se_{\hat{\tau}}\) | 32 | 0.03884 |
t | 33 | 0.5581 |
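The numbered equations are not reproduced in this section, so the following Python sketch uses formulas reconstructed from the descriptive statistics (an assumption, not the authors' R code): pooled variances weighted by degrees of freedom, \(s^{2}_{w} = s^{2}_{diff}/2\), \(\hat{\rho} = 1 - s^{2}_{w}/s^{2}_{IG}\), and \(var(\hat{\tau}) = (s^{2}_{diff}/4)(1/n_{1} + 1/n_{2})\). Applied to the participant data of this example, they reproduce the tabulated values to within rounding. All function and variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

# (AM, SC) scores per participant, copied from the data table of this example.
sg1 = [(0.82, 0.77), (0.80, 0.93), (0.70, 0.83), (0.67, 0.83), (0.80, 0.70), (0.76, 0.57)]
sg2 = [(0.60, 0.70), (0.93, 0.90), (0.90, 0.96), (0.77, 0.66), (1.00, 0.85), (0.87, 0.66)]


def pooled_variance(*samples):
    """Pool sample variances, weighting each by its degrees of freedom."""
    numerator = sum((len(s) - 1) * variance(s) for s in samples)
    return numerator / sum(len(s) - 1 for s in samples)


def crossover_summary(group1, group2):
    n1, n2 = len(group1), len(group2)
    diff1 = [a - b for a, b in group1]    # crossover differences, SG1
    diff2 = [a - b for a, b in group2]    # crossover differences, SG2
    total1 = [a + b for a, b in group1]   # participant totals, SG1
    total2 = [a + b for a, b in group2]   # participant totals, SG2
    tau = (mean(diff1) + mean(diff2)) / 2
    s2_diff = pooled_variance(diff1, diff2)
    s2_w = s2_diff / 2
    # Pool the variances of the four sequence-by-technique cells.
    s2_ig = pooled_variance([a for a, _ in group1], [b for _, b in group1],
                            [a for a, _ in group2], [b for _, b in group2])
    var_tau = s2_diff / 4 * (1 / n1 + 1 / n2)
    return {
        "tau": tau,
        "pi": (mean(diff2) - mean(diff1)) / 2,
        "lambda_AB": mean(total1) - mean(total2),
        "s2_IG": s2_ig,
        "s2_diff": s2_diff,
        "s2_w": s2_w,
        "rho": 1 - s2_w / s2_ig,
        "var_tau": var_tau,
        "se_tau": sqrt(var_tau),
        "t": tau / sqrt(var_tau),
    }


summary = crossover_summary(sg1, sg2)
for name, value in summary.items():
    print(f"{name}: {value:.4g}")
```

The t statistic here has \(n_{1} + n_{2} - 2 = 10\) degrees of freedom, the same value used for the small-sample adjustment c(10) later in this section.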
6.2 Simulated Data Example
- There are 15 participants in each sequence group.
- The average outcome across different participants is \(\mu = 50\). We note that many of the papers used effectiveness measures on a scale from 0 to 1, based on the proportion of questions answered correctly (see, for example, Scanniello et al. 2014; Abrahao et al. 2013). We chose a value of 50, which is equivalent to 50% of correct answers, rather than a value between 0 and 1, so the effects would be clearer in the analysis.
- Users of technique 1 achieve an average of 10 units more than users of technique 2, that is, \(\tau = 10\). For a metric scale based on the number of correct answers to 10 questions, this would be equivalent to increasing the number of correct answers by one.
- Users achieve an average of 5 units more in period 2 than in period 1, that is, \(\pi = 5\).
- There is no period by technique interaction effect built into the simulation (i.e., \(\lambda_{AB} = 0\)).
- The variance among participants using a specific technique in a specific time period is \(\sigma^{2} = 25\). This means the variance is unaffected by period or technique.
- The correlation between outcomes for an individual participant is \(\rho = 0.75\). We chose the value 0.75 because Dunlap et al. (1996) reported that such values are to be expected for test-retest reliabilities of psychometrically sound measures. In the software engineering literature, Laitenberger et al. (2001) reported values of r varying from 0.78 to − 0.0216 for the correlation between outcomes from teams. However, it would be reasonable to expect correlations based on individuals to be greater than those based on teams.
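The parameters above can be turned into a data generator in several ways; the following Python sketch is one possibility and is not the authors' simulation code. It assumes normally distributed outcomes, a participant effect with variance \(\rho\sigma^{2}\), a within-participant error with variance \((1-\rho)\sigma^{2}\), and that the technique effect is added to technique 1 only. The function names and the seed values are illustrative.

```python
import random
from statistics import mean, variance


def simulate_crossover(n_per_group=15, mu=50.0, tau=10.0, pi=5.0,
                       sigma2=25.0, rho=0.75, seed=1):
    """Return rows (sequence group, technique 1 outcome, technique 2 outcome)."""
    rng = random.Random(seed)
    sd_between = (rho * sigma2) ** 0.5        # participant-to-participant spread
    sd_within = ((1 - rho) * sigma2) ** 0.5   # within-participant error
    rows = []
    for group in ("SG1", "SG2"):
        for _ in range(n_per_group):
            participant = mu + rng.gauss(0.0, sd_between)
            e1, e2 = rng.gauss(0.0, sd_within), rng.gauss(0.0, sd_within)
            if group == "SG1":   # technique 1 in period 1, technique 2 in period 2
                t1 = participant + tau + e1
                t2 = participant + pi + e2
            else:                # technique 2 in period 1, technique 1 in period 2
                t2 = participant + e1
                t1 = participant + tau + pi + e2
            rows.append((group, t1, t2))
    return rows


def estimate(rows):
    """Estimate tau, pi and the pooled variance of the crossover differences."""
    d1 = [t1 - t2 for g, t1, t2 in rows if g == "SG1"]
    d2 = [t1 - t2 for g, t1, t2 in rows if g == "SG2"]
    s2_diff = (((len(d1) - 1) * variance(d1) + (len(d2) - 1) * variance(d2))
               / (len(d1) + len(d2) - 2))
    return (mean(d1) + mean(d2)) / 2, (mean(d2) - mean(d1)) / 2, s2_diff


rows = simulate_crossover()
print(estimate(rows))
```

With 15 participants per group the estimates vary noticeably from run to run; increasing `n_per_group` brings them close to \(\tau = 10\), \(\pi = 5\) and \(\sigma^{2}_{diff} = 2\sigma^{2}(1-\rho) = 12.5\).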
Sequence Group | Statistic | Technique 1 | Technique 2 | CODiff | Participant Total |
---|---|---|---|---|---|
\(SG_{1}\) | Mean | 61.3772 | 57.8119 | 3.5653 | 119.1891 |
\(SG_{1}\) | Variance | 12.4561 | 11.7601 | 7.7316 | 40.7007 |
\(SG_{1}\) | Num Obs | 15 | 15 | 15 | 15 |
\(SG_{2}\) | Mean | 65.2768 | 51.1486 | 14.1282 | 116.4254 |
\(SG_{2}\) | Variance | 12.2649 | 26.4595 | 14.9214 | 62.5274 |
\(SG_{2}\) | Num Obs | 15 | 15 | 15 | 15 |
Parameter | Sample Estimate | Theoretical Value | Percent Relative Error |
---|---|---|---|
\(\hat{\tau}\) | 8.8467 | 10 | 11.5325 |
\(\hat{\pi}\) | 5.2814 | 5 | − 5.6284 |
\(s^{2}_{IG}\) | 15.7351 | 25 | 37.0595 |
\(s^{2}_{diff}\) | 11.3265 | 12.5 | 9.388 |
\(s^{2}_{w}\) | 5.6632 | 6.25 | 9.388 |
\(\hat{\rho}\) | 0.6401 | 0.75 | 14.6548 |
\(var(\hat{\tau})\) | 0.3775 | 0.4167 | 9.4072 |
\(se_{\hat{\tau}}\) | 0.6145 | 0.6455 | 3.1046 |
t | 14.3978 | 15.4919 | 7.5992 |
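The theoretical column follows from the simulation parameters. The Python sketch below shows the arithmetic under formulas reconstructed from the table values (assumptions, since the numbered equations are not reproduced here): \(\sigma^{2}_{w} = (1-\rho)\sigma^{2}\), \(\sigma^{2}_{diff} = 2\sigma^{2}_{w}\), \(var(\hat{\tau}) = (\sigma^{2}_{diff}/4)(1/n_{1} + 1/n_{2})\), and a percent relative error of \(100 \times (\text{theoretical} - \text{sample})/\text{theoretical}\). Function names are illustrative.

```python
from math import sqrt


def theoretical_values(n1=15, n2=15, tau=10.0, sigma2=25.0, rho=0.75):
    """Theoretical variances, standard error and t value for the AB/BA design."""
    s2_w = (1 - rho) * sigma2          # within-participant variance
    s2_diff = 2 * s2_w                 # variance of a crossover difference
    var_tau = s2_diff / 4 * (1 / n1 + 1 / n2)
    se_tau = sqrt(var_tau)
    return {"s2_w": s2_w, "s2_diff": s2_diff, "var_tau": var_tau,
            "se_tau": se_tau, "t": tau / se_tau}


def percent_relative_error(sample, theoretical):
    """Signed error of the sample estimate, as a percentage of the theoretical value."""
    return 100 * (theoretical - sample) / theoretical


theory = theoretical_values()
print(theory)
print(percent_relative_error(8.8467, 10))     # technique effect
print(percent_relative_error(15.7351, 25))    # independent-groups variance
```

A positive percent relative error means the sample estimate falls below the theoretical value, as for \(\hat{\tau}\); a negative one means it exceeds it, as for \(\hat{\pi}\).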
6.3 Using R to Calculate Non-Standardized Effect Sizes and their Variances
6.3.1 Analyzing Scanniello’s data
ID | TimePeriod | Technique | Comp_Level |
---|---|---|---|
P3 | R1 | AM | 0.82 |
P3 | R2 | SC | 0.77 |
P4 | R1 | SC | 0.70 |
P4 | R2 | AM | 0.60 |
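Mixed-model routines generally expect one row per observation, as in the table above, rather than one row per participant. The R code for this step is not reproduced in this section, so the following Python sketch only illustrates the reshaping. It assumes that participants in SG1 used AM in the first period and participants in SG2 used SC first, which is consistent with the rows shown for P3 and P4; the function name is hypothetical.

```python
def to_long(wide_rows):
    """Convert (ID, Comp_Level.AM, Comp_Level.SC, SequenceGroup) rows to
    (ID, TimePeriod, Technique, Comp_Level) rows, two per participant."""
    long_rows = []
    for pid, am, sc, group in wide_rows:
        if group == "SG1":      # assumed order: AM in period R1, SC in period R2
            first, second = ("AM", am), ("SC", sc)
        else:                   # assumed order: SC in period R1, AM in period R2
            first, second = ("SC", sc), ("AM", am)
        long_rows.append((pid, "R1", *first))
        long_rows.append((pid, "R2", *second))
    return long_rows


wide = [("P3", 0.82, 0.77, "SG1"), ("P4", 0.60, 0.70, "SG2")]
for row in to_long(wide):
    print(row)
```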
6.3.2 Analyzing the simulated data
6.4 Calculating Standardized Effect Sizes and their Variances
Effect Size | Scanniello data Estimate (SC-AM) | Simulation data Estimate (T2-T1) | Theoretical Value | Percent Relative Error |
---|---|---|---|---|
\(d_{RM}\) | − 0.2278 | − 3.7175 | − 4 | 7.0625 |
\(d_{IG}\) | − 0.183 | − 2.2268 | − 2 | − 11.3384 |
Effect Size | Adjustment Scanniello Data c(10) | Scanniello data Revised Estimate | Adjustment Sim Data c(28) | Simulation Revised Estimate |
---|---|---|---|---|
\(g_{RM}\) | 0.9231 | − 0.2103 | 0.973 | − 3.617 |
\(g_{IG}\) | 0.9231 | − 0.169 | 0.973 | − 2.1666 |
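The tabulated values are consistent with standardizing the technique effect by \(s_{w}\) for \(d_{RM}\) and by \(s_{IG}\) for \(d_{IG}\), and with the common small-sample approximation \(c(df) = 1 - 3/(4\,df - 1)\). Both are assumptions here, since the corresponding equations are not reproduced in this section. The Python sketch below applies them to the rounded simulation estimates reported earlier; function names are illustrative.

```python
from math import sqrt


def hedges_c(df):
    """Approximate small-sample adjustment factor for a standardized effect size."""
    return 1 - 3 / (4 * df - 1)


def standardized_effects(tau_hat, s2_w, s2_ig, df):
    """Repeated-measures and independent-groups effect sizes, raw and adjusted."""
    d_rm = tau_hat / sqrt(s2_w)     # standardized by the within-participant SD
    d_ig = tau_hat / sqrt(s2_ig)    # standardized by the independent-groups SD
    c = hedges_c(df)
    return {"d_RM": d_rm, "d_IG": d_ig, "c": c, "g_RM": c * d_rm, "g_IG": c * d_ig}


# Simulation estimates (T2 - T1, hence the negative sign), df = 15 + 15 - 2.
sim = standardized_effects(tau_hat=-8.8467, s2_w=5.6632, s2_ig=15.7351, df=28)
print(sim)
print(hedges_c(10))   # adjustment for Scanniello's data, df = 6 + 6 - 2
```

With these rounded inputs the repeated-measures values match the tables, while \(d_{IG}\) and \(g_{IG}\) land within about 0.005 of the tabulated estimates.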
Statistic | Equation number | Scanniello data | Simulation data |
---|---|---|---|
\(var(d_{RM})\) | 49 | 0.2117 | 0.3412 |
\(var(d_{IG})\) | 51 | 0.1366 | 0.1224 |
\(var(g_{RM})\) | 50 | 0.1804 | 0.3231 |
\(var(g_{IG})\) | 52 | 0.1164 | 0.1159 |
\(var(d_{RM})_{Approx}\) | 53 | 0.1699 | 0.3318 |
PRE \(var(d_{RM})_{Approx}\) | 57 | 19.7557% | 2.763% |
\(var(d_{IG})_{Approx}\) | 55 | 0.1096 | 0.1191 |
PRE \(var(d_{IG})_{Approx}\) | 57 | 19.7557% | 2.763% |
\(var(g_{RM})_{Approx}\) | 54 | 0.1689 | 0.3003 |
PRE \(var(g_{RM})_{Approx}\) | 57 | 6.3835% | 7.0461% |
\(var(g_{IG})_{Approx}\) | 56 | 0.109 | 0.1077 |
PRE \(var(g_{IG})_{Approx}\) | 57 | 6.3835% | 7.0461% |
7 Discussion
7.1 Impact of Incorrect Analysis on Effect Sizes and their Variances
Statistic | Scanniello’s data | Simulated data |
---|---|---|
\(s^{2}_{IG}\) | 0.0142 | 15.7351 |
\(s^{2}_{IGbiased}\) | 0.01393 | 22.9002 |
\(d_{IG}\) | − 0.183 | − 2.2268 |
\(d_{IGbiased}\) | − 0.1835 | − 1.849 |
\(c(df)\) | 0.9231 | 0.973 |
\(c(df_{Wrong})\) | 0.9655 | 0.9870 |
\(g_{IG}\) | − 0.169 | − 2.1666 |
\(g_{IGbiased}\) | − 0.177 | − 1.8247 |
\(var(g_{IG})\) | 0.1164 | 0.1159 |
\(var(g_{IGBiased})\) | 0.1709 | 0.09720 |
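The table does not state how the biased values were computed. One reading that reproduces the Scanniello column for \(s^{2}_{IGbiased}\), \(c(df_{Wrong})\) and \(g_{IGbiased}\) is to treat the AM and SC scores as two independent groups of 12 observations each, ignoring sequence group and period, which gives \(12 + 12 - 2 = 22\) degrees of freedom. The Python sketch below illustrates that reading as an assumption; the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

# Scores for all 12 participants (P3, P4, P7, ..., P24), from the data table in Section 6.1.
am = [0.82, 0.60, 0.80, 0.93, 0.70, 0.90, 0.67, 0.77, 0.80, 1.00, 0.76, 0.87]
sc = [0.77, 0.70, 0.93, 0.90, 0.83, 0.96, 0.83, 0.66, 0.70, 0.85, 0.57, 0.66]


def naive_independent_groups(a, b):
    """Analyze crossover data as if a and b were unrelated groups (incorrect)."""
    df = len(a) + len(b) - 2
    # Pooled variance that ignores sequence group, period and participant.
    s2 = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / df
    d = (mean(b) - mean(a)) / sqrt(s2)   # SC - AM, matching the sign used above
    c = 1 - 3 / (4 * df - 1)             # small-sample adjustment with inflated df
    return {"df": df, "s2": s2, "d": d, "c": c, "g": c * d}


naive = naive_independent_groups(am, sc)
print(naive)
```

Because this analysis counts each participant twice, the degrees of freedom are overstated (22 rather than 10), so the small-sample adjustment is weaker than it should be.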
7.2 Standardized Effect Sizes and their Variances
7.3 Implications for Planning Experiments
7.4 Implications for Meta-Analysis
7.5 Non-Normality and Unstable Variances
8 Conclusions
- To provide equations for non-standardized and standardized effect sizes. We explain the need for two different types of standardized effect size, one for the repeated measures design and one that would be equivalent to an independent groups design.
- To provide formulas for both the small sample size effect size variance and the medium sample size approximation to the effect size variance, for both types of standardized effect size.
- To explain how the different effect sizes can be obtained either from standard descriptive statistics or from information provided by the linear mixed model package lme4 in R.

- Previous research has suggested that \(\rho\) is greater than zero and preferably greater than 0.25.
- There is either strong theoretical argument, or empirical evidence from a well-powered study, that the period by technique interaction is negligible.