1 Introduction
2 Implementation
2.1 Data Retrieval and Processing
2.2 On-screen Dashboards on Kibana
-
Organizational performance (number of pull requests or issues still open, time to close pull requests or issues).
-
Project performance (number of pull requests or issues still open, time to close pull requests or issues).
-
Relationship between time to answer pull requests or issues and the time to close them.
-
Comparison of organizations contributing to a project by their pull requests or issues closing time, or existing open backlog (what they have opened).
2.3 VR Dashboard on BabiaXR
BabiaXR is built on top of A-Frame, which extends HTML with new entities, allowing 3D scenes to be built as if they were HTML documents, using techniques common to any front-end web developer. A-Frame is built on top of Three.js, which uses the WebGL API available in all modern browsers. A BabiaXR scene is, therefore, an HTML document. It includes one or more of these data retrieval components (babia-queryes), and some other BabiaXR components that consume the retrieved data by building 3D visualizations, such as babia-barsmap, babia-bars, babia-pie, and babia-doughnut.
2.4 Visualizations
-
At the level of the entire analyzed project, showing the total number of items open and closed for the entire project, the total number of items open per organization, the median number of days open and the 80th percentile of days open for all items, the evolution of the total number of open and closed items over time, and the evolution of the number of submitters over time.
-
At the organization level, showing the number of items per organization, the number of submitters per organization, the number of assignees per organization and the average number of days open for items per organization.
-
At the submitter level, showing the number of items per submitter, the number of repositories per submitter and the average number of days open for items per submitter.
-
At the repository level, showing the number of items per repository, the number of submitters per repository, the number of assignees per repository, and the average number of days open for items per repository.
-
At the subproject level (subprojects within the general project), showing the number of items for each subproject, the number of submitters per subproject, the number of repositories per subproject, and the average number of days open for items per subproject.
3 Experiment
3.1 Goal
3.2 Research Questions and Hypothesis
RQ: “Is comprehension of software development processes, via the visualization of their metrics, at least as good when presented in VR scenes as on 2D screens?”
This question tests the hypothesis that presenting visualizations in VR, where the available space is much more abundant (visualizations can be placed all around the viewer, at different heights), will allow for a better and faster understanding. The hypothesis is disputable, because some factors work against it, such as the difficulties that perspective and distance may cause for the adequate perception of magnitudes. We refine it into two research questions:
RQ1: “Do the answers obtained in VR provide similar correctness compared to those obtained on-screen?”
RQ2: “Do the answers obtained in VR provide similar time to completion compared to those obtained on-screen?”
Correctness is measured by comparing the difference between the answer provided and the right answer; the time spent in answering is measured as the period needed to produce the answer, in seconds.
3.3 Participants
Participants reported their experience in programming (i.e., Exp PRG), in using data visualization tools (i.e., Exp Dataviz), with the CHAOSS project (i.e., Exp CHAOSS), with the OpenShift project (i.e., Exp OPENSHIFT), in software visualization (i.e., Exp Softvis), and in using VR devices (i.e., Exp VR), as self-declared in an interview. Figure 11 summarizes the years of experience in programming (i.e., Exp PRG) and in using data visualization tools (i.e., Exp Dataviz).
3.4 Datasets and Tools
3.5 Variables
-
Independent variable: Name: Group Description: Group that the subject belongs to. Scale: Categorical: “First in VR with CHAOSS project”, “First in on-screen with CHAOSS project”, “First in VR with OpenShift project” or “First in on-screen with OpenShift project”. Operationalization: Subject answers after interacting with visualizations on a 2-D screen, or after interacting with visualizations immersed in virtual reality.
-
Dependent variable “Time to complete”: Name: TimeToComplete Description: Time to complete the task Scale: Integer (seconds) Operationalization: Number of seconds from the moment the subject states that the task starts, to the moment the subject states that the task is done.
-
Dependent variable “Correctness”: Name: Correctness Description: Normalized value in the range 0–1, with 1 being “completely correct”, and 0 being “completely incorrect”. Scale: Float, range 0–1 Operationalization: Operationalized for each task, according to the correct answer and the specifics of the answer.
 | CHAOSS | OpenShift
---|---|---
Issues and pull requests | 9000+ | 215000+ |
Submitters | 650+ | 7000+ |
Assignees | 60+ | 1500+ |
Organizations | 30+ | 80+
Repositories | 20+ | 350+
-
Confounding variable “Project”: Name: Project Description: Project from which the data used in the task comes from. Scale: Categorical: “CHAOSS”, “OpenShift”. Operationalization: Determination by inspection of the specific task performed by the subject.
-
Confounding variable “Period”: Name: Period Description: Period in which the subject performed the task (all subjects perform tasks in the first round in one environment, with one project, and then in the other environment with the other project). Scale: Categorical: “First”, “Second”. Operationalization: Determination by inspection of the specific task performed by the subject.
-
Confounding variable “Environment”: Name: Environment Description: Environment (setting) in which the question was answered by the subject Scale: Categorical: “on-screen” or “VR”. Operationalization: Subject answers after interacting with visualizations on a 2-D screen, or after interacting with visualizations immersed in virtual reality.
-
Confounding variable “Experience with Kibana ”: Name: ExpKibana Description: Overall experience with Kibana dashboards Scale: Categorical: “None”, “Beginner”, “Knowledgeable”, “Advanced”, “Expert”. Operationalization: Self-estimation by the subject, via a question in the demographics survey, with the categories as possible answers.
-
Confounding variable “Experience in data visualization”: Name: ExpDataviz Description: Overall experience in using data visualization tools Scale: Categorical: “None”, “Beginner”, “Knowledgeable”, “Advanced”, “Expert” Operationalization: Self-estimation by the subject, via a question in the demographics survey, with the categories as possible answers.
-
Confounding variable “Experience with VR”: Name: ExpVR Description: Overall experience with VR devices Scale: Categorical: “None”, “Beginner”, “Knowledgeable”, “Advanced”, “Expert” Operationalization: Self-estimation by the subject, via a question in the demographics survey, with the categories as possible answers.
-
Confounding variable “Experience in programming”: Name: ExpPRG Description: Overall experience in programming Scale: Integer (years) Operationalization: Self-estimation by the subject, via a question in the demographics survey, with the number of years of experience as answer.
-
Confounding variable “Experience with CHAOSS”: Name: ExpCHAOSS Description: Overall experience with the CHAOSS project Scale: Categorical: “None”, “Beginner”, “Knowledgeable”, “Advanced”, “Expert” Operationalization: Self-estimation by the subject, via a question in the demographics survey, with the categories as possible answers.
-
Confounding variable “Experience with OpenShift”: Name: ExpOpenShift Description: Overall experience with the OpenShift project Scale: Categorical: “None”, “Beginner”, “Knowledgeable”, “Advanced”, “Expert” Operationalization: Self-estimation by the subject, via a question in the demographics survey, with the categories as possible answers.
-
Confounding variable “Position”: Name: JobPosition Description: Job position of the subject. Scale: Categorical: “Practitioner”, “Academic”, “Student” Operationalization: Self-declaration by the subject, via a question in the demographics survey, with open text as answer, which is later mapped to one of the categories.
-
Confounding variable “Gender”: Name: Gender Description: Self-perceived gender of the subject. Scale: Categorical: “Male”, “Female”, “Other” Operationalization: Self-declaration by the subject, via a question in the demographics survey, with open text as answer, which is later mapped to one of the categories.
-
Confounding variable “Age”: Name: Age Description: Age of the subject. Scale: Integer (years) Operationalization: Self-declaration by the subject, via a question in the demographics survey, with the number of years of age as answer.
3.6 Training
-
Interaction. In both trainings the participant learned how to interact with the visualizations, in order to obtain the maximum information from them (e.g., pointing to the visualization). In the case of the on-screen training, the limitations of the use of Kibana were also explained, avoiding features that the BabiaXR dashboard does not provide (e.g., field filtering by clicking).
-
Movement. In both trainings the participant learned how to move around the dashboard. In the case of the VR environment, the participant learned how to walk and how to use the teleport feature to move between the visualizations. In the case of the on-screen environment, the participant learned how to move through the dashboards and how to reach the visualizations that are not visible at first sight.
-
Time range switch. In both trainings the participant learned how to change the time range for the data, an important action for the experiment. In the case of the VR environment, the participant learned how to change the time range using the controllers, by hiding/showing the corresponding options and clicking on them. In the case of the on-screen environment, the participant learned where the time range option is located, and was told which options were permitted in the experiment (Kibana offers more time range options than BabiaXR).
Task | Task description & purpose | Category |
---|---|---|
T\(_{1}\) | Description. During the LAST YEAR tell me: The name of the TOP 3 ORGANIZATIONS by number of issues. For each of those 3 organizations, the NUMBER of pull requests for the same period. | Correlation |
Purpose. Identify the most important organizations of the project in terms of the number of pull requests and issues, and identify whether the correlation between issues and pull requests is meaningful at the organization level. | |
T\(_{2}\) | Description. For issues, during the LAST 90 DAYS, tell me: The number of opened and closed and when is the higher time open? (time open as median in days) For pull request, during the LAST 90 DAYS, tell me: The number of opened and closed and when is the higher time open? (time open as median in days) | Analysis |
Purpose. The quarter is a common measurement period, so the purpose is to know the number of issues and pull requests of the entire project in that quarter, identifying the point where the time open was highest. | |
T\(_{3}\) | Description. During the LAST 5 YEARS: For pull requests submitters, who are the top three?. For each of them, for how long their issues stayed open (days on average) | Correlation |
Purpose. Identify the most important submitters of the project in terms of the number of pull requests, and identify whether the correlation between pull request submissions and the time that their issues remain open is meaningful at the submitter level. | |
T\(_{4}\) | Description. During the LAST 2 YEARS tell me: The name of the TOP 3 REPOSITORIES by number of pull requests SUBMITTERS. The name of the TOP 3 REPOSITORIES by number of issues. | Analysis |
Purpose. Identify the most important repositories of the project in terms of the number of pull request submitters and issues, and identify which repositories are receiving the most activity in terms of distinct submitters. Compare the results with the repositories with the most issues. | |
T\(_{5}\) | Description. During the LAST 6 MONTHS, tell me: The name of the TOP 3 SUBMITTERS by the longest time to resolve their issues (average days open). For each of those 3 submitters, the NUMBER of pull requests submitted. | Correlation |
Purpose. Identify the core of the community in the last 6 months, identify the users whose issues stayed open the longest, and compare this with the number of pull requests that they submitted.
3.7 Tasks and Data Collection
-
Answers. To provide answers to questions in each task, all participants had to speak aloud. For each task, participants were also asked to assess the level of difficulty (five levels, from “Strongly Disagree” to “Strongly Agree”).
-
Efficiency. The supervisor tracked the time that each participant spent on each task. For each task, the participant notified the supervisor of both the start and finish moments.
-
Correctness. After running the experiment, the supervisor checked the answers to the tasks, comparing them with the correct values. This check was validated by one of the authors of the paper.
ID | Description |
---|---|
S\(_{1}\) | In what environment did you find it easier to complete the tasks? Why? |
S\(_{2}\) | In what environment has it taken you the shortest to complete tasks (what is your feeling)? Why? |
S\(_{3}\) | Tell us which parts of each environment are useful to answer the tasks, and tell us which parts make it more difficult to answer the tasks. (Advantages and disadvantages) |
C\(_{1}\) | Overall, did you find the experiment difficult? (choose one: strongly agree, agree, don’t know, disagree, strongly disagree) Please explain. |
C\(_{2}\) | Do you have any suggestions or comments? |
4 Changes from the Registered Report
-
Dashboards. In the registered report we defined a set of five Kibana dashboards developed by Bitergia that analyze software development processes, which we intended to use for the experiment. In the end, for this study, we decided to focus only on two dashboards of that set, those designed to explore the timing of pull requests and issues. When designing the details of the execution, we realized that using all five dashboards meant a longer experiment, and the risk of spreading the analysis too thinly over a large number of aspects of the project. Instead, we decided to reduce the number of dashboards and focus the study on the analysis of two similar and usually related software development processes: the timing of pull requests and issues.
-
Tasks. In the registered report we stated that we would define a set of tasks, and each subject would be presented with half those tasks in one environment (on-screen or VR), and the other half in the other, with data from different projects in each environment. All tasks would be different, and a single subject would not repeat a task in both environments. However, when defining the final version of the experiment, we decided not to divide the set of tasks, but to repeat them for each subject in the second environment, with data from a different project. The main reason is that this way we can analyze in depth if having performed tasks in one system serves as learning when repeating it in the other environment, and thus has some impact on performance. In addition, we decided that having two different sets of tasks could be detrimental when comparing results for a sample of participants that is not very large (in our case, 32 participants).
-
Projects. Even though we originally planned to use data from two similar projects, in the end we decided to have two very different cases. The main reason was that, without interfering with the results, this would allow us to analyze differences (if any) due to the different characteristics of the projects. All participants are assigned to both projects in a random order, and in random environments, so we can still analyze the overall impact of both environments, which is the main aim of the study. However, we can also control for differences in performance due to the nature of the projects. In particular, we wanted to check if the very different amount of data to visualize for each of the projects produced any measurable difference.
-
Correctness dependent variable. In the registered report, we defined “Error” as one of the two dependent variables. However, when designing the final version of the study, we decided to study correctness instead of error. In the end, both variables capture the same characteristic (how good the answer provided by the subject is, when compared with the correct answer). But we found that the analysis seemed more natural when mapping answers to a scale in the range 0–1, according to how close they were to the true value, than when estimating the error.
-
Confounding variables. For the final version of the analysis, we performed a more detailed analysis of confounding variables, and we decided to make some changes to the list proposed in the registered report. On the one hand, we substituted “Experience in software development” with a broader variable, “Experience in programming”, which should capture the same kind of abilities but is better suited to the expected demographics of our participants. On the other hand, we added several new confounding variables that we thought could have an impact:
-
Variables related to the experiment: “Project” and “Round”. This way, we could analyze the impact of the characteristics of the project, and the possible “learning effect”.
-
Variables related to previous experience: In addition to “Experience with Kibana” and “Experience with VR”, already present in the registered report, and “Experience in programming” (substituting “Experience in software development”), we included “Experience with data visualization” (to check more broadly for experience with tools related to the experiment), and “Experience with CHAOSS” and “Experience with OpenShift” (to discard the bias of subjects that could already be familiar with the data shown in the experiment, or with the underlying projects).
-
Demography variables: In addition to “Position”, already present in the registered report, we added “Gender” (although we could not analyze it, because we lack enough diversity in subjects for this dimension), and “Age”, as potential causes of bias.
-
Analysis of confounding variables. Instead of presenting a separate analysis for all confounding variables, we focused on “Project” and “Round”, because we considered that they could be the most determinant for the results, given the way the experiment was designed. Therefore, in Section 5 (devoted to presenting results), the first sections analyze the dependent variables in the context of these two confounding variables, in great detail. The rest of the confounding variables are analyzed together, more briefly, in Section 5.3.
-
Training. In the original report we designed a process that included some training for subjects before they performed the tasks in the experiment. However, when designing the final version, we decided to put more emphasis on this training, making it more specific and a bit longer than initially intended. The reason was our experience in other experiments, where we learned the importance of letting people understand the basic mechanisms needed to perform the tasks, so that we could exclude most of the learning curve from the experiment itself.
5 Results
5.1 Correctness (\(\textbf{RQ}_{1}\))
-
0: If the time range is wrongly selected (all tasks require the subject to correctly select a time range).
-
1: If everything is correct.
-
\(\textbf{T}_{1}\) asks to report the top three organizations by number of issues, and then for those organizations the number of pull requests submitted. Values for correctness:
-
0.5: If all three organizations are identified correctly, but the number of pull requests is wrong, or if at least one of the organizations was wrong, but the number of pull requests for the identified organizations is correct.
-
\(\textbf{T}_{2}\) asks to report four items: the number of opened and closed issues (1), and when is the highest value in the “time open as median in days” visualization (2), and then the same for pull requests instead of issues (3,4). Values for correctness:
-
0.25: If only one item is correct.
-
0.5: If two items are correct.
-
0.75: If three items are correct.
-
\(\textbf{T}_{3}\) asks first to report the top three submitters by number of pull requests during a certain period, and then for those submitters the number of days on average that their issues stayed open during the period. Values for correctness:
-
0.5: If all three submitters are correctly reported but the number of days open for their issues was wrong, or if at least one identified submitter is wrong but the number of days open for the issues of the identified submitters is correct.
-
\(\textbf{T}_{4}\) first asks to report the top three repositories by number of pull request submitters during a certain period, and then the top three repositories by number of issues during the same period. Values for correctness:
-
0.5: If all three repositories by number of pull request submitters were correctly reported, but the top three repositories by number of issues were wrong, or the other way around.
-
\(\textbf{T}_{5}\) asked to report the top three submitters by the longest time to resolve their submitted issues during a certain period, and then for those submitters, the number of pull requests submitted during the same period. Values for correctness:
-
0.5: If the three submitters were correctly reported but their number of pull requests was wrong, or the other way around.
A sketch of how these grading rules can be encoded is shown below.
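The sketch below is illustrative only: it is not the grading script used in the study, and the answer encoding (flags for the two halves of a task, an item count for T2) is an assumption introduced for illustration.

```python
from dataclasses import dataclass

@dataclass
class TaskAnswer:
    """Graded outcome of one task for one subject (hypothetical encoding)."""
    time_range_ok: bool        # was the required time range selected correctly?
    items_correct: int = 0     # number of correct items (used for T2, 0..4)
    part_a_ok: bool = False    # first half of the task (e.g., the top-3 list)
    part_b_ok: bool = False    # second half of the task (e.g., the associated counts)

def correctness(task: str, a: TaskAnswer) -> float:
    """Map an answer to the 0-1 correctness scale described above."""
    if not a.time_range_ok:
        return 0.0                    # wrong time range: 0 for every task
    if task == "T2":
        return a.items_correct / 4.0  # four items, 0.25 each
    # T1, T3, T4, T5: two halves, 0.5 each
    return 0.5 * a.part_a_ok + 0.5 * a.part_b_ok

# Example: T1 with the three organizations right but the PR counts wrong -> 0.5
print(correctness("T1", TaskAnswer(time_range_ok=True, part_a_ok=True, part_b_ok=False)))
```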
5.1.1 Results by Group of Subjects
Effect size | Cohen's d
---|---|
Small | |d| \(\approx \) 0.2 |
Medium | |d| \(\approx \) 0.5 |
Large | |d| \(\approx \) 0.8 or higher |
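For reference, the following is a minimal sketch of how Cohen's d can be computed for two groups of per-subject scores, using the pooled standard deviation; it is not the analysis script of this study, and the sample values are purely illustrative.

```python
import numpy as np

def cohens_d(x: np.ndarray, y: np.ndarray) -> float:
    """Cohen's d computed with the pooled standard deviation of the two samples."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# Illustrative data only: correctness averages for a group and the reference group
group = np.array([0.70, 0.75, 0.65, 0.72, 0.68, 0.74, 0.71, 0.69])
reference = np.array([0.78, 0.80, 0.76, 0.79, 0.81, 0.77, 0.82, 0.75])
print(f"Cohen's d = {cohens_d(group, reference):.2f}")  # |d| >= 0.8 reads as a large effect
```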
Group | Coefficient (P1 AVG) | p-value (P1 AVG) | Eff. size (P1 AVG) | Coefficient (P2 AVG) | p-value (P2 AVG) | Eff. size (P2 AVG)
---|---|---|---|---|---|---
P1 SC Openshift - P2 VR Chaoss | − 0.076 | 0.024 | − 1.59 | − 0.066 | 0.025 | − 1.58
P1 VR Chaoss - P2 SC Openshift | − 0.054 | 0.036 | − 1.48 | − 0.036 | 0.069 | − 1.28
P1 VR Openshift - P2 SC Chaoss | − 0.016 | 0.667 | − 0.30 | − 0.016 | 0.551 | − 0.42
-
Average on Round 1. For P1 SC Openshift - P2 VR Chaoss, the coefficient of -0.076 indicates that, on average, P1_AVG for this group is 0.076 units lower than for the reference group; this difference is statistically significant (p < 0.05). Similarly, for P1 VR Chaoss - P2 SC Openshift, the coefficient of -0.054 indicates that P1_AVG is, on average, 0.054 units lower than for the reference group, a difference that is also statistically significant (p < 0.05). On the other hand, for P1 VR Openshift - P2 SC Chaoss, the coefficient of -0.016 shows no statistically significant difference in P1_AVG with respect to the reference group (p > 0.05). The absolute values of Cohen's d for “P1 SC Openshift - P2 VR Chaoss vs P1 SC Chaoss - P2 VR Openshift” (-1.5946) and “P1 VR Chaoss - P2 SC Openshift vs P1 SC Chaoss - P2 VR Openshift” (-1.4845) indicate large effect sizes, suggesting substantial differences between these groups in terms of P1_AVG. In contrast, Cohen's d for “P1 VR Openshift - P2 SC Chaoss vs P1 SC Chaoss - P2 VR Openshift” (-0.3041) indicates a small effect size, that is, a smaller and less substantial difference between these groups. Overall, these results show statistically significant differences in P1_AVG between two of the groups and the reference group; the effect sizes indicate that the differences are particularly notable for the P1 SC Openshift - P2 VR Chaoss and P1 VR Chaoss - P2 SC Openshift groups, while the difference for P1 VR Openshift - P2 SC Chaoss is comparatively small.
-
Average on Round 2. For P1 SC Openshift - P2 VR Chaoss, the coefficient of -0.066 indicates that, on average, P2_AVG for this group is 0.066 units lower than for the reference group; this difference is statistically significant (p < 0.05). For P1 VR Chaoss - P2 SC Openshift, the coefficient of -0.036 indicates that P2_AVG is, on average, 0.036 units lower than for the reference group; although the p-value (0.069) is slightly above the significance threshold (p \(=\) 0.05), it is still worth noting, as it approaches statistical significance. For P1 VR Openshift - P2 SC Chaoss, the coefficient of -0.016 shows no statistically significant difference in P2_AVG with respect to the reference group (p > 0.05). The absolute values of Cohen's d for “P1 SC Openshift - P2 VR Chaoss vs P1 SC Chaoss - P2 VR Openshift” (-1.5858) and “P1 VR Chaoss - P2 SC Openshift vs P1 SC Chaoss - P2 VR Openshift” (-1.2854) indicate large effect sizes, suggesting substantial differences between these groups in terms of P2_AVG, while Cohen's d for “P1 VR Openshift - P2 SC Chaoss vs P1 SC Chaoss - P2 VR Openshift” (-0.4211) indicates a moderate effect size, that is, a smaller but still noticeable difference. Overall, these results show statistically significant differences in P2_AVG between the groups of participants and the reference group, particularly notable for the P1 SC Openshift - P2 VR Chaoss and P1 VR Chaoss - P2 SC Openshift groups, while the difference for P1 VR Openshift - P2 SC Chaoss is smaller but still observable. A sketch of how such per-group coefficients and effect sizes can be computed follows.
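The sketch regresses the outcome on group membership, with the reference group as baseline, using statsmodels; the dataframe, its column names, and the values are assumptions for illustration, not the data or scripts of the study. The same approach, combined with the Cohen's d computation sketched earlier, applies to the completion-time analysis in Section 5.2.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-subject data: group assignment and average correctness in Round 1
df = pd.DataFrame({
    "Group": ["P1 SC Chaoss - P2 VR Openshift"] * 8 +   # reference group
             ["P1 SC Openshift - P2 VR Chaoss"] * 8 +
             ["P1 VR Chaoss - P2 SC Openshift"] * 8 +
             ["P1 VR Openshift - P2 SC Chaoss"] * 8,
    "P1_AVG": [0.80, 0.78, 0.82, 0.79, 0.81, 0.77, 0.80, 0.83,
               0.72, 0.70, 0.74, 0.71, 0.73, 0.69, 0.72, 0.75,
               0.74, 0.73, 0.76, 0.72, 0.75, 0.71, 0.74, 0.77,
               0.79, 0.77, 0.80, 0.78, 0.81, 0.76, 0.79, 0.82],
})

# OLS with the reference group as baseline: each coefficient is the difference in
# mean P1_AVG between that group and the reference group, with its p-value.
model = smf.ols(
    "P1_AVG ~ C(Group, Treatment(reference='P1 SC Chaoss - P2 VR Openshift'))",
    data=df,
).fit()
print(model.summary())
```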
Group | Coefficient (TOTAL AVG) | p-value | Eff. size
---|---|---|---
P1 SC Openshift - P2 VR Chaoss | − 0.072 | 0.017 | − 1.68
P1 VR Chaoss - P2 SC Openshift | − 0.047 | 0.021 | − 1.63
P1 VR Openshift - P2 SC Chaoss | − 0.017 | 0.665 | − 0.30
5.2 Completion Time (\(\textbf{RQ}_{2}\))
5.2.1 Results by Group of Subjects
Group | Coefficient (P1 TOTAL) | p-value (P1 TOTAL) | Eff. size (P1 TOTAL) | Coefficient (P2 TOTAL) | p-value (P2 TOTAL) | Eff. size (P2 TOTAL)
---|---|---|---|---|---|---
P1 SC Openshift - P2 VR Chaoss | 152.625 | 0.091 | 1.19 | 137.500 | 0.005 | 1.99
P1 VR Chaoss - P2 SC Openshift | 89.750 | 0.316 | 0.71 | 47.125 | 0.476 | 0.50
P1 VR Openshift - P2 SC Chaoss | 163.750 | 0.061 | 1.32 | 43.375 | 0.518 | 0.45
-
Total times on Round 1. For P1 SC Openshift - P2 VR Chaoss, the coefficient of 152.625 indicates that the P1 TOTAL time for this group is 152.625 units higher than for the reference group; however, this difference is not statistically significant (p > 0.05). For P1 VR Chaoss - P2 SC Openshift, the coefficient of 89.750 indicates that the P1 TOTAL time is 89.750 units higher than for the reference group; again, this difference is not statistically significant (p > 0.05). For P1 VR Openshift - P2 SC Chaoss, the coefficient of 163.750 indicates that the P1 TOTAL time is 163.750 units higher than for the reference group; although this difference shows a trend towards significance (p \(=\) 0.061), it does not reach the conventional threshold for statistical significance. The absolute values of Cohen's d for “P1 SC Openshift - P2 VR Chaoss vs P1 SC Chaoss - P2 VR Openshift” (1.1969) and “P1 VR Openshift - P2 SC Chaoss vs P1 SC Chaoss - P2 VR Openshift” (1.3249) indicate large effect sizes, pointing to observable differences between these groups in terms of P1 TOTAL times, while Cohen's d for “P1 VR Chaoss - P2 SC Openshift vs P1 SC Chaoss - P2 VR Openshift” (0.7092) indicates a medium effect size, suggesting a relatively smaller difference. Overall, these results suggest that there may be some differences in P1 TOTAL times between the groups of participants and the reference group, but the statistical significance of these differences is limited, with only P1 VR Openshift - P2 SC Chaoss showing a trend towards significance.
-
Total times on Round 2. For P1 SC Openshift - P2 VR Chaoss, the coefficient of 137.500 indicates that the P2 TOTAL time for this group is 137.500 units higher than for the reference group; this difference is statistically significant (p < 0.05), indicating a significant effect of this group on the P2 TOTAL time. For P1 VR Chaoss - P2 SC Openshift, the coefficient of 47.125 indicates that the P2 TOTAL time is 47.125 units higher than for the reference group; however, this difference is not statistically significant (p > 0.05). For P1 VR Openshift - P2 SC Chaoss, the coefficient of 43.375 indicates that the P2 TOTAL time is 43.375 units higher than for the reference group; again, this difference is not statistically significant (p > 0.05). The value of Cohen's d for “P1 SC Openshift - P2 VR Chaoss vs P1 SC Chaoss - P2 VR Openshift” (1.9909) indicates a large effect size, that is, a substantial difference between these groups in terms of P2 TOTAL time, while the values for “P1 VR Chaoss - P2 SC Openshift vs P1 SC Chaoss - P2 VR Openshift” (0.5039) and “P1 VR Openshift - P2 SC Chaoss vs P1 SC Chaoss - P2 VR Openshift” (0.4570) indicate smaller effect sizes and thus relatively smaller differences. In summary, there is a statistically significant difference in P2 TOTAL time, with a large effect size, between the group P1 SC Openshift - P2 VR Chaoss and the reference group P1 SC Chaoss - P2 VR Openshift, while the differences for P1 VR Chaoss - P2 SC Openshift and P1 VR Openshift - P2 SC Chaoss are not significant and show smaller effect sizes.
Group | Coefficient (TOTAL) | p-value | Eff. size
---|---|---|---
P1 SC Openshift - P2 VR Chaoss | 290.125 | 0.027 | 1.57
P1 VR Chaoss - P2 SC Openshift | 136.875 | 0.324 | 0.70
P1 VR Openshift - P2 SC Chaoss | 207.125 | 0.075 | 1.26
5.3 Effect of Confounding Variables
-
JOB_POSITION (Job position): The Kendall Tau coefficient for JOB_POSITION is close to zero, indicating a very weak or no monotonic relationship between JOB_POSITION and the total time to finish all the tasks. The p-value is greater than the typical significance level of 0.05, suggesting that this correlation is not statistically significant.
-
AGE (Age): The Kendall Tau coefficient for AGE is also close to zero, indicating a very weak or no monotonic relationship between AGE and the total time to finish all the tasks. The p-value is greater than 0.05, suggesting that this correlation is not statistically significant.
-
EXP_PRG (Experience in Programming): The Kendall Tau coefficient for EXP_PRG is positive and larger in magnitude, indicating a moderate positive monotonic relationship between EXP_PRG and the total time to finish all the tasks. The p-value is less than 0.05, indicating that this correlation is statistically significant at the 5% level. This suggests that as the value of EXP_PRG increases, the value of the total time to finish all the tasks tends to increase as well.
-
EXP_DATAVIZ (Experience in Data visualization applications): The Kendall Tau coefficient for EXP_DATAVIZ is close to zero, indicating a very weak or no monotonic relationship between EXP_DATAVIZ and the total time to finish all the tasks. The p-value is greater than 0.05, suggesting that this correlation is not statistically significant.
-
EXP_VR (Experience in Virtual Reality): The Kendall Tau coefficient for EXP_VR is close to zero, indicating a very weak or no monotonic relationship between EXP_VR and the total time to finish all the tasks. The p-value is greater than 0.05, suggesting that this correlation is not statistically significant.
Variable | Kendall Tau (vs. total completion time) | p-value
---|---|---|
JOB_POSITION | -0.050410 | 0.708797 |
AGE | 0.027671 | 0.830563 |
EXP_PRG | 0.336019 | 0.015252 |
EXP_DATAVIZ | -0.086642 | 0.537350 |
EXP_VR | 0.054354 | 0.693977 |
Variable | Kendall Tau (vs. average correctness) | p-value
---|---|---|
JOB_POSITION | 0.104167 | 0.492507 |
AGE | 0.165881 | 0.253402 |
EXP_PRG | 0.012381 | 0.936668 |
EXP_DATAVIZ | -0.154337 | 0.328973 |
EXP_VR | -0.322301 | 0.038110 |
-
JOB_POSITION (Job position): The Kendall Tau coefficient for JOB_POSITION is positive, indicating a weak positive monotonic relationship with the average of correctness. However, the p-value is greater than 0.05, suggesting that this correlation is not statistically significant. Therefore, we cannot conclude that there is a significant association between JOB_POSITION and the average of correctness.
-
AGE (Age): The Kendall Tau coefficient for AGE is positive, indicating a weak positive monotonic relationship with the average of correctness. However, the p-value is greater than 0.05, indicating that this correlation is not statistically significant. Therefore, we cannot conclude that there is a significant association between AGE and the average of correctness.
-
EXP_PRG (Experience in Programming): The Kendall Tau coefficient for EXP_PRG is close to zero, indicating a very weak or no monotonic relationship with the average of correctness. Additionally, the p-value is greater than 0.05, indicating that this correlation is not statistically significant. Therefore, there is no evidence of a significant association between EXP_PRG and the average of correctness.
-
EXP_DATAVIZ (Experience in Data visualization applications): The Kendall Tau coefficient for EXP_DATAVIZ is negative, indicating a weak negative monotonic relationship with the average of correctness. However, the p-value is greater than 0.05, suggesting that this correlation is not statistically significant. Therefore, we cannot conclude that there is a significant association between EXP_DATAVIZ and the average of correctness.
-
EXP_VR (Experience in Virtual Reality): The Kendall Tau coefficient for EXP_VR is negative, indicating a weak negative monotonic relationship with the average of correctness. The p-value is less than 0.05, indicating that this correlation is statistically significant at the 5% level. Therefore, there is evidence of a significant association between EXP_VR and the average of correctness, suggesting that as the value of EXP_VR decreases, the value of the average of correctness tends to increase. A sketch of how these correlation tests can be run is shown below.
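The sketch uses scipy's kendalltau and assumes a table with one row per subject, ordinal experience levels encoded as integers, and the outcome of interest as a numeric column; column names and values are illustrative assumptions, not the study data.

```python
import pandas as pd
from scipy.stats import kendalltau

# Hypothetical per-subject table: ordinal experience levels encoded as 0-4 (or years),
# plus the outcome of interest (total completion time, or average correctness).
subjects = pd.DataFrame({
    "EXP_VR":     [0, 1, 0, 2, 0, 1, 3, 0, 1, 0, 2, 1],
    "EXP_PRG":    [5, 10, 3, 15, 8, 12, 20, 4, 7, 9, 11, 6],
    "TOTAL_TIME": [820, 955, 790, 1010, 860, 930, 1100, 805, 870, 900, 980, 845],
})

# Kendall Tau and its p-value for each confounding variable against the outcome
for var in ("EXP_VR", "EXP_PRG"):
    tau, p_value = kendalltau(subjects[var], subjects["TOTAL_TIME"])
    print(f"{var}: Kendall Tau = {tau:.3f}, p-value = {p_value:.3f}")
```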
5.4 Feedback
5.4.1 Easier and Faster?
5.4.2 Advantages and Disadvantages
-
Advantages: The habit of using on-screen applications, and the “everyday” interaction with them, stand out again. Participants also mentioned that the information is in front of you and presented in a more compact manner, and that, in general, it is easier to use and the text is easier to read.
-
Disadvantages: The most highlighted disadvantage is that information is spread in two dashboards, which makes it more difficult to correlate data from visualizations in different dashboards. In general, participants also found that charts are more difficult to find, because they are all together, and are quite similar to each other. Some participants mentioned that dashboards are too crowded with charts, and on-screen space is very limited.
-
Advantages: One of the greatest advantages highlighted by participants is the use of space, having all visualizations in the same place. Also, the use of colors to distinguish pull requests from issues improves interaction and data correlation. The “museum with shelves” metaphor was also highlighted as a positive aspect, specifically the “shelves” that allow organizing the different graphs in a very intuitive way. Notably, participants reported feeling more focused in the VR environment, because there are no distractions around and everything they saw had to do with the experiment.
-
Disadvantages: The use of the VR headset and the associated discomfort (eye strain, lack of experience with the device, etc.) was a point that several participants noted as negative. Another negative aspect is the size of the texts and legends, which some participants reported to be difficult to read. Finally, the performance, and the drop in frames per second in some circumstances, was another aspect reported as negative.
5.4.3 Control Questions
6 Discussion
7 Threats to Validity
7.1 Internal Validity
-
Subjects. We ensured that all participants had experience in relevant topics related to programming by using a questionnaire, focusing on recruiting people with job positions related to software development (including academia and industry), reducing the threat that they were not competent enough. Moreover, we asked for their experience in the relevant topics to mitigate the threat that participants' experience was not distributed fairly. However, their prior familiarity with the environments used in the experiment (i.e., Kibana or BabiaXR) was not uniform, with participants being much less experienced with VR environments than with on-screen environments.
-
Tasks. The choice of tasks may have been biased in favor of Kibana or BabiaXR. We mitigated this threat by using Kibana dashboards validated by Bitergia, and replicating the same visualizations in the BabiaXR dashboard. Moreover, in the two environments we used exactly the same tasks, so the level of difficulty was as similar as possible. We also included tasks that put each mode at a disadvantage: tasks focused on precision could be easier on-screen, while tasks focused on locality could be easier in VR. Aspects that were not controlled (e.g., the external environment of the BabiaXR scene) could have influenced the results as well.
-
Training. In both environments (i.e., Kibana and BabiaXR) the text to be followed for performing the tasks explains how the tool is used and how the interaction with the elements works. No participant had relevant previous experience with Kibana or BabiaXR. Moreover, an optional tutorial about the first steps with the VR device was proposed to them (a generic starter tutorial included in the Oculus Quest 2). This could balance a bit the situation for VR participants, but given the extensive on-screen experience, this would hardly make them more efficient. It remains to be investigated whether a practical tutorial on how to interact with a VR headset could reduce the experience gap between VR and on-screen, improving the correctness of VR activities.
-
Fatigue and Learning Factors. The experiment design introduced a potential threat related to the influence of fatigue and learning factors on participants’ performance. Due to the sequential nature of the tasks, participants might experience fatigue as they progress through the experiment, which could impact their cognitive abilities, attention, and task performance. Moreover, the learning effect could influence participants’ performance over time, as they become more familiar with the tasks and the specific environments. The order of the tasks and the repetition of the tasks in different environments may interact with the learning and fatigue factors, potentially affecting the validity of the results.
-
Repeated Measures Design. The use of a repeated measures design, where participants are measured under different conditions, introduces potential threats to validity. One potential threat is order effects, where the order in which conditions are presented may impact participants’ performance or responses. To mitigate this threat, counterbalancing was employed, ensuring that participants experienced the conditions in different sequences. Another potential threat is carryover effects, where the experience of one condition may influence participants’ performance in subsequent conditions. To address this, appropriate rest periods were provided between conditions to minimize carryover effects. Also, we analyzed the learning factor, mitigating the threat.
-
Influence of Virtual Scene Design. The design of the virtual scene, including its structure and photorealism, may introduce confounding variables that affect participants’ performance and perception. Factors such as layout, color schemes, and object placement could impact participants’ cognitive processes, engagement, and sense of presence. The level of realism and visual design elements within the virtual scene may influence participants’ interpretation and interaction with the data. To mitigate this threat, we made efforts to create a representative virtual scene, but variations in responses due to individual differences and preferences may still exist. Future experiments will address the influence of virtual scene design as a potential confounding variable by carefully considering design elements and gathering participant feedback to better understand and control for these factors.
7.2 External Validity
-
Sample Size. The number of participants in the experiment is somewhat limited, which may affect the generalizability of the findings. Increasing the sample size would enhance the statistical power and reliability of the results. However, it should be noted that the current sample size is in the range commonly observed in similar experiments.
-
Subjects. We employed a combination of convenience sampling and targeted recruitment strategies to ensure subject representativeness. Convenience sampling allowed us to efficiently gather accessible and willing participants. Additionally, we actively recruited individuals meeting specific criteria related to job position and years of experience in programming topics. This involved reaching out to professional organizations, academic institutions, online communities, and industry networks. Our aim was to achieve a balanced mix of academics and professionals, ensuring diverse perspectives. By implementing these strategies, we sought to mitigate biases and enhance the representativeness of our subject sample.
-
Target System. Another threat is represented by the choice of the projects: CHAOSS and OpenShift. Participants did not know them in advance, except for one who knew the OpenShift project as a “Beginner”. We cannot assess how appropriate or representative CHAOSS and OpenShift are for the software development process tasks we designed, but the consistent variation in solutions for the same task in both the VR and on-screen environments signals that results could extend to other systems. That said, our experimental approach has been validated with the experience and expertise of Bitergia, so we can be confident that the tasks are commonly performed in real, industry settings.
7.3 Construct Validity
-
Time Measurement. To ensure accurate time measurement and mitigate potential inaccuracies in task completion times, we implemented specific strategies in our experiments. Firstly, a supervisor was present during each experiment run to record the time taken by participants to complete tasks. This provided a reliable and independent source of time measurement. Additionally, participants were instructed to verbally communicate their task completion to the supervisor, serving as a double-check for the recorded completion time. Moreover, the use of the Kibana and BabiaXR environments facilitated real-time task completion without the need for manual recording on paper, particularly advantageous in the VR environment where paper-based methods can be cumbersome. These measures helped minimize any potential errors or delays in time measurement, enhancing the internal validity of our study.
-
Experimenter Effect. One of the experimenters is one of the authors of BabiaXR, which may have influenced the experiment; for example, task solutions may not have been graded correctly. To mitigate this threat, this author did not interfere in the experiment, and whenever he had to intervene, the corresponding results were discarded. The experimenter built a model of the responses based on previous experiments in the literature (e.g., (Wettel et al. 2011; Romano et al. 2019)). Even though we tried to mitigate this threat extensively, we cannot exclude all possible influences on the results of the experiment.