In this case study, we have followed the implementation of Tim, a tool for test results exploration and visualization (TREV) at Westermo, a company that develops embedded systems for industrial networking applications. The main findings of this paper are (i) four patterns for TREV: filtering, aggregation, previews and comparisons; (ii) the eight views implemented for TREV in Tim; as well as (iii) the identification of six challenges with respect to TREV: expectations, anomalies, navigation, integrations, hardware details and plots. These findings can serve as a starting point, or be more generally relevant, for other researchers or practitioners who strive to implement a TREV tool in a similar or slightly different context. For the case company, the Tim tool is already in use and continued improvements are planned.
Similar studies, such as those stemming from Q-Rapids [6] (discussed in more detail in Sect. 5.3.2), have found challenges similar to ours, which implies that these are not unique to the case company. When it comes to the choice of views, many tools and visual elements are possible. In their theory of distances, Bjarnason et al. [3] argue that software development practices increase, decrease or bridge distances between actors. Using their terminology, one could argue that TREV strives to decrease or bridge cognitive and navigational distances between, on the one hand, actors in the software development process (software developers, test framework developers, project managers, etc.) and, on the other hand, artifacts such as aggregated, split and/or plain test results and log files.
5.1 Revisiting the industry problems and process
The old system had three main problems. First, to counter the problem of an unknown match between implementation and user needs, the same kanban process already in use in the test framework team was used also for Tim (including, e.g., requirements, development and testing of functional increments). Requirements were elicited in the form of user stories or tasks, both in our previous work [48] and during the implementation. Prioritization and scoping of tasks were done with the reference group and the test framework team manager.
The second problem was related to scalability, in that visual elements would not fit on screen and performance had degraded. The database performance issues were addressed by refactoring the database layout (see Fig. 5), improving the architecture (Fig. 6) and separating the test environment for the tool from the environment in which it is used. This separation facilitated the implementation of test data generators to allow rapid testing of database calls in the backend, i.e., testing of both the functional correctness and the performance of the backend and its database queries. Furthermore, we logged actual usage of Tim and conclude that 98.4% of backend calls were faster than 10 seconds (Fig. 15), which is acceptable for the needs of the company. The poor scalability of the user interface was addressed by implementing Tim incrementally with a reference group that also evaluated the views of the tool (Figs. 7, 8, 9, 10, 11, 12, 13, and 14).
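As an illustration of how such a figure could be derived from usage logs, the sketch below computes the share of backend calls faster than a threshold. It assumes a simplified CSV log with a `duration_ms` column; the file format and all names are illustrative assumptions, not Tim's actual logging.

```python
# A minimal sketch for computing the share of "fast enough" backend calls
# from a usage log. The CSV format and all file/column names are assumptions
# for illustration, not Tim's actual log format.
import csv

THRESHOLD_MS = 10_000  # 10 seconds, the acceptability limit discussed above

def share_of_fast_calls(log_path: str) -> float:
    """Return the fraction of logged backend calls faster than THRESHOLD_MS."""
    fast = total = 0
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):  # expects a duration_ms column per call
            total += 1
            fast += float(row["duration_ms"]) < THRESHOLD_MS
    return fast / total if total else 0.0

if __name__ == "__main__":
    print(f"{share_of_fast_calls('backend_calls.csv'):.1%} of calls were fast enough")
```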
The challenge of the too wild technological flora was addressed with technological trials, in particular in the pre-alpha period, which also meant that several months per person were spent on learning the new tools and languages alone. (Some more details on the contents of the technological flora are discussed in Sect. 4.1.)
The views in Tim are enablers for TREV in different phases of the development process. During the implementation, and in the user documentation and training material for Tim, we used the three phases of daily work, merge time and release time when explaining the views. In short, the first views mainly target daily work, when a developer wants to dive deep into data and explore individual verdicts, debug messages, or perhaps timing within individual log files. When debugging, a user might want to know whether an issue is also present on other code branches or on other systems, whether it has been present over time, and whether it fails intermittently, which motivates the heatmap view. As features are implemented, a new perspective might be needed: Do we have the desired test coverage? Have the non-functional aspects degraded? Is the main branch we want to merge into as stable as our branch? These questions are addressed by the measurements and compare branch views. Finally, when a new WeOS version is about to be released, the analyze branch view can answer whether any test case has failed on this branch in the last few days, and a user can drill down into where (on which test system) and when (in time) the failures occurred. This view also shows the test intensity the branch has received in terms of the number of test systems used and test cases executed (not shown in the heatmap).
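To make the heatmap use case concrete, the sketch below shows one way verdicts could be filtered and aggregated into a branch-by-day grid, where a cell containing both passing and failing outcomes on the same day hints at an intermittent failure. This is a minimal illustration under assumed names: the verdict record and functions are ours, not Tim's actual data model.

```python
# A minimal sketch of the filtering and aggregation behind a heatmap-style
# view: verdicts for one test case are filtered, then bucketed per branch
# and day. All field and function names are illustrative assumptions.
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

@dataclass
class Verdict:
    test_case: str
    branch: str
    system: str
    day: date
    passed: bool

def heatmap(verdicts: list[Verdict], test_case: str) -> dict[str, dict[date, str]]:
    """Map branch -> day -> cell, where a cell is 'pass', 'fail' or 'mixed'."""
    buckets: dict[str, dict[date, set[bool]]] = defaultdict(lambda: defaultdict(set))
    for v in verdicts:
        if v.test_case == test_case:                # filtering
            buckets[v.branch][v.day].add(v.passed)  # aggregation
    return {
        branch: {
            # both outcomes on the same day suggest an intermittent failure
            day: "mixed" if len(outcomes) > 1 else ("pass" if True in outcomes else "fail")
            for day, outcomes in days.items()
        }
        for branch, days in buckets.items()
    }
```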
In Table 3, we map the views onto the steps in the process. For example, the measurements view can be seen as primarily supporting work and decisions at merge time, but it can also be somewhat useful in daily work and at release time.
Table 3
Mapping of the steps in the software development process—daily work (D), branch merge (M) and release (R)—with implemented views in Tim
View | D | M | R
Start | Y | – | –
Outcomes | Y | – | –
Outcome | Y | – | –
Session | Y | – | –
Heatmap | Y | (Y) | –
Measurements | (Y) | Y | (Y)
Compare branch | – | Y | (Y)
Analyze branch | – | (Y) | Y
Y = primary support, (Y) = somewhat useful, – = not targeted
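If one wanted to make a mapping like Table 3 operational in a tool, e.g., to suggest relevant views for the current process step, a simple lookup structure would suffice. The sketch below is hypothetical and only mirrors the table; the identifiers are ours, not Tim's.

```python
# Hypothetical encoding of Table 3: each view maps to the process steps it
# supports primarily ("Y") or partially ("(Y)"). Names are illustrative only.
SUPPORT = {
    "start":          {"daily": "Y"},
    "outcomes":       {"daily": "Y"},
    "outcome":        {"daily": "Y"},
    "session":        {"daily": "Y"},
    "heatmap":        {"daily": "Y", "merge": "(Y)"},
    "measurements":   {"daily": "(Y)", "merge": "Y", "release": "(Y)"},
    "compare_branch": {"merge": "Y", "release": "(Y)"},
    "analyze_branch": {"merge": "(Y)", "release": "Y"},
}

def views_for(step: str) -> list[str]:
    """Views relevant at a given step ('daily', 'merge', 'release'), primary first."""
    return sorted((view for view, steps in SUPPORT.items() if step in steps),
                  key=lambda view: SUPPORT[view][step] != "Y")
```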
5.2 Validity analysis
In this section, we discuss the validity of the findings in terms of rigor, relevance, generalizability, construct validity, internal validity and reliability.
At the core of rigor are carefully considered and transparent research methods [23, 40, 42, 54]. This study was planned and conducted as a case study, based on the well-known guidelines by Runeson et al. [40]. The study involves both quantitative and qualitative data that were collected and analyzed in a systematic manner.
One way for research to be relevant is for the research party and the industry party to share a common understanding of the problem and to be able to communicate [12, 19, 22, 42]. This can be a challenge, as Sannö et al. point out [42], because these two parties typically differ in perspective with respect to problem formulation, methodology and results. The constructs of this study (challenges, patterns and views) are rather straightforward and pose no major threats to construct validity. It is of course possible that we, in academic communication, "speak another language" than the participants of the reference group, which could lead to threats to construct validity. One could argue that the prolonged involvement and the frequency of reference group meetings are part of the mitigation for the threats to both relevance and construct validity.
In a paper titled "...Generalizability is Overrated," Briand et al. argue just that [7]. Similarly, Hevner et al. argue that one ought to conduct work in a specific environment, which may decrease generalizability [20]. Generalizability concerns to what extent the findings are applicable to other researchers, practitioners or domains. Case studies very often claim limited generalizability, and this study is no exception. One could argue that developing a tool similar to Tim for a more general audience (perhaps as part of an open-source tool for unit-level testing, etc.) would have improved the study's generalizability. However, that might not have incorporated the complexities of working in the industry context (with test selection, hardware selection, parallel branches, etc.), which are at the core of our work on Tim.
Ralph et al. argue that, for action research, it is essential to cover the evaluation of the intervention, the reactions from the reference group, as well as a chain of evidence from observations to findings [36]. These all relate to the causality in the study and to internal validity. As we have discussed above, the motivation for implementing Tim was driven by problems with requirements, technological flora and scalability. To summarize: desirables for the new system were defined, both in previous work and with the reference group; implementation was done iteratively and evaluated at reference group meetings; and data were collected at meetings and from logging the use of Tim. However, one could ask whether we implemented a certain view because it enables test results exploration better than any other view, because the reference group wanted it, or because we as researchers, for some other reason, desired to implement it to see what would happen. Furthermore, during one of the reference group meetings a project manager requested: "In general, start migrating existing functionality from the old system into the new system, then work with improvements," which implies the cognitive bias of anchoring: the users (and researchers) were used to the old system and other systems. In other words, the views in Tim are not free from bias.
Threats to reliability can be summarized as "would another researcher in this setting produce the same results?" Implementing visualizations as a researcher depends heavily on the skills that researcher has in a tool: a researcher already skilled in a JavaScript framework other than Vue/Vuex might have favored that instead, a researcher very skilled in native MacOS GUI development would perhaps have implemented a desktop application for Apple computers, a researcher with expert knowledge in pie charts would have favored those, etc. We speculate, however, that those views would have had many similarities to the ones we produced; perhaps a "pie-stack" with top-pies, sub-pies and sunburst charts could all have been implemented with the same patterns we observed. Perhaps this hypothetical pie-stack would work just as well as or better than Tim? In short, other researchers might have produced other views, but we argue that at least some of the patterns would have been similar or the same. For example, one would still have to filter and aggregate at the least, and both comparisons and previews would most likely be useful as well, even for a pie-stack.
Related to validity are two principles in research ethics: scientific value, that "research should yield fruitful results for the good of society, and not be random and unnecessary," and researcher skill, that "the researchers should have adequate skills" [46]. If we, as researchers, had implemented views at random with poor or no skills in, e.g., web development, then the research would not only have poor validity, it would also be unethical. To combat these ethical threats, many person months have been invested in technological skills (learning database, backend and frontend programming) and researcher skills (participating in a research school, etc.), and we have considered what is valuable to society (in particular the case company) when prioritizing implementation.
To conclude, there are validity and ethical threats to this study, but as suggested by Merino et al., Munzner, Runeson et al., Strandberg [27, 28, 40, 46], and many others, we have made efforts to mitigate the risks one could expect, by means of triangulation (collecting data from diverse sources), prolonged involvement (knowing the domain, conducting the study over several months, and collecting data from 201 days of use), member checking, etc.