1 Introduction
The objectives of the competition were:
- To stimulate the development of new efficient and practical runtime verification tools and the maintenance of the already developed ones.
- To produce benchmark suites for runtime verification tools, by sharing case studies and programs that researchers and developers can use in the future to test and to validate their prototypes.
- To discuss the measures employed for comparing the tools.
- To compare different aspects of the tools running with different benchmarks and evaluating them using different criteria.
- To enhance the visibility of presented tools among the different communities (verification, software engineering, distributed computing and cyber security) involved in software monitoring.
2 Phases and rules of the competition
2.1 Collection of benchmarks
- A program package containing the program source code (the program to be monitored), a script to compile it, a script to run the executable, and an English description of the functionality of the program.
- A specification package containing a collection of files, each describing a property: an English description of the property, a formal representation of it in the logical system supported by the team, instrumentation information, and the expected verdict (the evaluation of the property on the program execution).
- A specification package containing a collection of files describing a property: an English description of the property, a formal representation of it in the logical system supported by the team, and the expected verdict (the evaluation of the property on the trace).
2.2 Training phase and monitor collection phase
2.3 Benchmark evaluation phase
3 Participating teams and tools
3.1 C track
3.1.1 RiTHM
3.1.2 E-ACSL
`\valid(p)`, which indicates that the pointer `p` points to a memory location that the program can write and read). The generated checks rely on the function `e_acsl_assert`. By default, this function stops the program execution with a precise error message if its argument `a` is 0 (i.e., false) and simply continues the execution otherwise. The generated code is linked against a dedicated memory library which can efficiently compute the validity of complex memory-related properties (e.g., use-after-free or initialization of variables) [56, 64].

3.1.3 RTC
3.2 Java track
3.2.1 Larva
3.2.2 jUnit\(^\mathrm {RV}\)
Tool | References | Contact person | Affiliation |
---|---|---|---|
RiTHM2 | [71] | B. Bonakdarpour | McMaster Univ. and U. Waterloo, Canada |
STePr | | N. Decker | ISP, University of Lübeck, Germany |
3.2.3 JavaMOP
3.2.4 Monitoring at runtime with QEA (MarQ)
`Forall` or `Exists` quantifications with an optional `Where` constraint restricting the considered values. States can be `accept` states, indicating that a trace is accepted if any path reaches an `accept` state. There are two other state modifiers: `skip` indicates that missing transitions are self-looping; `next` indicates that missing transitions implicitly go to the `failure` state. The failure state is an implicit non-accept state with no outgoing transitions; once the failure state has been reached, success (for this binding) is not possible.

3.3 Offline track
3.3.1 RiTHM-2
3.3.2 MonPoly
Tool | Available at (URL) |
---|---|
E-ACSL (ver. 0.4.1) | |
JavaMop (ver. 4.2) | |
jUnit\(^{RV}\) | |
Larva | |
MonPoly | |
QEA (MarQ) | |
RiTHM/RiTHM2 | |
RTC | |
STePr | |
3.3.3 STePr
3.3.4 Monitoring at runtime with QEA (MarQ)
(Table: summary of the features of the participating tools — RiTHM, E-ACSL, RTC, Larva, jUnit\(^{RV}\), JavaMop, MonPoly, STePr, and MarQ — along four dimensions: the input requirement specification (user-enabled or built-in; propositional or parametric events; automata-based, logic-based, or regular-expression-based; logical-time or real-time), the instrumentation (own instrumentation, relying on AspectJ, or relying on another technique), the monitored systems (C programs, Java programs, or traces), and the monitoring mode (time-triggered or event-triggered).)
3.4 Summary
4 Benchmarks for the monitoring competition
4.1 C track
4.1.1 Maximum chunk size in Dropbox connections
4.1.2 Changes in the chunk size of Dropbox connections
4.1.3 Maximum bandwidth of Youtube connections
4.1.4 Changes in the bandwidth of Youtube connections
4.1.5 Allowed operations on files and sockets
This benchmark relies on the `strace` utility in Linux to produce a system call trace for a set of processes.

4.1.6 Binary search
`key` in a sorted array `a` of a given `length`. The `main` function of the program calls this function 3 times on an array of 2,000,000 elements in order to:
- search an existing key and check that the returned index is correct;
- search an unknown key and check that the function returns `-1`;
- search an existing well-chosen key in a wrongly sorted array (1 element is misplaced) and check that the function incorrectly returns `-1`.
`binary_search`:
- it takes as input a positive `length` and a fully allocated sorted array of at least `length` elements;
- it returns either an index `idx` such that `a[idx] == key`, or `-1` if there is no such index.

The first `requires` clause states that each cell of the array must be correctly allocated, the second one states that the array must be sorted, and the third one indicates that the length must be positive. Then, the first behavior says that if the searched `key` exists in the array, the result of the function must be an array index corresponding to this key, while the second behavior says that the function returns `-1` if there is no such index.
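The contract just described can be illustrated on a plain C implementation; the sketch below paraphrases the ACSL clauses in comments and does not reproduce the benchmark's exact annotations or source code.

```c
/* Illustrative sketch of the benchmark's binary search; the ACSL-style
   contract is paraphrased in comments rather than copied from the benchmark.
   requires: length > 0 and a[0..length-1] fully allocated and sorted
   ensures:  \result == -1, or 0 <= \result < length and a[\result] == key */
int binary_search(const int *a, int length, int key) {
    int low = 0, high = length - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;   /* avoids overflow of low + high */
        if (a[mid] == key)
            return mid;                     /* first behavior: key found */
        else if (a[mid] < key)
            low = mid + 1;
        else
            high = mid - 1;
    }
    return -1;                              /* second behavior: key absent */
}
```

On a wrongly sorted array, the dichotomy can skip over the misplaced key, which is why the third call in the benchmark observes an incorrect `-1`.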
4.1.7 Merging arrays
The `main` function of the program calls both functions with one array of 6000 elements and another one of 4000 elements.
The merge function:
- takes as input two positive lengths `l1` and `l2` and two sorted arrays `t1` and `t2` of length `l1` and `l2`, respectively;
- modifies `t3` such that it is a sorted array of length `l1+l2` where each element belongs to either `t1` or `t2`. Reciprocally, each element of `t1` and `t2` must belong to `t3`.
The first two requirements state that the lengths `l1` and `l2` of the input arrays `t1` and `t2` must be positive, while the two other requirements say that these arrays must be sorted. If the requirements of the function are satisfied, it must ensure four postconditions provided by the `ensures` clause. The first one indicates that the output array `t3` must be sorted. The second (resp. third) postcondition indicates that each element of `t1` (resp. `t2`) must also belong to `t3`, while the last postcondition states the reverse condition, which is that each element of `t3` belongs to either `t1` or `t2`.
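A straightforward implementation matching this contract can be sketched as follows; the function name and exact signature are assumptions for the sketch, with the contract paraphrased in comments.

```c
/* Illustrative sketch of a merge function satisfying the contract above;
   the name and signature are assumptions, not the benchmark's code.
   requires: l1 > 0, l2 > 0, t1 sorted of length l1, t2 sorted of length l2,
             t3 allocated for at least l1 + l2 elements
   ensures:  t3 is a sorted array containing exactly the elements of t1 and t2 */
void merge(const int *t1, int l1, const int *t2, int l2, int *t3) {
    int i = 0, j = 0, k = 0;
    while (i < l1 && j < l2)               /* take the smaller head element */
        t3[k++] = (t1[i] <= t2[j]) ? t1[i++] : t2[j++];
    while (i < l1) t3[k++] = t1[i++];      /* copy the remainder of t1 */
    while (j < l2) t3[k++] = t2[j++];      /* copy the remainder of t2 */
}
```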
4.1.8 Quicksort
The function sorts `array` between two indexes `left` and `right`. This implementation uses two helper functions, `partition` and `swap`. Only the first one is given below; the second one, which swaps the contents of two array cells, is straightforward. The `main` function of the program contains 2 calls to the `quicksort` function on an unsorted array of 10,000 elements. The first call is a correct one, but the second call gives 10,001 as the right bound instead of 10,000 (at most).
`quicksort` is called on a fully allocated array up to the given `length`. `left` and `right` must be valid (non-null and pointing to a memory location that the program is allowed to write).
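The structure described above can be sketched as follows; the pivot strategy and the body of `partition` are assumptions, since only the overall shape of the benchmark code is described here, and `left`/`right` are taken as inclusive indexes.

```c
/* Illustrative sketch of a quicksort with the helper functions named in the
   benchmark description; details such as the pivot choice are assumptions. */
static void swap(int *a, int i, int j) {
    int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
}

static int partition(int *a, int left, int right) {
    int pivot = a[right];          /* last element taken as pivot */
    int i = left - 1;
    for (int j = left; j < right; j++)
        if (a[j] <= pivot)
            swap(a, ++i, j);
    swap(a, i + 1, right);         /* put the pivot in its final place */
    return i + 1;
}

void quicksort(int *a, int left, int right) {
    if (left < right) {
        int p = partition(a, left, right);
        quicksort(a, left, p - 1);
        quicksort(a, p + 1, right);
    }
}
```

Passing 10,001 as the right bound, as the second call in the benchmark does, makes `partition` read one cell past the end of the array, which is exactly the out-of-bounds access the monitors are expected to detect.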
4.1.9 Accesses to arrays without off-by-one nor out-of-bounds
4.1.10 Absence of buffer overflow in a palindrome generator
4.1.11 Absence of negative pointers in function calls
4.2 Java track
4.2.1 Gold users of the financial transaction system
4.2.2 Initialization in the financial transaction system
4.2.3 Negative balance in the financial transaction system
4.2.4 Unique account in the financial transaction system
4.2.5 Reactivation in the financial transaction system
4.2.6 Incrementing a counter
4.2.7 Request and response
4.2.8 Locking critical resources
4.2.9 Velocity of an object
4.2.10 Non-crossing routes
4.2.11 HasNext on DaCapo Avrora
4.2.12 Safe usage of locks
`t`, respectively. Events \( acquire(t,l) \) and \( release(t,l) \) map, respectively, to a call to methods lock() and unlock() on lock `l` in a thread `t`.

4.2.13 Calling methods the same number of times
`a`, `b` and `c` map to the calls to the methods A(), B(), and C(), respectively. The event `done` corresponds to the end of the program.

4.2.14 Safe usage of maps with iterators
4.2.15 HasNext on full DaCapo
4.2.16 SafeSyncCollection on full DaCapo
The `sync` event relates to creating a synchronized collection, the `asyncCreateIter` and `syncCreateIter` events relate to creating iterators without and with the collection’s lock, respectively, and the `accessIter` event is the ‘bad’ access without holding the lock.

4.2.17 UnsafeIterator on full DaCapo
4.2.18 UnsafeMapIterator on full DaCapo
4.2.19 Combination of properties on DaCapo suite
4.2.20 HasNext on DaCapo Batik
The states represent `unsafe` and `safe` iteration, respectively, i.e., when it is valid to call next() (mapped to `donext` to distinguish it from the state modifier). A `hasnext` event with a `true` result fires a transition from the `unsafe` to the `safe` state. A `donext` event in the `unsafe` state leads to `failure`.
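The automaton just described can be sketched as a small state machine; for uniformity with the other sketches it is written here in C with invented handler names, whereas the benchmark itself monitors Java iterator events. The transition of `donext` in the `safe` state (back to `unsafe`) is an assumption based on the usual formulation of the HasNext property.

```c
#include <stdbool.h>

/* Illustrative monitor for the HasNext property described above. */
typedef enum { UNSAFE, SAFE, FAILURE } state_t;

/* hasnext event carrying its boolean result */
state_t on_hasnext(state_t s, bool result) {
    if (s == FAILURE) return FAILURE;   /* failure is absorbing */
    return result ? SAFE : UNSAFE;      /* a true result makes next() safe */
}

/* donext event: calling next() is only valid in the SAFE state */
state_t on_donext(state_t s) {
    if (s == SAFE) return UNSAFE;       /* hasnext must be checked again */
    return FAILURE;                     /* next() without a true hasnext */
}
```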
The `hasnext` and `donext` events relate to the corresponding events in java.util.Iterator. For example, the following traces satisfy the property.

4.2.21 Safe iterators on DaCapo Batik
The `iterate` state is entered when the iterator is created, and a `donext` event can occur as long as the number of previous `donext` events is no more than `csize`. As `iterate` is a `next` state, if the transition cannot be taken due to the guard `csize > 0` failing, then there is a transition to the `failure` state.

The `iterator` event maps to a call to java.util.Collection.iterator; a call to size() is required to extract `csize`. For example, the following traces satisfy the property.

4.2.22 Persistent hashcodes on DaCapo Batik
The property requires that the `hashCode` of an object remains the same while the object is inside the collection. The property can be captured by the following QEA. The states `in` and `out` indicate whether the quantified object `o` is in a hashing structure. The counter `count` is used to count the number of occurrences. Note that this relies on the fact that these structures are set-like and `o` cannot belong to a collection more than once. Without this restriction, the QEA would need to also have a quantification over collections. The `hashCode` on insertion is recorded in `h` and checked later; note the use of a `next` state to ensure that only valid transitions are taken in the `out` state.

The `add`, `remove`, and `observe` events relate to the successful corresponding (add, put, remove, get, contains, containsKey) methods in java.util.HashSet or java.util.HashMap.

4.2.23 Lock ordering on DaCapo Avrora
4.3 Offline track
4.3.1 Maximum chunksize of Dropbox connections
4.3.2 Evolution of the chunksize of Dropbox connections
4.3.3 Maximum bandwidth for Youtube connections
4.3.4 Changes in the bandwidth of Youtube connections
4.3.5 Closing opened files by processes
The trace is produced with the `strace` utility in Linux. The log file of `strace` is preprocessed to produce an XML file in the format described in Sect. 2.1. The entries of the log files contain the process IDs, file descriptors, and the details of the operations performed on the files.

4.3.6 Reporting financial transactions of a banking system
4.3.7 Authorizing financial transactions in a banking system
4.3.8 Approval policy of business reports within a company
4.3.9 Withdrawals of users over time
4.3.10 Data usage in Nokia’s Lausanne data-collection campaign
4.3.11 Early alarm of machine operations
`log`. That is, between consecutive traces there is an entry such as the following. Single positions within a trace are represented by events with name `step`, such as, for example, the following. The property is to be evaluated separately on the traces represented in the log file.
4.3.12 Duration of machine operations
4.3.13 Maximal error rates of machines operations
4.3.14 Ordering of machine operations
- \(P_1\): Groups do not overlap
- \(P_2\): Ending groups must have started
- \(P_3\): Phases are included in groups
- \(P_4\): All processes must have finished before the next phase
- \(P_5\): Init before run
- \(P_6\): Run before finish
- \(P_7\): Init after finish
- \(P_8\): Init, run, finish not outside group
- \(P_9\): Group duration is less than 480 s
If `group_start`, `group_end` or `phase_start` is not explicitly specified in the log, the respective value is assumed to be false. We now express the constraints above by the formulae.

4.3.15 Existence of a leader Rover
`ping` followed by `ack`. The quantifications are non-trivial. As expected, there is an existential quantification followed by a universal quantification and the constraint that `r1` and `r2` are not equal. The `Join` statement captures the fact that the domains of quantification for `r1` and `r2` should be equal. This is important as domains of quantification are extracted from the trace using the alphabet of the automaton, and without this declaration the domains may not be equal.
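Stripped of the QEA machinery, the quantifier structure can be read as a direct check over a finite trace: there exists a rover that pings every other rover and later receives an acknowledgement from each. The sketch below uses an invented event encoding purely for illustration.

```c
#include <stdbool.h>
#include <string.h>

/* Illustrative check of the leader property above: exists r1 such that for
   all r2 != r1, a ping(r1,r2) is later followed by an ack(r2,r1).
   The event encoding and names are assumptions for the sketch. */
typedef struct { const char *kind; int from; int to; } event_t;

/* Did r1 ping r2 and later receive an ack back from r2? */
static bool ping_then_ack(const event_t *tr, int n, int r1, int r2) {
    for (int i = 0; i < n; i++)
        if (!strcmp(tr[i].kind, "ping") && tr[i].from == r1 && tr[i].to == r2)
            for (int j = i + 1; j < n; j++)
                if (!strcmp(tr[j].kind, "ack") && tr[j].from == r2 && tr[j].to == r1)
                    return true;
    return false;
}

/* Exists r1 . Forall r2 != r1 . ping(r1,r2) followed by ack(r2,r1) */
bool has_leader(const event_t *tr, int n, int num_rovers) {
    for (int r1 = 0; r1 < num_rovers; r1++) {
        bool leader = true;
        for (int r2 = 0; r2 < num_rovers && leader; r2++)
            if (r2 != r1 && !ping_then_ack(tr, n, r1, r2))
                leader = false;
        if (leader) return true;
    }
    return false;
}
```

Note that this naive check assumes the rover domain is known up front, whereas a QEA extracts the quantification domains from the trace alphabet, which is precisely why the `Join` declaration above matters.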
4.3.16 Granting resources to tasks
The QEA quantifies over a resource `r` and uses two free variables `t1` and `t2` to check mutual exclusion of the task holding the resource. Note that in the `granted` state any `grant` event leads to failure, and a `cancel` event with a different task will also lead to failure as this is a `next` state.
4.3.17 Nested command messages
The trace contains `command` and `success` messages. The property states that if command with identifier B starts after command with identifier A has started, then command B must succeed before command A succeeds. It can be assumed that every command is started and succeeds exactly once, i.e., this is a property that has been checked separately. The command identifiers `c1` and `c2` are quantified over, and the states `none`, `startedOne` and `startedTwo` indicate whether command 1 or 2 has started, respectively. Note that the property is symmetric, so there are two instances for each pair of commands reflecting the two orderings of commands.
4.3.18 Resource lifecycle
- A resource may be requested
- A requested resource may be denied or granted
- A granted resource may be rescinded or cancelled
- A resource may only be requested when not currently requested or granted
- A granted resource must eventually be cancelled

The QEA quantifies over the resource `r` and captures the lifecycle in the automaton structure.
4.3.19 Respecting conflicts of resources
The QEA quantifies over two resources `r1` and `r2` and detects a conflict declaration between these two resources. After this point, there is a mutual exclusion between the two resources.
5 Evaluation: calculating scores
- the correctness score \(C_{i,j}\),
- the overhead score \(O_{i,j}\), and
- the memory-utilization score \(M_{i,j}\).
- Since several benchmarks are provided in each track, we wanted to give participants the possibility to compete on a benchmark or not. We allocated a maximum number of points that could be gained on each benchmark. In our opinion, this limited the influence of failure or success on any single benchmark and rewarded the overall behavior of tools on the benchmarks in a track.
- We placed strong emphasis on the correctness of monitoring verdicts. As such, the scoring mechanism gives priority to correctness of verdicts: performance is evaluated on a benchmark only when a tool provides the correct verdict, and negative points are assigned on a benchmark when a tool produces a false verdict or crashes.
- Within a benchmark, scores are assigned to participants/tools based on how well they perform compared to each other. Moreover, the proportion of points in a benchmark assigned to a tool depends on a performance ratio relative to the average performance of the other tools. The average performance of other tools is computed with the geometric mean (because we dealt with normalized numbers [49]).
5.1 Correctness score
- \(C_{i,j} = 0\), if the property associated with benchmark \(B_j\) cannot be expressed in the specification language of \(T_i\).
- \(C_{i,j} = -10\), if, in case of online monitoring, the property can be expressed, but the monitored program crashes.
- \(C_{i,j} = -5\), if, in case of online monitoring, the property can be expressed and no verdict is reported after \(10 \times E_j\).
- \(C_{i,j} = -5\), if, in case of offline monitoring, the property can be expressed, but the monitor crashes.
- \(C_{i,j} = -5\), if the property can be expressed, the tool does not crash, and the verification verdict is incorrect.
- \(C_{i,j} = 10\), if the tool does not crash, it allows expressing the property of interest, and it provides the correct verification verdict.
5.2 Overhead score
- In the case of offline monitoring, for the overhead, we consider the elapsed time until the property under scrutiny is found to be either satisfied or violated. If monitoring (with tool \(T_i\)) of the trace of benchmark \(B_j\) executes in time \(V_i\), then we define the overhead as$$\begin{aligned} o_{i,j} = {\left\{ \begin{array}{ll} {\frac{1}{V_i}} &{}\text{ if } C_{i,j} > 0,\\ 0 &{} \text{ otherwise. } \end{array}\right. } \end{aligned}$$
- In the case of online monitoring (C or Java), the overhead associated with monitoring is a measure of how much longer a program takes to execute due to runtime monitoring. If the monitored program (with monitor from tool \(T_i\)) executes in \(V_{i,j}\) time units, we define the overhead index as$$\begin{aligned} o_{i,j} = {\left\{ \begin{array}{ll} {\frac{\root N \of {\prod _{l=1}^N{V_{l,j}}}}{V_{i,j}}} &{}\text{ if } C_{i,j} > 0, \\ 0 &{} \text{ otherwise. } \end{array}\right. } \end{aligned}$$In other words, the overhead index for tool \(T_i\) evaluated on benchmark \(B_j\) is the geometric mean of the overheads of the monitored programs with all tools over the overhead of the monitored program with tool \(T_i\).
5.3 Memory-utilization score
- In the case of offline monitoring, we consider the maximum memory allocated during the tool execution. If monitoring (with tool \(T_i\)) of the trace of benchmark \(B_j\) uses a quantity of memory \(D_i\), then we define the memory-utilization index as$$\begin{aligned} m_{i,j} = {\left\{ \begin{array}{ll} {\frac{1}{D_i}} &{}\text{ if } C_{i,j} > 0,\\ 0 &{} \text{ otherwise, } \end{array}\right. } \end{aligned}$$
- In the case of online monitoring (C or Java tracks), memory utilization associated with monitoring is a measure of the extra memory the monitored program needs (due to runtime monitoring). If the monitored program uses \(D_{i,j}\), we define the memory utilization as$$\begin{aligned} m_{i,j} = {\left\{ \begin{array}{ll} {\frac{\root N \of {\prod _{l=1}^N{D_{l,j}}}}{D_{i,j}}} &{}\text{ if } C_{i,j} > 0, \\ 0 &{} \text{ otherwise. } \end{array}\right. } \end{aligned}$$That is, the memory-utilization index for tool \(T_i\) evaluated on benchmark \(B_j\) is the geometric mean of the memory utilizations of the monitored programs with all tools over the memory utilization of the monitored program with tool \(T_i\).
5.4 Final score
6 Results
6.1 Scores for the C track
 | RiTHM | | | E-ACSL | | | RTC | | |
---|---|---|---|---|---|---|---|---|---|
Reference to benchmark description | Verdict | Mem (MB) | Ovhd (s) | Verdict | Mem (MB) | Ovhd (s) | Verdict | Mem (MB) | Ovhd (s) |
 | v-score | m-score | o-score | v-score | m-score | o-score | v-score | m-score | o-score |
Section 4.1.1 | F | 1012756 | 0.68 | N/A | N/A | N/A | N/A | N/A | N/A |
10 | 10 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | |
Section 4.1.2 | F | 1012756 | 0.68 | N/A | N/A | N/A | N/A | N/A | N/A |
10 | 10 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | |
Section 4.1.3 | F | 614168 | 0.42 | N/A | N/A | N/A | N/A | N/A | N/A |
10 | 10 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | |
Section 4.1.4 | F | 614168 | 0.42 | N/A | N/A | N/A | N/A | N/A | N/A |
10 | 10 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | |
Section 4.1.5 | F | 647696 | 0.69 | N/A | N/A | N/A | N/A | N/A | N/A |
10 | 10 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | |
Section 4.1.6 | F | 11916 | 0.01 | F | 12980 | 0.30 | F | 37040 | 0.19 |
10 | 4.46 | 9.59 | 10 | 4.10 | 0.16 | 10 | 1.44 | 0.14 | |
Section 4.1.7 | F | 5388 | 0.001 | F | 4320 | 4.87 | F | 5984 | 0.01 |
10 | 3.18 | 9.09 | 10 | 3.96 | 0 | 10 | 2.86 | 0.29 | |
Section 4.1.8 | F | 4236 | 0.01 | F | 4628 | 2.66 | F | 5856 | 0.19 |
10 | 3.79 | 9.52 | 10 | 3.47 | 0.03 | 10 | 2.74 | 0.27 | |
Section 4.1.9 | F | 4212 | N/A | F | 4312 | N/A | F | 5792 | 0.01 |
10 | 3.70 | 0 | 10 | 3.61 | 0 | 10 | 2.69 | 10 | |
Section 4.1.10 | F | 4212 | N/A | N/A | N/A | N/A | F | 1344 | N/A |
10 | 2.42 | 0 | 0 | 0 | 0 | 10 | 7.58 | 0 | |
Section 4.1.11 | F | 4216 | N/A | F | 6804 | N/A | F | N/A | 0.23 |
10 | 6.17 | 0 | 10 | 3.83 | 0 | 10 | 0 | 10 |
Rank | Team name | Correctness score | Overhead score | Memory score | Total score |
---|---|---|---|---|---|
1 | RITHM-1 | 110 | 78.19 | 73.72 | 261.92 |
2 | E-ACSL | 100 | 50.19 | 68.97 | 219.16 |
3 | RTC | 70 | 20.70 | 17.31 | 108.01 |
6.2 Scores for the Java track
 | Larva | | | jUnit\(^{RV}\) | | | JavaMOP | | | QEA | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Reference to benchmark description | Verdict | Mem (MB) | Ovhd (s) | Verdict | Mem (MB) | Ovhd (s) | Verdict | Mem (MB) | Ovhd (s) | Verdict | Mem (MB) | Ovhd (s) |
 | v-score | m-score | o-score | v-score | m-score | o-score | v-score | m-score | o-score | v-score | m-score | o-score |
Section 4.2.1 | F | 0.55 | 1.54 | F | 1.92 | 6.08 | F | 1.94 | 0.17 | F | 2.65 | 0.20 |
10 | 1.95 | 2.66 | 10 | 2.70 | 0.14 | 10 | 2.68 | 4.96 | 10 | 1.96 | 4.35 | |
Section 4.2.2 | F | 0.58 | 1.56 | F | 7.75 | 6.86 | F | 2.59 | 0.21 | F | 2.70 | 0.18 |
10 | 2.60 | 3.03 | 10 | 1.02 | 0.13 | 10 | 3.04 | 4.33 | 10 | 2.92 | 4.96 | |
Section 4.2.3 | F | 0.66 | 1.56 | F | 1.92 | 3.74 | F | 9.03 | 0.22 | F | 5.24 | 0.24 |
10 | 4.54 | 2.11 | 10 | 4.99 | 0.28 | 10 | 1.06 | 4.66 | 10 | 1.83 | 4.40 | |
Section 4.2.4 | F | 0.69 | 1.56 | F | 1.92 | 8.63 | F | 9.04 | 0.32 | F | 2.00 | 0.18 |
10 | 9.05 | 0.89 | 10 | 4.20 | 0.12 | 10 | 0.89 | 3.35 | 10 | 4.02 | 5.84 | |
Section 4.2.5 | F | 0.95 | 1.57 | F | 1.92 | 8.57 | F | 9.69 | 0.73 | F | 2.64 | 0.22 |
10 | 10.97 | 0.83 | 10 | 4.76 | 0.17 | 10 | 0.94 | 2.05 | 10 | 3.46 | 6.83 | |
Section 4.2.6 | T | 0.59 | 1.57 | T | 27.94 | 1.56 | T | 1.94 | 0.20 | T | 3.30 | 0.23 |
10 | 5.16 | 1.85 | 10 | 0.34 | 0.59 | 10 | 4.92 | 4.70 | 10 | 2.89 | 4.11 | |
Section 4.2.7 | T | 0.65 | 1.58 | T | 308.67 | 22.93 | T | 1.94 | 0.20 | T | 2.65 | 0.25 |
10 | 7.10 | 1.36 | 10 | 0.03 | 0.04 | 10 | 4.97 | 5.19 | 10 | 3.64 | 4.12 | |
Section 4.2.8 | T | 0.05 | 2173.26 | T | 244.27 | 49.94 | T | 32.24 | 26.27 | T | 5.85 | 25.99 |
10 | 162.39 | 0.29 | 10 | 0.19 | 2.06 | 10 | 1.46 | 3.92 | 10 | 8.06 | 3.97 | |
Section 4.2.9 | T | 0.64 | 1.55 | T | 291.57 | 24.79 | T | 1.94 | 0.20 | T | 4.59 | 0.23 |
10 | 5.16 | 2.08 | 10 | 0.04 | 0.04 | 10 | 5.55 | 4.98 | 10 | 2.34 | 4.34 | |
Section 4.2.10 | T | 1.13 | 1.56 | T | 680.19 | 100.57 | T | 2.58 | 0.24 | T | 110.33 | 1.26 |
10 | 7.76 | 2.45 | 10 | 0.03 | 0.02 | 10 | 7.35 | 7.45 | 10 | 0.17 | 1.40 | |
Section 4.2.11 | F | 0.10 | 48.06 | F | N/A | 2.98 | F | 5.17 | 0.79 | F | 4.53 | 2.04 |
10 | 40.93 | 0.56 | 10 | 0 | 1.59 | 10 | 4.41 | 5.99 | 10 | 5.04 | 2.32 | |
Section 4.2.12 | N/A | N/A | N/A | F | N/A | 0.51 | F | 5.81 | 2.23 | F | 8.41 | 3.24 |
0 | 0 | 0 | 10 | 0 | 7.21 | 10 | 5.91 | 1.65 | 10 | 4.09 | 1.14 | |
Section 4.2.13 | F | 0.10 | 35.78 | F | N/A | 0.36 | F | 7.10 | 25.22 | F | 5.20 | 25.33 |
10 | 3.87 | 4.37 | 10 | 0 | 9.63 | 10 | 2.38 | 0.14 | 10 | 3.25 | 0.14 | |
Section 4.2.14 | F | 0.56 | 1.57 | F | N/A | 1.61 | F | 2.58 | 0.18 | F | 3.23 | 0.22 |
10 | 2.58 | 3.57 | 10 | 0 | 0.55 | 10 | 3.57 | 4.94 | 10 | 2.86 | 3.95 | |
Section 4.2.15 | F | 0.03 | 2606.58 | F | N/A | 7.58 | F | 647.19 | 87.00 | F | 837.29 | 190.89 |
10 | 841.28 | 3.03 | 10 | 0 | 8.85 | 10 | 3.93 | 0.77 | 10 | 3.04 | 0.35 | |
Section 4.2.16 | F | 0.06 | 15393.22 | T | N/A | N/A | F | 1001.69 | 164.00 | F | 829.59 | 217.27 |
10 | 721.39 | 3.86 | −5 | 0 | 0 | 10 | 2.78 | 5.66 | 10 | 3.36 | 4.28 | |
Section 4.2.17 | T/O | 0 | N/A | T | 717.92 | 88.35 | T | 801.15 | 242.00 | T | 844.54 | 288.08 |
-5 | N/A | 0 | 10 | 3.64 | 5.98 | 10 | 3.26 | 2.18 | 10 | 3.10 | 1.83 | |
Section 4.2.18 | T/O | 0 | N/A | T | 697.42 | 113.63 | T | 795.49 | 237.00 | T | 819.92 | 889.63 |
-5 | N/A | 0 | 10 | 3.67 | 6.22 | 10 | 3.21 | 2.98 | 10 | 3.12 | 0.79 | |
Section 4.2.19 | T/O | 0 | N/A | T | N/A | N/A | F | 649.83 | 94.00 | F | 820.52 | 170.31 |
-5 | N/A | 0 | -5 | 0 | 0 | 10 | 5.58 | 6.44 | 10 | 4.42 | 3.56 | |
Section 4.2.20 | F | 0.16 | 27.22 | F | N/A | 3.68 | F | 58.27 | 3.16 | F | 23.07 | 0.62 |
10 | 135.98 | 1.08 | 10 | 0 | 1.22 | 10 | 2.53 | 1.42 | 10 | 6.39 | 7.20 | |
Section 4.2.21 | T | 0.29 | 70.12 | T | 86.04 | 8.8 | T | 267.45 | 5.64 | T | 142.74 | 5.29 |
10 | 160.62 | 2.18 | 10 | 4.06 | 2.30 | 10 | 1.31 | 3.59 | 10 | 2.45 | 3.82 | |
Section 4.2.22 | T | 0 | 6882.58 | T | 300.2 | 49.9 | T | 309.00 | 6.08 | T | 156.66 | 5.59 |
10 | 267.43 | 2.24 | 10 | 2.00 | 0.55 | 10 | 1.94 | 4.53 | 10 | 3.82 | 4.92 | |
Section 4.2.23 | N/A | N/A | N/A | F | N/A | 2.92 | F | 39.87 | 1.58 | F | 28.71 | 0.72 |
0 | 0 | 0 | 10 | 0 | 1.45 | 10 | 4.19 | 2.67 | 10 | 5.81 | 5.88 |
Rank | Team name | Correctness score | Overhead score | Memory score | Total score |
---|---|---|---|---|---|
1 | QEA | 230 | 84.50 | 82.01 | 396.51 |
1 | JavaMOP | 230 | 88.56 | 77.89 | 396.45 |
2 | jUnit\(^{RV}\) | 200 | 49.15 | 31.67 | 280.82 |
3 | Larva | 165 | 7.79 | 38.43 | 211.22 |
6.3 Scores for the offline track
 | RiTHM | | | MonPoly | | | StePr | | | QEA | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Reference to benchmark description | Verdict | Mem (MB) | Ovhd (s) | Verdict | Mem (MB) | Ovhd (s) | Verdict | Mem (MB) | Ovhd (s) | Verdict | Mem (MB) | Ovhd (s) |
 | v-score | m-score | o-score | v-score | m-score | o-score | v-score | m-score | o-score | v-score | m-score | o-score |
Section 4.3.1 | F | 993 | 0.60 | F | 13 | 0.13 | F | 29.53 | 0.90 | F | 4.535 | 0.24 |
10 | 0.03 | 1.12 | 10 | 2.31 | 5.32 | 10 | 1.02 | 0.75 | 10 | 6.64 | 2.81 | |
Section 4.3.2 | F | 993 | 0.60 | F | 1228 | 8.40 | F | 645.17 | 8.87 | F | 33.30 | 3.58 |
10 | 0.30 | 7.67 | 10 | 0.24 | 0.55 | 10 | 0.46 | 0.52 | 10 | 8.99 | 1.28 | |
Section 4.3.3 | F | 614 | 0.98 | F | 13 | 0.12 | F | 31.26 | 0.91 | F | 4.53 | 0.19 |
10 | 0.05 | 0.63 | 10 | 2.32 | 5.41 | 10 | 0.97 | 0.68 | 10 | 6.66 | 3.28 | |
Section 4.3.4 | F | 614 | 0.98 | F | 1696 | 15.80 | F | 622.30 | 21.10 | F | 517.07 | 12.22 |
10 | 2.83 | 8.41 | 10 | 1.02 | 0.52 | 10 | 2.79 | 0.39 | 10 | 3.36 | 0.68 | |
Section 4.3.5 | F | 628 | 0.99 | F | 544 | 5.09 | F | 274.29 | 7.77 | F | 32.94 | 3.71 |
10 | 0.43 | 6.29 | 10 | 0.49 | 1.24 | 10 | 0.97 | 0.80 | 10 | 8.11 | 1.68 | |
Section 4.3.6 | N/A | N/A | N/A | F | 36.00 | 5.95 | F | 645.43 | 41.73 | F | 5.19 | 0.26 |
0 | 0 | 0 | 10 | 1.25 | 0.41 | 10 | 0.07 | 0.06 | 10 | 8.68 | 9.53 | |
Section 4.3.7 | N/A | N/A | N/A | F | 20 | 1.33 | F | 255.48 | 96.00 | F | 4.53 | 0.25 |
0 | 0 | 0 | 10 | 1.82 | 1.56 | 10 | 0.14 | 0.02 | 10 | 8.04 | 8.42 | |
Section 4.3.8 | N/A | N/A | N/A | F | 370 | 33.51 | F | 706.26 | 21.41 | F | 7.75 | 0.29 |
0 | 0 | 0 | 10 | 0.20 | 0.08 | 10 | 0.11 | 0.13 | 10 | 9.69 | 9.79 | |
Section 4.3.9 | N/A | N/A | N/A | F | 73.00 | 1.53 | F | 457.34 | 3.67 | F | 552.08 | 2.58 |
0 | 0 | 0 | 10 | 7.74 | 4.98 | 10 | 1.24 | 2.07 | 10 | 1.02 | 2.95 | |
Section 4.3.10 | N/A | N/A | N/A | F | 16.00 | 330.80 | F | 721.22 | 947.00 | F | 5935.90 | 1537.30 |
0 | 0 | 0 | 10 | 9.76 | 6.39 | 10 | 0.22 | 2.23 | 10 | 0.03 | 1.38 | |
Section 4.3.11 | T | 14.27 | 5.40 | T | 13.00 | 5.05 | T | 634.99 | 10.84 | T | 127.62 | 4.51 |
10 | 4.48 | 2.65 | 10 | 4.92 | 2.84 | 10 | 0.10 | 1.32 | 10 | 0.50 | 3.18 | |
Section 4.3.12 | F | 14.27 | 0.90 | F | 13.00 | 0.80 | F | 333.01 | 2.93 | F | 30.33 | 0.80 |
10 | 3.83 | 2.80 | 10 | 4.20 | 3.17 | 10 | 0.16 | 0.86 | 10 | 1.80 | 3.17 | |
Section 4.3.13 | F | 14.27 | 7.20 | F | 13.00 | 1.04 | F | 501.91 | 3.20 | F | 30.86 | 1.05 |
10 | 3.86 | 0.59 | 10 | 4.24 | 4.08 | 10 | 0.11 | 1.32 | 10 | 1.79 | 4.01 | |
Section 4.3.14 | F | 14.28 | 2.39 | F | 13.00 | 2.0 | F | 173.37 | 2.91 | F | 30.33 | 0.63 |
10 | 3.77 | 1.46 | 10 | 4.14 | 1.75 | 10 | 0.31 | 1.20 | 10 | 1.78 | 5.59 | |
Section 4.3.15 | T | 15 | 0.04 | T | 17.00 | 353.00 | T | 112.56 | 2.20 | T | 29.73 | 0.59 |
10 | 3.97 | 9.15 | 10 | 3.50 | 0 | 10 | 0.53 | 0.18 | 10 | 2.00 | 0.67 | |
Section 4.3.16 | F | 75.00 | 40.93 | F | 13.00 | 432.00 | F | 631.58 | 8.46 | F | 250.30 | 2.03 |
10 | 1.39 | 0.38 | 10 | 8.03 | 0.036 | 10 | 0.17 | 1.85 | 10 | 0.42 | 7.73 | |
Section 4.3.17 | F | 14.04 | 0.16 | F | 24.00 | 3.06 | F | 36.01 | 1.19 | F | 295.52 | 1.36 |
10 | 4.94 | 7.67 | 10 | 2.89 | 0.40 | 10 | 1.93 | 1.03 | 10 | 0.23 | 0.90 | |
Section 4.3.18 | F | 39.79 | 5.18 | F | 2675.00 | 3405.00 | F | 622.66 | 9.96 | F | 234.78 | 2.04 |
10 | 8.01 | 2.47 | 10 | 0.12 | 0.375 | 10 | 0.51 | 1.28 | 10 | 1.36 | 6.25 | |
Section 4.3.19 | T | 14.00 | 5.14 | T | 13.00 | 26.21 | T | 494.39 | 9.17 | T | 239.60 | 3.54 |
10 | 4.62 | 3.12 | 10 | 4.98 | 0.61 | 10 | 0.13 | 1.75 | 10 | 0.27 | 4.53 |
Rank | Team name | Correctness score | Overhead score | Memory score | Total score |
---|---|---|---|---|---|
1 | QEA | 190 | 77.79 | 71.36 | 339.15 |
2 | MonPoly | 190 | 39.35 | 64.19 | 293.54 |
3 | RiTHM-2 | 140 | 54.40 | 42.52 | 236.91 |
4 | StePr | 190 | 18.46 | 11.93 | 220.40 |
7 Lessons learned and discussion
- The competition featured 8 distinct teams participating in the 3 tracks, resulting in 11 team entries across the tracks.
- The organisers designed an evaluation method that was peer-reviewed and validated by the participating teams before the beginning of the competition. The method builds upon the research efforts made in the runtime verification community when evaluating runtime verification prototypes.
- Choices needed to be made regarding the classification criteria for tracks. Different track structures would have been possible: by domain of the monitored system, by programming language of the monitor, by categories of specifications, or even a track on the elegance of specifications. The organisers arranged the tracks of the competition according to the monitored system: either its programming language (in the case of monitoring software) or traces. This reflects the fact that inline monitoring has so far been the most popular RV setting when monitoring software.
- Significant delays were observed regarding benchmark submission. These delays were due to the substantial effort required to convey the exact semantics of the submitted specifications. Indeed, as can be expected, some of the specifications could be interpreted differently by different participants. Moreover, as the submitting teams mainly provided specifications in the input languages of their own tools, the other participants also had to formalize these specifications in the specification languages of their own tools.
- Delays were also observed during the phases where the organizers had to prepare the next phases. For instance, after the benchmark submission phase, a sanity check had to be performed on the submissions of some participants. Several iterations were needed to unify the submissions in spite of the provided format, which was consequently not constraining enough or ambiguous on some aspects. Building on this observation, the next edition of the competition defined standard formats, for instance for traces [47].
A method based on the `jstat` tool (standing for Java Virtual Machine Statistics Monitoring Tool) was proposed. This method sampled memory usage every 10 ms and dumped the output into a file, which was then parsed after the benchmark had run to compute memory utilization. The script for running a program and recording its memory utilization is given below:
- Every tool comes with its own logic; each logic is mathematically defined, but their semantics differ.
- There is no unified specification language that the organizers could use to provide the specifications.
- Even if such a specification language existed, some tools would handle only a fragment of it; one would then have to provide specifications in each fragment in such a way that the competition remains fair.