1 Introduction
Our approach is implemented in ProM [39] (HybridILPMiner package) and available in RapidProM [5, 26]. We have compared our technique with two state-of-the-art filtering techniques [9, 15]. We additionally validated the applicability of our approach on two real-life event logs [11, 19]. Our experiments confirm the effectiveness of the proposed approach, both in terms of resulting model quality and computational complexity.

2 Related work
2.1 Process discovery
2.2 Filtering infrequent behaviour
3 Background
3.1 Bags, sequences and vectors
3.2 Event logs and workflow nets
Case-id | Activity | Resource | Time-stamp |
---|---|---|---|
... | ... | ... | ... |
1 | Register request (a) | John | 2015-05-08:08.45 |
1 | Examine thoroughly (b) | Lucy | 2015-05-08:09.13 |
2 | Register request (a) | John | 2015-05-08:09.14 |
2 | Check ticket (d) | Pete | 2015-05-08:10.11 |
1 | Check ticket (d) | Pete | 2015-05-08:10.28 |
2 | Examine casually (c) | Rob | 2015-05-08:10.43 |
1 | Decide (e) | Rob | 2015-05-08:11.14 |
1 | Reject request (h) | Rob | 2015-05-08:11.35 |
... | ... | ... | ... |
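For reference, an event log like the one above is usually abstracted into a multiset of traces by grouping events per case and ordering them by timestamp; the activity letters in parentheses then form the traces. The following sketch illustrates this abstraction (the helper name `log_to_traces` and the dictionary keys are ours, mirroring the columns above; it is an illustration, not the authors' implementation).

```python
from collections import Counter
from operator import itemgetter

def log_to_traces(events):
    """`events` is a list of dicts with keys 'case', 'activity' and 'time'
    (mirroring the Case-id, Activity and Time-stamp columns above).
    Returns the event log as a multiset (Counter) of traces."""
    by_case = {}
    for event in events:
        by_case.setdefault(event["case"], []).append(event)
    traces = []
    for case_events in by_case.values():
        case_events.sort(key=itemgetter("time"))   # order events by timestamp
        traces.append(tuple(e["activity"] for e in case_events))
    return Counter(traces)

# The visible fragment of the table yields, e.g., the (partial) trace
# ('a', 'b', 'd', 'e', 'h') for case 1.
```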
3.3 Discovering Petri net places using integer linear programming
4 Discovering relaxed sound workflow nets
4.1 Discovering multiple places based on causal relations
4.2 Discovering workflow nets
5 Dealing with infrequent behaviour
5.1 The impact of infrequent exceptional behaviour
\(m + \mathbf {x}(a_s) + \mathbf {x}(a) + \mathbf {x}(b) - \mathbf {y}(a_s) - \mathbf {y}(a) - \mathbf {y}(b) - \mathbf {y}(c) \ge 0\)

\(m + \mathbf {x}(a_s) + \mathbf {x}(a) + \mathbf {x}(b) + \mathbf {x}(c) - \mathbf {y}(a_s) - \mathbf {y}(a) - \mathbf {y}(b) - \mathbf {y}(c) - \mathbf {y}(d) \ge 0\)

\(\vdots \)

\(m + \mathbf {x}(a_s) + \mathbf {x}(a) + \mathbf {x}(b) + \mathbf {x}(c) + \mathbf {x}(d) + \mathbf {x}(e) + \mathbf {x}(g) - \mathbf {y}(a_s) - \mathbf {y}(a) - \mathbf {y}(b) - \mathbf {y}(c) - \mathbf {y}(d) - \mathbf {y}(e) - \mathbf {y}(g) - \mathbf {y}(a_f) \ge 0\)

\(m + \mathbf {x}(a_s) + \mathbf {x}(a) + \mathbf {x}(b) + \mathbf {x}(c) + \mathbf {x}(d) + \mathbf {x}(e) + \mathbf {x}(g) + \mathbf {x}(a_f) - \mathbf {y}(a_s) - \mathbf {y}(a) - \mathbf {y}(b) - \mathbf {y}(c) - \mathbf {y}(d) - \mathbf {y}(e) - \mathbf {y}(g) - \mathbf {y}(a_f) = 0\)
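To make the shape of these constraints concrete, the sketch below (our own helper, not the authors' implementation; the function name `constraints_for_trace` and the coefficient encoding are assumptions) enumerates the prefixes of a single trace and emits, per prefix, the pattern shown above: the initial-marking variable \(m\), plus the \(\mathbf {x}\)-entries of the prefix without its last event, minus the \(\mathbf {y}\)-entries of the whole prefix, required to be non-negative, with equality once the trace has completed with \(a_f\).

```python
from collections import Counter

def constraints_for_trace(trace):
    """Sketch: one '>= 0' constraint per non-empty prefix of `trace`, plus the
    '= 0' constraint for the completed trace.  A constraint is returned as
    (x_coeffs, y_coeffs, sense) and reads:  m + x*x_coeffs - y*y_coeffs  sense  0.
    `trace` is assumed to already contain the artificial start/end events."""
    constraints = []
    for k in range(1, len(trace) + 1):
        prefix = trace[:k]
        x_coeffs = Counter(prefix[:-1])   # activities fired before the last event
        y_coeffs = Counter(prefix)        # all activities of the prefix
        constraints.append((x_coeffs, y_coeffs, ">="))
    full = Counter(trace)                 # the place must be empty at the end
    constraints.append((full, full, "="))
    return constraints

# The constraints displayed above correspond to prefixes of this trace:
exceptional_trace = ["a_s", "a", "b", "c", "d", "e", "g", "a_f"]
for x_c, y_c, sense in constraints_for_trace(exceptional_trace):
    print(f"m + {dict(x_c)} - {dict(y_c)} {sense} 0")
```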
5.2 Sequence encoding graphs
\(\sigma \in \overline{\pi (L'_1)}\) | \(\mathbf {\phi }(\sigma )^{\intercal }\), i.e. \((m, \mathbf {x}(a_s), \mathbf {x}(a), \ldots , \mathbf {y}(h), \mathbf {y}(a_f))\) | \(\mathbf {\phi }(\sigma )\) (shorthand) | \(\overline{\pi (L'_1)}(\sigma )\) |
---|---|---|---|
\(\epsilon \) | \((1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)\) | \(([],\bot )\) | 56 |
\(\langle a_s \rangle \) | \((1,0,0,0,0,0,0,0,0,0,0,-1,0,0,0,0,0,0,0,0,0)\) | \(([],a_s)\) | 56 |
\(\langle a_s,a \rangle \) | \((1,1,0,0,0,0,0,0,0,0,0,-1,-1,0,0,0,0,0,0,0,0)\) | \(([a_s],a)\) | 56 |
\(\langle a_s,a,b \rangle \) | \((1,1,1,0,0,0,0,0,0,0,0,-1,-1,-1,0,0,0,0,0,0,0)\) | \(([a_s,a],b)\) | 22 |
\(\langle a_s,a,c \rangle \) | \((1,1,1,0,0,0,0,0,0,0,0,-1,-1,0,-1,0,0,0,0,0,0)\) | \(([a_s,a],c)\) | 12 |
\(\langle a_s,a,d \rangle \) | \((1,1,1,0,0,0,0,0,0,0,0,-1,-1,0,0,-1,0,0,0,0,0)\) | \(([a_s,a],d)\) | 22 |
\(\langle a_s,a,b,c \rangle \) | \((1,1,1,1,0,0,0,0,0,0,0,-1,-1,-1,-1,0,0,0,0,0,0)\) | \(([a_s,a,b],c)\) | 1 |
\(\langle a_s,a,b,d \rangle \) | \((1,1,1,1,0,0,0,0,0,0,0,-1,-1,-1,0,-1,0,0,0,0,0)\) | \(([a_s,a,b],d)\) | 21 |
\(\langle a_s,a,c,d \rangle \) | \((1,1,1,0,1,0,0,0,0,0,0,-1,-1,0,-1,-1,0,0,0,0,0)\) | \(([a_s,a,c],d)\) | 12 |
\(\langle a_s,a,d,c \rangle \) | \((1,1,1,0,0,1,0,0,0,0,0,-1,-1,0,-1,-1,0,0,0,0,0)\) | \(([a_s,a,d],c)\) | 22 |
\(\langle a_s,a,b,c,d \rangle \) | \((1,1,1,1,1,0,0,0,0,0,0,-1,-1,-1,-1,-1,0,0,0,0,0)\) | \(([a_s,a,b,c],d)\) | 1 |
\(\langle a_s,a,b,d,e \rangle \) | \((1,1,1,1,0,1,0,0,0,0,0,-1,-1,-1,0,-1,-1,0,0,0,0)\) | \(([a_s,a,b,d],e)\) | 21 |
\(\langle a_s,a,c,d,e \rangle \) | \((1,1,1,0,1,1,0,0,0,0,0,-1,-1,0,-1,-1,-1,0,0,0,0)\) | \(([a_s,a,c,d],e)\) | 12 |
\(\langle a_s,a,d,c,e \rangle \) | \((1,1,1,0,1,1,0,0,0,0,0,-1,-1,0,-1,-1,-1,0,0,0,0)\) | \(([a_s,a,c,d],e)\) | 22 |
\(\langle a_s,a,b,c,d,e \rangle \) | \((1,1,1,1,1,1,0,0,0,0,0,-1,-1,-1,-1,-1,-1,0,0,0,0)\) | \(([a_s,a,b,c,d],e)\) | 1 |
\(\langle a_s,a,b,d,e,f \rangle \) | \((1,1,1,1,0,1,1,0,0,0,0,-1,-1,-1,0,-1,-1,-1,0,0,0)\) | \(([a_s,a,b,d,e],f)\) | 11 |
\(\langle a_s,a,b,d,e,g \rangle \) | \((1,1,1,1,0,1,1,0,0,0,0,-1,-1,-1,0,-1,-1,0,-1,0,0)\) | \(([a_s,a,b,d,e],g)\) | 10 |
\(\langle a_s,a,c,d,e,f \rangle \) | \((1,1,1,0,1,1,1,0,0,0,0,-1,-1,0,-1,-1,-1,-1,0,0,0)\) | \(([a_s,a,c,d,e],f)\) | 12 |
\(\langle a_s,a,d,c,e,f \rangle \) | \((1,1,1,0,1,1,1,0,0,0,0,-1,-1,0,-1,-1,-1,-1,0,0,0)\) | \(([a_s,a,c,d,e],f)\) | 13 |
\(\langle a_s,a,d,c,e,h \rangle \) | \((1,1,1,0,1,1,1,0,0,0,0,-1,-1,0,-1,-1,-1,0,0,-1,0)\) | \(([a_s,a,c,d,e],h)\) | 9 |
\(\langle a_s,a,b,c,d,e,g \rangle \) | \((1,1,1,1,1,1,1,0,0,0,0,-1,-1,-1,-1,-1,-1,0,-1,0,0)\) | \(([a_s,a,b,c,d,e],g)\) | 1 |
\(\langle a_s,a,b,d,e,f,c \rangle \) | \((1,1,1,1,0,1,1,1,0,0,0,-1,-1,-1,-1,-1,-1,-1,0,0,0)\) | \(([a_s,a,b,d,e,f],c)\) | 11 |
\(\langle a_s,a,b,d,e,g,a_f \rangle \) | \((1,1,1,1,0,1,1,0,1,0,0,-1,-1,-1,0,-1,-1,0,-1,0,-1)\) | \(([a_s,a,b,d,e,g],a_f)\) | 10 |
\(\langle a_s,a,c,d,e,f,d \rangle \) | \((1,1,1,0,1,1,1,1,0,0,0,-1,-1,0,-1,-2,-1,-1,0,0,0)\) | \(([a_s,a,c,d,e,f],d)\) | 12 |
\(\langle a_s,a,d,c,e,f,b \rangle \) | \((1,1,1,0,1,1,1,1,0,0,0,-1,-1,-1,-1,-1,-1,-1,0,0,0)\) | \(([a_s,a,c,d,e,f],b)\) | 13 |
\(\langle a_s,a,d,c,e,h,a_f \rangle \) | \((1,1,1,0,1,1,1,0,0,1,0,-1,-1,0,-1,-1,-1,0,0,-1,-1)\) | \(([a_s,a,c,d,e,h],a_f)\) | 9 |
\(\langle a_s,a,b,c,d,e,g,a_f \rangle \) | \((1,1,1,1,1,1,1,0,1,0,0,-1,-1,-1,-1,-1,-1,0,-1,0,-1)\) | \(([a_s,a,b,c,d,e,g],a_f)\) | 1 |
\(\langle a_s,a,b,d,e,f,c,d \rangle \) | \((1,1,1,1,1,1,1,1,0,0,0,-1,-1,-1,-1,-2,-1,-1,0,0,0)\) | \(([a_s,a,b,c,d,e,f],d)\) | 11 |
\(\langle a_s,a,c,d,e,f,d,b \rangle \) | \((1,1,1,0,1,2,1,1,0,0,0,-1,-1,-1,-1,-2,-1,-1,0,0,0)\) | \(([a_s,a,c,d^2,e,f],b)\) | 12 |
\(\langle a_s,a,d,c,e,f,b,d \rangle \) | \((1,1,1,1,1,1,1,1,0,0,0,-1,-1,-1,-1,-2,-1,-1,0,0,0)\) | \(([a_s,a,b,c,d,e,f],d)\) | 13 |
\(\langle a_s,a,b,d,e,f,c,d,e \rangle \) | \((1,1,1,1,1,2,1,1,0,0,0,-1,-1,-1,-1,-2,-2,-1,0,0,0)\) | \(([a_s,a,b,c,d^2,e,f],e)\) | 11 |
\(\langle a_s,a,c,d,e,f,d,b,e \rangle \) | \((1,1,1,1,1,2,1,1,0,0,0,-1,-1,-1,-1,-2,-2,-1,0,0,0)\) | \(([a_s,a,b,c,d^2,e,f],e)\) | 12 |
\(\langle a_s,a,d,c,e,f,b,d,e \rangle \) | \((1,1,1,1,1,2,1,1,0,0,0,-1,-1,-1,-1,-2,-2,-1,0,0,0)\) | \(([a_s,a,b,c,d^2,e,f],e)\) | 13 |
\(\langle a_s,a,b,d,e,f,c,d,e,g \rangle \) | \((1,1,1,1,1,2,2,1,0,0,0,-1,-1,-1,-1,-2,-2,-1,-1,0,0)\) | \(([a_s,a,b,c,d^2,e^2,f],g)\) | 11 |
\(\langle a_s,a,c,d,e,f,d,b,e,g \rangle \) | \((1,1,1,1,1,2,2,1,0,0,0,-1,-1,-1,-1,-2,-2,-1,-1,0,0)\) | \(([a_s,a,b,c,d^2,e^2,f],g)\) | 12 |
\(\langle a_s,a,d,c,e,f,b,d,e,h \rangle \) | \((1,1,1,1,1,2,2,1,0,0,0,-1,-1,-1,-1,-2,-2,-1,0,-1,0)\) | \(([a_s,a,b,c,d^2,e^2,f],h)\) | 13 |
\(\langle a_s,a,b,d,e,f,c,d,e,g,a_f \rangle \) | \((1,1,1,1,1,2,2,1,1,0,0,-1,-1,-1,-1,-2,-2,-1,-1,0,-1)\) | \(([a_s,a,b,c,d^2,e^2,f,g],a_f)\) | 11 |
\(\langle a_s,a,c,d,e,f,d,b,e,g,a_f \rangle \) | \((1,1,1,1,1,2,2,1,1,0,0,-1,-1,-1,-1,-2,-2,-1,-1,0,-1)\) | \(([a_s,a,b,c,d^2,e^2,f,g],a_f)\) | 12 |
\(\langle a_s,a,d,c,e,f,b,d,e,h,a_f \rangle \) | \((1,1,1,1,1,2,2,1,0,1,0,-1,-1,-1,-1,-2,-2,-1,0,-1,-1)\) | \(([a_s,a,b,c,d^2,e^2,f,h],a_f)\) | 13 |
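The table above suggests how a sequence encoding graph can be assembled: each prefix is mapped to the pair (bag of all events except the last one, last event), the empty prefix maps to \(([],\bot )\), and an arc between two encodings is weighted by how often the extended prefix occurs. The sketch below follows that reading; the helper names and data layout are ours, not the authors' implementation.

```python
from collections import defaultdict

def seq_encoding(prefix):
    """Shorthand used in the table: (sorted bag of all but the last event, last event);
    the empty prefix is encoded as ((), None), playing the role of ([], ⊥)."""
    if not prefix:
        return ((), None)
    return (tuple(sorted(prefix[:-1])), prefix[-1])

def sequence_encoding_graph(weighted_traces):
    """Sketch: `weighted_traces` maps a trace (tuple of activities, already padded
    with the artificial start/end events) to its frequency in the log.  Returns the
    arcs of the sequence encoding graph as {(source, target): weight}."""
    arcs = defaultdict(int)
    for trace, freq in weighted_traces.items():
        for k in range(len(trace)):
            src = seq_encoding(trace[:k])
            dst = seq_encoding(trace[:k + 1])
            arcs[(src, dst)] += freq
    return arcs

# e.g. the arc from ([a_s, a], b) to ([a_s, a, b], d) accumulates weight 21,
# matching the count of the prefix <a_s, a, b, d> in the table above.
```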
5.3 Filtering

We present Sequence Encoding Filtering - Breadth First Search (SEF-BFS), an algorithm that traverses the sequence encoding graph and concurrently constructs a set of ILP constraints. The algorithm needs, as input, a function that is able to determine, given a vertex in the sequence encoding graph, which portion of its adjacent vertices remains in the graph and which vertices are removed.

It is trivial to prove, by means of induction on the length of a sequence encoding's corresponding sequence, that a sequence encoding graph is acyclic. Hence, termination of the SEF-BFS algorithm is guaranteed.
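As a concrete illustration of this traversal (a sketch under our own naming, not the authors' implementation), the following fragment explores the graph breadth-first with a queue Q and collects the surviving vertices in C. The filter \(\kappa ^{\alpha }_{\max }\) shown keeps an outgoing arc only if its value is at least \((1-\alpha )\) times the vertex's largest outgoing arc value, which matches the walkthrough below.

```python
from collections import deque

def kappa_max(alpha):
    """Filter: keep a child iff its arc value is at least (1 - alpha) times
    the maximum outgoing arc value of the current vertex."""
    def keep(outgoing):                        # outgoing: {child vertex: arc value}
        if not outgoing:
            return set()
        threshold = (1 - alpha) * max(outgoing.values())
        return {child for child, value in outgoing.items() if value >= threshold}
    return keep

def sef_bfs(graph, root, keep):
    """Breadth-first traversal of the (acyclic) sequence encoding graph.
    `graph` maps a vertex to {child: arc value}.  Returns the set C of vertices
    whose constraints are kept; everything below a filtered arc is never visited."""
    C = {root}
    Q = deque([root])
    while Q:
        vertex = Q.popleft()
        for child in keep(graph.get(vertex, {})):
            if child not in C:                 # each vertex is enqueued at most once
                C.add(child)
                Q.append(child)
    return C

# Usage on the running example: sef_bfs(graph, ((), None), kappa_max(0.75))
```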
To exemplify the SEF-BFS algorithm, reconsider Fig. 3. Assume we use \(\kappa ^{0.75}_{\max }\). Vertex \(([],\bot )\) is initially present in Q and is analysed first. Since \(([],a_s)\) is the only child of \(([],\bot )\), it is added to Q. Vertex \(([],\bot )\) is removed from the queue and is never inserted into the queue again due to the acyclic property of the graph. Similarly, since \(([a_s],a)\) is the only child of \(([], a_s)\), it is added to Q. All children of \(([a_s],a)\), i.e. \(([a_s,a],b), ([a_s,a],c)\) and \(([a_s,a],d)\), are added to the queue, since the maximum corresponding arc value is 22 and \((1 - 0.75) \cdot 22 = 5.5\), which is smaller than the lowest arc value 12. When analysing \(([a_s,a],b)\) we observe a maximum outgoing arc with value 21, leading to vertex \(([a_s,a,b],d)\), which is enqueued in Q. Since \((1-0.75)\cdot 21=5.25\), the algorithm does not enqueue \(([a_s,a,b],c)\). Note that the whole path of vertices from \(([a_s,a,b],c)\) to \(([a_s,a,b,c,d,e,g], a_f)\) is never analysed and is stripped from the constraint body, i.e. these vertices are never inserted into C.

6 Evaluation
6.1 Model quality
6.2 Computation time
Using RapidMiner, we repeated experiments similar to those performed for model quality and measured CPU execution time for the three techniques. However, we only used threshold values 0, 0.25, 0.75 and 1.
6.3 Application to real-life event logs
For the Road Fines event log (figures on the left-hand side of Fig. 7) we observe that replay-fitness is around 0.46 whereas precision is around 0.4 for \(\alpha \)-values from 0 to 0.5. The number of arcs in the models for these \(\alpha \)-values remains constant (as do the numbers of places and transitions), suggesting that the discovered models are the same. After this, replay-fitness increases further to around 0.8 and reaches 1 for an \(\alpha \)-level of 1. Interestingly, precision shows a slight increase for \(\alpha \)-levels between 0.5 and 0.75, after which it drops slightly below its initial value. In this case, an \(\alpha \)-level between 0.5 and 0.75 seems most appropriate in terms of replay-fitness, precision and simplicity.

For the Sepsis event log (figures on the right-hand side of Fig. 7) we observe that replay-fitness and precision roughly behave as each other's inverse, i.e. replay-fitness increases whereas precision decreases for increasing \(\alpha \)-levels. We moreover observe that the number of arcs within the process models steadily increases for increasing \(\alpha \)-levels. In this case, an \(\alpha \)-level between 0.1 and 0.4 seems most appropriate in terms of replay-fitness, precision and simplicity.

For the Road Fines event log, solving all ILP problems takes roughly 5 s. For the Sepsis event log, solving all ILP problems takes less than 1 s.