1 Introduction
Case | Activity | Start_time | End_time | Case | Activity | Start_time | End_time |
---|---|---|---|---|---|---|---|
1 | A | 00:20 | 00:22 | 4 | A | 03:06 | 03:10 |
1 | B | 02:04 | 02:08 | 4 | B | 05:04 | 05:09 |
1 | E | 02:32 | 02:32 | 4 | E | 07:26 | 07:29 |
2 | A | 02:15 | 02:20 | 5 | A | 03:40 | 03:44 |
2 | D | 03:14 | 03:19 | 5 | B | 05:59 | 06:06 |
2 | E | 05:06 | 05:07 | 5 | E | 07:49 | 07:52 |
3 | A | 02:27 | 02:29 | 6 | A | 04:18 | 04:20 |
3 | D | 04:17 | 04:20 | 6 | C | 07:08 | 07:12 |
3 | E | 06:51 | 06:53 | 6 | E | 09:05 | 09:07 |
2 Background
2.1 Process mining
2.2 Relational algebra
- Selection\(\sigma _\phi R = \{e | e \in R, \phi (e) \}\), where \(\phi (e)\) is derived from \(\phi \) by replacing each attribute name a by its value in tuple e: \(e_a\). The schema of \(\sigma _\phi R\) is r.
- Projection\(\pi _{a,b,\ldots } R = \{\{a\mapsto e_a,b\mapsto e_b,\ldots \} | e \in R \}\). The schema of \(\pi _{a,b,\ldots } R\) is r restricted to the attributes with names \(a, b, \ldots \)
- Renaming\(\rho _{a/b} R = R\). The schema of \(\rho _{a/b} R\) is derived from r by replacing the name of attribute a by b, while the set of tuples in R does not change. In the remainder of this paper, we will also use \(\rho _x R\) to represent prefixing all attribute names of R with x.
\(\sigma _{case=1} Log\) | \(\pi _{case} Log\) | \(\rho _{case/id} Log\) | ||||||
---|---|---|---|---|---|---|---|---|
Case | Activity | Start_time | End_time | Case | Id | Activity | Start_time | End_time |
1 | A | 00:20 | 00:22 | 1 | 1 | A | 00:20 | 00:22 |
1 | B | 02:04 | 02:08 | 2 | 1 | B | 02:04 | 02:08 |
1 | E | 02:32 | 02:32 | 3 | 1 | E | 02:32 | 02:32 |
\(\ldots \) | \(\ldots \) | \(\ldots \) | \(\ldots \) | \(\ldots \) |
2.3 Query optimization
3 Relational algebra for process mining
3.1 Directly follows operator
\(\downarrow \) Case | \(\downarrow \) Activity | \(\downarrow \) Start_time | \(\downarrow \) End_time | \(\uparrow \) Case | \(\uparrow \) Activity | \(\uparrow \) Start_time | \(\uparrow \) End_time |
---|---|---|---|---|---|---|---|
1 | A | 00:20 | 00:22 | 1 | B | 02:04 | 02:08 |
1 | B | 02:04 | 02:08 | 1 | E | 02:32 | 02:32 |
2 | A | 02:15 | 02:20 | 2 | D | 03:14 | 03:19 |
2 | D | 03:14 | 03:19 | 2 | E | 05:06 | 05:07 |
3 | A | 02:27 | 02:29 | 3 | D | 04:17 | 04:20 |
3 | D | 04:17 | 04:20 | 3 | E | 06:51 | 06:53 |
4 | A | 03:06 | 03:10 | 4 | B | 05:04 | 05:09 |
4 | B | 05:04 | 05:09 | 4 | E | 07:26 | 07:29 |
5 | A | 03:40 | 03:44 | 5 | B | 05:59 | 06:06 |
5 | B | 05:59 | 06:06 | 5 | E | 07:49 | 07:52 |
6 | A | 04:18 | 04:20 | 6 | C | 07:08 | 07:12 |
6 | C | 07:08 | 07:12 | 6 | E | 09:05 | 09:07 |
- The activities in which the amount of a loan is changed:\(\pi _{\uparrow activity} \sigma _{\uparrow amount \ne \downarrow amount} >_{case, end\_time} Log\)This query first extracts the directly follows relation from a log. The resulting table contains a directly follows relation with, among others, columns for the preceeding (\(\downarrow activity\)) and the succeeding (\(\uparrow activity\)) activity, as well as the loan amount that is associated with the preceeding (\(\downarrow amount\)) and the succeeding (\(\uparrow amount\)) activity. Selecting the rows in which the succeeding amount differs from the preceeding amount identifies the activities that change the amount. Subsequently, we can project the activities for which this is the case.
- The resources that ever changed the amount of a loan:\(\pi _{\uparrow resource} \sigma _{\uparrow amount \ne \downarrow amount} >_{case, end\_time} Log\)This is a alternative to the previous query, in which we project the resource that changes the amount rather than the activity in which the amount is changed.
- The two activities that precede a rejection:\(\sigma _{\uparrow \uparrow activity = reject} (>_{\uparrow case, \uparrow end\_time} >_{case, end\_time} Log)\)This query first extracts the directly follows relation as usual, it then applies the directly follows relation again to the results, such that we do not only get the activities that succeed (\(\uparrow activity\)) another activity, but also the activities that succeed (\(\uparrow \uparrow activity\)) that activity.
3.2 Directly follows query optimization
4 Theoretical execution cost
4.1 Cost of computing the directly follows relation
Execution strategy | Order of costs (disk blocks) |
---|---|
Classical process mining | |
With intermediate storage | \(3 \cdot B\) |
With database connection | B |
With database operator | |
Native operator | B |
Composite operator | B up to \(B^3\) |
4.2 The effect of query optimization
Execution sequence | In memory (blocks) | On disk (blocks) |
---|---|---|
\(\sigma > Log\) | B | \(B\cdot \left( \frac{N}{V}\right) ^2\) |
\(> \sigma Log\) | \(B\cdot S\) | \(Q\cdot B \cdot \left( \frac{N}{V}\right) ^2\) |
5 Practical execution time performance
F
OLLOWS(R) returns the directly follows relation of the table.Log | Cases | Events | Event types | Performance (ms) classical mining intermediate storage\(^{\text{ a }}\) | Database connection | Database operator native operator | Composite operator |
---|---|---|---|---|---|---|---|
[4] | 1143 | 150,291 | 360 | 64 | 67 | 206 | \(919\times 10^3\) |
[15] | 13,087 | 262,200 | 36 | 72 | 64 | 95 | \(41\times 10^3\) |
[16] | 42,134 | 65,533 | 13 | 20 | 23 | 35 | \(2\times 10^3\) |
[17] | 46,616 | 466,737 | 39 | 113 | 71 | 139 | \(29\times 10^3\) |
[18] | 34,015 | 59,083 | 254 | 15 | 16 | 30 | \(13\times 10^3\) |
[19] | 31,509 | 561,671 | 26 | 125 | 171 | 169 | \(25\times 10^3\) |
- Classical mining with intermediate storage is most suitable for once-off process mining, in which the data that must be extracted from the database has been carefully selected beforehand, because the extraction of the data from the database is labor intensive and often requires administrator privileges.
- Classical mining with a database connection is most suitable for continuous process mining, in which a dedicated database table exists that contains the event log and that is continuously being updated. It cannot be used for ad-hoc process mining from a database, because it only works when a dedicated database table with an event log exists. Examples of such an approach are the DB-XES approach [21] and the approach that is proposed by the professional process mining tool Celonis [22].
- Process mining with a native directly follows operator is most suitable for exploratory process mining, in which no dedicated table exists that contains the event log, or in which data from different tables must be combined before process mining takes place. It allows for fast ad-hoc process mining even when a database does not contain a dedicated table with an event log. An example of such an approach is the Event Cube approach [23].
Log | Nodes | Edges | Performance (ms) classical mining intermediate storage | Database connection | Database operator native operator | Composite operator |
---|---|---|---|---|---|---|
[4] | 648 | \(149\times 10^3\) | 237 | 238 | 280 | 291 |
[15] | 39 | \(249\times 10^3\) | 12 | 12 | 13 | 26 |
[16] | 33 | \(58\times 10^3\) | 12 | 10 | 11 | 1 |
[17] | 45 | \(420\times 10^3\) | 5 | 3 | 1 | 2 |
[18] | 286 | \(58\times 10^3\) | 29 | 31 | 26 | 27 |
[19] | 43 | \(530\times 10^3\) | 4 | 4 | 2 | 1 |
6 Alternatives for querying the directly follows relation
Log | Cases | Events | Performance native operator (ms) | Performance sorted operator (ms) | #Relations | #Relations sorted operator |
---|---|---|---|---|---|---|
[4] | 1143 | 150,291 | 141 | 141 | 16,546 | 372 |
[15] | 13,087 | 262,200 | 172 | 187 | 344 | 263 |
[16] | 42,134 | 65,533 | 16 | 47 | 90 | 90 |
[17] | 46,616 | 466,737 | 203 | 297 | 890 | 779 |
[18] | 34,015 | 59,083 | 16 | 47 | 7765 | 1042 |
[19] | 31,509 | 561,671 | 250 | 343 | 449 | 278 |