1 Introduction
2 Business process model features
-
N is the set of nodes;
-
E⊆N×N is the set of edges; and
-
\(\lambda: N \rightarrow \mathcal{L}\) is a function that maps nodes to labels.
-
A start feature is a node n∈N that has an empty pre-set;
-
A stop feature is a node n∈N that has an empty post-set;
-
A sequence feature of size s is a list of nodes [n 1,n 2,n 3,…,n s ]⊆N, such that (n 1,n 2)∈E,(n 2,n 3)∈E,…,(n s−1,n s )∈E, for s≥2;
-
A split feature of size s is a split node n and a set of nodes {n 1,n 2,…,n s−1}⊆N, such that (n,n 1)∈E,(n,n 2)∈E,…,(n,n s−1)∈E, for s≥3;
-
A join feature of size s is a join node n and a set of nodes {n 1,n 2,…,n s−1}⊆N, such that (n 1,n)∈E,(n 2,n)∈E,…,(n s−1,n)∈E, for s≥3.
3 Feature similarity, matching and indexing
3.1 Feature similarity
3.2 Feature matching
-
their label features are similar to a high degree, i.e., lsim(n,m)≥lcutoffhigh;
-
their role features are similar, and their label features are similar to a medium degree, i.e., rdsim(n,m)≥rcutoff and lsim(n,m)≥lcutoffmed.
3.3 Feature indexing
4 Feature-based similarity estimation
-
relevant to G q if and only if ESim(G q ,G)≥ratior
-
potentially relevant to G q if and only if ratior>ESim(G q ,G)>ratiop
-
irrelevant to G q if and only if ratiop≥ESim(G q ,G)
5 The improved greedy algorithm for process similarity search
5.1 The greedy algorithm for process similarity search
5.2 Improvements
5.2.1 Selecting only the top-k similar node pairs
-
TOP(k,n,N,sim)⊆N
-
|TOP(k,n,N,sim)|=min(k,|N|)
-
∀p∈TOP(k,n,N,sim), \(\not\exists o\in N/\mathit{TOP}(k,n,N,\mathit{sim})\), such that sim(n,o)>sim(n,p).
5.2.2 Incrementally computing the graph similarity
5.2.3 Pre-selecting similar node pairs
5.3 The improved greedy algorithm for process similarity search
6 Ranking
-
Gr is the sequence that consists of all models from Gs that are relevant to G q , such that for each Gr i , Gr j from Gr holds: if i<j then ESim(Gr i ,G q )≥ESim(Gr j ,G q ); and
-
Gp is the sequence that consists of all models G from Gs that are potentially relevant to G q and for which GSim(G,G q )>cutoff, such that for each Gp i ,Gp j from Gp holds: if i<j then GSim(Gp i ,G q )≥GSim(Gp j ,G q ).
7 Implementation
8 Evaluation
8.1 Homogeneous evaluation
8.1.1 Evaluation setup
8.1.2 Evaluation results
Feature (n) | Occurrences | Matches | Rel | PoR | Ir | R-Prec |
---|---|---|---|---|---|---|
Previous Work [7] | – | – | 0 | 100 | 0 | 0.84 |
1: Node(1) | 374 | 581 | 5.5 | 10.9 | 83.6 | 0.84 |
2: 1+Seq(2) | +267 | +197 | 8.1 | 8 | 83.9 | 0.83 |
3: 2+Seq(3) | +175 | +96 | 7.8 | 10.1 | 82.1 | 0.83 |
4: 2+Split(3) | +87 | +93 | 7.8 | 10.1 | 82.1 | 0.83 |
5: 4+Split(4) | +23 | +11 | 7.8 | 10.1 | 82.1 | 0.83 |
6: 2+Join(3) | +58 | +18 | 7.8 | 10.1 | 82.1 | 0.83 |
7: 6+Join(4) | +14 | +1 | 7.8 | 10.1 | 82.1 | 0.83 |
Features (n) | Rel | PoR | Ir | Test
| Tcom
|
\(\mathrm{T}_{\mathrm{total}}^{\mathrm{avg}}\)
|
\(\mathrm{T}_{\mathrm{total}}^{\min}\)
|
\(\mathrm{T}_{\mathrm{total}}^{\max}\)
|
---|---|---|---|---|---|---|---|---|
Previous Work [7] | 0 | 604 | 0 | 0.00 s | 0.60 s | 0.60 s | 0.16 s | 1.45 s |
1: Node(1) | 7 | 73 | 524 | 0.05 s | 0.04 s | 0.09 s | 0.03 s | 0.14 s |
2: 1+Seq(2) | 13.7 | 44.9 | 554.4 | 0.05 s | 0.02 s | 0.07 s | 0.03 s | 0.09 s |
3: 2+Seq(3) | 9.5 | 73.2 | 521.3 | 0.05 s | 0.05 s | 0.10 s | 0.03 s | 0.15 s |
4: 2+Split(3) | 9.5 | 73.2 | 521.3 | 0.05 s | 0.05 s | 0.10 s | 0.03 s | 0.15 s |
5: 4+Split(4) | 9.5 | 73.2 | 521.3 | 0.05 s | 0.05 s | 0.10 s | 0.03 s | 0.15 s |
6: 2+Join(3) | 9.5 | 73.2 | 521.3 | 0.05 s | 0.05 s | 0.10 s | 0.03 s | 0.15 s |
7: 6+Join(4) | 9.5 | 73.2 | 521.3 | 0.05 s | 0.05 s | 0.10 s | 0.03 s | 0.15 s |
-
dcutoff, which is a parameter that determines whether a role feature is considered to be discriminative (Definition 6).
-
lcutoffhigh, rcutoff and lcutoffmed, which are parameters that determine what is considered to be a sufficiently similar for a feature to match (Definition 10).
-
ratior and ratiop, which are parameters that determine which class a process model belongs to based on the fraction of features that match with the query model (Definition 13).
-
wskipn, wskipe and wsubn, which denote the weights given to node deletion, node substitution and edge deletion (Definition 14).
-
k, which denotes how many most similar nodes are considered for a search node (Definition 15).
-
w l and w r , which denote the weights given to the label and role similarities (Definition 16).
8.2 Heterogeneous evaluation
8.2.1 Evaluation setup
Branch/Business Function | nr. of query models | nr. of document models |
---|---|---|
Procurement | 3 | 37 |
Delivery and invoicing | 1 | |
Production planning | 17 | |
Sales | 4 | 43 |
Business planning | 2 | |
Management | 2 |
8.2.2 Evaluation results
Features (n) | Rel | PoR | Ir | R-Prec | Test
| Tcom
|
\(\mathrm{T}_{\mathrm{total}}^{\mathrm{avg}}\)
|
\(\mathrm{T}_{\mathrm{total}}^{\min}\)
|
\(\mathrm{T}_{\mathrm{total}}^{\max}\)
|
---|---|---|---|---|---|---|---|---|---|
Previous [7] | 0 | 100 | 0 | 0.56 | 0.00 s | 0.32 s | 0.32 s | 0.20 s | 0.51 s |
1: Node(1) | 74.2 | 20.4 | 5.4 | 0.54 | 0.02 s | 0.02 s | 0.04 s | 0.02 s | 0.06 s |
2: 1+Seq(2) | 83 | 11.6 | 5.4 | 0.52 | 0.02 s | 0.01 s | 0.03 s | 0.02 s | 0.05 s |
3: 2+Seq(3) | 85 | 9.6 | 5.4 | 0.50 | 0.02 s | 0.01 s | 0.03 s | 0.02 s | 0.05 s |
4: 2+Split(3) | 85 | 9.6 | 5.4 | 0.50 | 0.02 s | 0.01 s | 0.03 s | 0.02 s | 0.05 s |
5: 4+Split(4) | 85 | 9.6 | 5.4 | 0.50 | 0.02 s | 0.01 s | 0.03 s | 0.02 s | 0.05 s |
6: 2+Join(3) | 85 | 9.6 | 5.4 | 0.50 | 0.02 s | 0.01 s | 0.03 s | 0.02 s | 0.05 s |
7: 6+Join(4) | 85 | 9.6 | 5.4 | 0.50 | 0.02 s | 0.01 s | 0.03 s | 0.02 s | 0.05 s |