The article 'The WHY in Business Processes: Discovery of Causal Execution Dependencies' delves into the application of causal discovery techniques to understand business processes more deeply. Traditional process mining focuses on time precedence, but the authors argue that understanding causal relationships is crucial for effective interventions. They introduce the Causal Business Process (CBP) model, which represents causal execution dependencies among tasks. The method uses the Linear Non-Gaussian Acyclic Model (LiNGAM) algorithm to reveal these dependencies from event logs. The authors also provide algorithms to highlight discrepancies between the discovered business process and the CBP model, aiding in process improvement. The article is structured to present relevant background, the novel method, and its application on synthetic and real-world datasets, demonstrating its practical value and potential for enhancing process understanding and adaptation.
Abstract
Unraveling the causal relationships among the execution of process activities is a crucial element in predicting the consequences of process interventions and making informed decisions regarding process improvements. Process discovery algorithms exploit time precedence as their main source of model derivation. Hence, a causal view can supplement process discovery, being a new perspective in which relations reflect genuine cause-effect dependencies among the tasks. This calls for faithful new techniques to discover the causal execution dependencies among the tasks in the process. To this end, our work offers a systematic approach to the unveiling of the causal business process by leveraging an existing causal discovery algorithm over activity timing. In addition, this work delves into a set of conditions under which process mining discovery algorithms generate a model that is incongruent with the causal business process model, and shows how the latter model can be methodologically employed for a sound analysis of the process. Our methodology searches for such discrepancies between the two models in the context of three causal patterns, and derives a new view in which these inconsistencies are annotated over the mined process model. We demonstrate our methodology employing two open process mining algorithms, the IBM Process Mining tool, and the LiNGAM causal discovery technique. We apply it to a synthesized dataset and two open benchmark datasets.
1 Introduction and Motivation
Process mining (PM) aims to gain insights into an organization’s business processes by analyzing event data recorded in its information systems, with the goal of improving business operations. Typical process mining techniques are inherently associational, approaching process discovery from a time precedence perspective, i.e., they discover ordering constraints among the process’ activities.
Such associative relationships are between two variables where a change in one variable is associated with a change in another. In contrast, causal relationships describe the connection between a cause and its effect, where the cause is an event that contributes to the production of another event, the effect [1].
In this paper, we focus on employing causal discovery to identify a new type of process execution causal view. Identification of causal relationships is key to the ability to reason about the consequences of interventions. One of the fundamental goals of causal analysis is not only to understand exactly what causes a specific effect but rather to be able to conclude if certain interventions account for the formation of certain outcomes, thus, being able to answer questions of the form: Does the execution of a certain activity in a loan approval process entail a delay in the handling of the application? Or if the same activity is skipped, may the process duration be shortened?
Typical process discovery techniques rely on time precedence among the activities in the event log, denoted by a directly-follows relationship [2]. This may sometimes be deceiving, due to the associative nature of such relationships, presenting an execution order that is different than the causal execution order inherent in the event log. Associative relationships are not enough to develop the causal understandings necessary to inform intervention recommendations as aforementioned. In order to understand which intervention should be made to improve a process outcome, we first must understand the causal chains that tie the unfolding of the processes generating our data.
As a simple example, we consider in Fig. 1 a Petri-net [2] depicting some of the final steps in a loan approval business process (BP), with its executions recorded in a log file. In this BP, the Accept task precedes sending the client an email notification about the acceptance of the loan while also sending its records for archival retention. Note that the duration of the Archive task is usually longer than that of the Email task. Subsequently, the application is closed in the system and all corresponding paperwork is disposed of.
Applying process discovery to the initial steps in this process (i.e., Accept, Email, and Archive) can sometimes derive the graphs in Fig. 2. Our intuitive comprehension of such graphs attributes a causal meaning to their edges. That is, the links between the tasks in the graph are typically interpreted as if the execution of the Archive task depends on having the Email task executed first. However, we know that in the real BP (see Fig. 1), such dependency is not imposed, and may have arbitrarily emerged as a result of longer execution times for the Archive task instances in the log.
This may erroneously drive us to infer that, for example, if we do not send an email about a given loan acceptance, then, the loan is also not archived, or that if we shorten the execution time of the emailing task (e.g., by allocating more resources to it) then we can complete the archiving task earlier.
Even when more contemporary algorithms that utilize frequency thresholds (e.g., the heuristic miner and the inductive miner) yield a result that retains the parallelism between the Email and Archive tasks of the real business process as depicted in Fig. 1, the interpretation of such a graph structure is different, since the relations among the tasks, as denoted by the Petri-net ‘place’ notation, carry a different meaning. In a process view, an arrow denotes a time precedence relation. In a causal view, an arrow designates a causal execution dependence as defined in Definition 1 in Sect. 3.
In our illustrative example, the process view reflects individual time precedence dependencies between Email and Accept and between Archive and Accept, with no precedence between Email and Archive (i.e., parallel tasks). A causal view of the same structure reflects the meaning of having the Accept task accountable for the execution of the two subsequent tasks, that is, the execution of Email and Archive depends on the prior execution of the Accept task.
We define a new type of a BP model in which all inter-task execution relations are causal as the Causal Business Process (CBP) model. We consider the CBP to be a new view about the BP since it assigns causal execution semantics to the fundamental relation among the tasks as later defined in Definition 1. Such a view aids process owners in gaining a sound understanding of the potential consequences of activity execution interventions (e.g., the effect of adding more resources to some activity). Overall, we provide a novel method that draws from state-of-the-art causal discovery to reveal the CBP model from a given process event log based on the execution times of the activities. Specifically, we realize this by employing the LiNGAM causal discovery algorithm.1 The method is then also extended to tackle the cases where the discovered BP model does not coincide with the CBP model, prescribing how to systematically annotate such discrepancies over the discovered BP model. In our illustrative example this could be the case where the executions of the Email and Archive tasks are both dependent on the execution of the Accept task (as shown in Fig. 7 and detailed henceforth), and the process discovery algorithms yield a result as in Fig. 2, where Email directly-precedes Archive and directly-follows Accept. Our method is agnostic to the concrete process and causal discovery algorithms employed.
We claim that the CBP model is a critical supplementary instrumentation underlying the ability to faithfully discover the logic that unfolds towards process execution outcomes, and consequently, improve and adapt underlying process executions. Acknowledging the importance of automated process adaptation and improvement as recently envisioned in the ABPMS manifesto [3], it is imperative that such improvements also take into account a genuine causal execution view of the process, in addition to considering the temporal appearance of relationships within the process.
In this paper we formulate a novel method for derivation of the causal execution view as a new view that overlays the discovered business process model. Subsequently, we demonstrate the applicability of our method on synthetic and real event logs.
Our paper is structured as follows: Sect. 2 presents relevant background about process discovery and causal discovery. Section 3 describes our novel method for deriving CBPs and utilizing them to construct BP overlays that highlight inter-model discrepancies. General soundness and applicability assessments of the proposed method, followed by two pragmatic applications, are presented in Sect. 4. Respectively, the results of applying the method on one synthetic dataset and on two real-world applications, for CBP-to-BP discrepancy analysis and for inter-variant comparisons, are presented in Sect. 5. Section 6 reviews related work on the intersection between causal discovery and business process discovery. We wrap up the paper with a summary of conclusions and potential future work.
2 Background
We use \(\xrightarrow {c}\) to denote a causal execution relation. For example, \(A \xrightarrow {c} B\) means that the execution of A causes the execution of B. That is, without A happening B cannot happen, and given we know that B occurred, we can deduce that A had occurred (sometime) before it. It is important to note that any two time correlated occurrences do not necessarily entail a causal execution relation. For example, let’s assume that \(Accept \xrightarrow {c} Email\), and also that \(Accept \xrightarrow {c} Archive\). As a result, the occurrences of Email and Archive will be correlated in the data. However, one cannot deduce a causal execution dependence between these two tasks, even if one always occurs immediately after the other. In our example, the correlation between Email and Archive is spurious, while the correlations between Accept and Email and between Accept and Archive reflect a causal relationship.
2.1 Process Discovery
Process discovery (PD) is one of the challenging tasks in PM, related to the effort of constructing a process model based on behavior that can be observed in a given event log [2, 4]. The input for PD algorithms is an event log and the output is the business process model. Conventionally, an event log contains data related to a single process, where each event refers to a single process instance, namely a case. An event log consists of several cases, i.e., multiple runs of a process [2], where a case is a sequence of events (at least one) referred to as a trace. An event can also be related to some activity or task. There exists a (time) order between events within the same case. Events include at least task name, timestamp, and case-id attributes. Most fundamentally, PD algorithms rely on time precedence among the events in the log. For instance, if all log appearances (or a sufficiently high portion) of task A occur before the ones of task B, PD may infer that task A usually happens before task B, denoted by \(A \rightarrow B\) in the inferred process model. However, such dependence does not necessarily imply that the occurrence of A also causes the occurrence of B. For example, it might be that a task C is the real cause for both task occurrences.
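The directly-follows relation underlying PD can be illustrated with a minimal sketch. The log format here (a mapping from case-id to timestamped activities) is a simplified illustrative assumption, not a standard such as XES:

```python
from collections import Counter

def directly_follows(log):
    """Count directly-follows pairs (A, B) from an event log.

    `log` maps each case-id to a list of (activity, timestamp) tuples;
    events are ordered by timestamp before pair extraction.
    """
    counts = Counter()
    for case_id, events in log.items():
        trace = [a for a, _ in sorted(events, key=lambda e: e[1])]
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts

log = {
    "c1": [("Accept", 1), ("Email", 2), ("Archive", 5), ("Close", 6)],
    "c2": [("Accept", 1), ("Email", 3), ("Archive", 4), ("Close", 7)],
}
print(directly_follows(log))
```

A threshold-based miner such as the heuristic miner would additionally filter these counts, keeping only edges whose frequency exceeds a user-set threshold.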
For the loan example assessment, we employed two representative PD algorithms, Alpha (\(\alpha\)) and Heuristic [2, 5]. While \(\alpha\) is strict, the Heuristic algorithm is more robust to noisy data, using a threshold to determine whether an edge between A and B exists, based on the number of times A happened before B. While we acknowledge that the \(\alpha\) algorithm is somewhat outdated as it was the first algorithm developed for process mining, we consider its inclusion essential to demonstrate that the results of our algorithm, as presented in this paper, remain sound even when using the \(\alpha\) algorithm as input, and how it coincides with the results of a more robust algorithm such as the heuristic miner. However, we limited the use of the \(\alpha\) algorithm to synthetic datasets, where we could control the level of noise in the input. For real datasets, we replaced it with a variant of the contemporary inductive miner algorithm, which is embedded in the IBM Process Mining commercial product, as detailed in Sects. 4 and 5. In both PD algorithms, there can be cases in which certain forms of execution relations are misinterpreted from a causal point of view, when modeled as a sequence of three tasks (Fig. 2). This could occur either in logs where one task always occurs before another task, or where the order of occurrence between the two tasks reverses in a relatively small subset of cases.
2.2 Causal Discovery
Causal inference and causal discovery are the two main pillars of causal analysis. While causal discovery focuses on analyzing and creating models that illustrate the relationships inherent in the data [6‐8], causal inference is the process of drawing a conclusion about a causal connection based on the conditions of the occurrence of an effect [9‐11]. In this work, we focus on causal discovery (CD) over process execution times as inherently recorded in the timestamps of the events in process event logs.
More concretely, CD aims at constructing causal graphs from data by exploring hypotheses about the causal structure [12]. Our focus is on the exploration of relations among task executions in a process for the sake of discovering the CBP. Additional assumptions, such as functional forms and distributions, are often required to identify the causal graph from the data. In a typical CD setting, the causal graph is assumed to be a Directed Acyclic Graph (DAG), and all the common causes of observed variables are themselves observed (i.e., are present in the event log).
When intervention is possible, if t(A) denotes the execution time for task A, and considering any two viable interventions \(c_1\) and \(c_2\) on this execution time (e.g., by adding more resources to it), one may conclude the execution time of any other task B to be causally dependent on it if:

$$\begin{aligned} p(t(B) \mid \text {do}(t(A)=c_1)) \ne p(t(B) \mid \text {do}(t(A)=c_2)) \end{aligned}$$
where p is a probability and ‘do’ designates an intervention on the execution time. There could be additional tasks (a.k.a. confounders) that may be commonly affecting the duration of tasks A and B. Hence, to determine the direct effect between these two tasks, all commonly affecting tasks should be handled (e.g., weighted in as covariates) to remove their effect. Subsequently, the elimination of all common causes affecting t(A) and t(B) allows for the removal of the ‘do’ operator, and assessing from the process log directly whether [13]:

$$\begin{aligned} p(t(B) \mid t(A)=c_1) \ne p(t(B) \mid t(A)=c_2) \end{aligned}$$
Without any further assumptions about a given dataset, it may only be possible to distinguish between the independence of t(A) and t(B) and the dependence between the two activities.
However, in observational circumstances as in our case, where intervention is not viable, we may not be able to tell whether t(A) depends on t(B) or vice versa. Contemporary CD algorithms disentangle this issue of identifying the direction of dependence with some additional assumptions about the dependence concerning its functional form and error distribution. Particularly, if we can model the execution dependence between activities A and B as a linear relationship, we can represent it as follows:

$$\begin{aligned} t(B) = \beta _{AB} \, t(A) + \epsilon _B \end{aligned}$$
In a typical business process, given such settings, it is plausible to consider such a linear form of dependence between t(A) and t(B), where the duration of both activities’ executions may be captured by a non-Gaussian distribution (e.g., uniform or exponential). For simplicity, we also assume negligible transition times between activities (i.e., \(\beta _{AB}=1\)). This assumption can be relaxed by capturing the transitions themselves as explicit activities.
In this work, we employed the Linear Non-Gaussian Acyclic Model (LiNGAM) algorithm for the discovery of the CBP, given its main assumptions hold in the typical configuration of a process event log, and in particular adhere to non-Gaussian distribution of time as a continuous variable. These assumptions conform to the conventional modeling of business processes with activity durations drawn from non-Gaussian distributions, and also that cycles in a typical business process are finite and can be eliminated with techniques such as k-loop unrolling. Specifically, LiNGAM [14] assumes that a causal model is composed of linear functions connecting the variables and that there are no unobserved confounders. Also, according to [15], the linearity assumption can be in fact relaxed under additive noise assumption.
LiNGAM provides a way to differentiate between \(x \xrightarrow {c} y\) and \(y \xrightarrow {c} x\), where \(\xrightarrow {c}\) denotes a causal relationship [14], such that changing y would not trigger a change in x in the former, but a change in x will trigger a change in y. This is achieved in LiNGAM by regressing in both directions, i.e., in the bi-variate case, estimate f such that \(y=\hat{f}(x) + u_x\), then deduce the residuals \(u_x = y - \hat{f}(x)\) and check whether \(x \perp u_x\) holds (\(\perp\) designating independence). Similarly, do the same for y, i.e., \(x = \hat{f}(y) + u_y, u_y = x - \hat{f}(y)\), and check again for the independence \(y \perp u_y\). If independence holds in both directions, then no edge between x and y is inferred. When dependence exists only in one direction, for example, \(x {\not \perp } u_{x}\), then \(y \xrightarrow {c} x\) is inferred. There cannot be dependence in both directions.
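The regress-and-test procedure above can be sketched as follows. This is a simplified illustration, not the actual LiNGAM implementation: it substitutes a crude moment-based proxy (correlation between the squared regressor and the squared residual) for a proper independence test, and it simulates execution times with illustrative uniform distributions:

```python
import numpy as np

def direction_score(x, y):
    """Crude pairwise causal-direction test in the spirit of LiNGAM.

    Regress each variable on the other (least squares), then check in
    which direction the residual looks independent of the regressor,
    using corr(regressor^2, residual^2) as a simple dependence proxy.
    """
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    r_y = y - np.polyfit(x, y, 1)[0] * x   # residual of regressing y on x
    r_x = x - np.polyfit(y, x, 1)[0] * y   # residual of regressing x on y
    dep_xy = abs(np.corrcoef(x ** 2, r_y ** 2)[0, 1])  # near 0 if x -> y
    dep_yx = abs(np.corrcoef(y ** 2, r_x ** 2)[0, 1])  # near 0 if y -> x
    return "x->y" if dep_xy < dep_yx else "y->x"

rng = np.random.default_rng(0)
t_accept = rng.uniform(0.0, 1.0, 5000)            # non-Gaussian cause
t_email = t_accept + rng.uniform(0.0, 0.5, 5000)  # effect depends on cause
print(direction_score(t_accept, t_email))         # expected: 'x->y'
```

With Gaussian variables the asymmetry vanishes, which is exactly why LiNGAM's non-Gaussianity assumption is essential.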
2.3 Causal Patterns
The discovery of causal relationships from observational logs can often lead to misleading conditions in which dependencies may appear or disappear due to unintended scoping of observations to specific values or ranges of the observed variables. This phenomenon is known as conditioning, as discussed in [16, p. 112]. Some dependencies may be wrongly detected because of external, non-measured variables (Simpson’s paradox). Others may be omitted because of some inherent bias in the data selection (Berkson’s paradox) [17, 18].
As presented in [16], there are three causal relationship patterns or junctions that constitute the building blocks of any causal net structure (Fig. 3):
(a) The confounder A is a node that is a common cause for B and for C (the ‘affected’ nodes). A confounder may create a correlation between B and C even if there is no direct causal relationship between them.

(b) The mediator B is a node that transmits the effect of A to C. If we condition on the mediator values we may not be able to observe the existence of a direct causal effect between A and C.

(c) The collider C is a node that is caused by both A and B. Conditioning on the collider C may present a correlation between A and B that might be incorrectly interpreted as a causal relationship between the two.
Our aim is to identify the true inter-task causal relationships in the presence of any of these patterns in the event log, tackling the non-obvious effort to separate spurious correlation relations from causal relations. We assume all the confounder tasks are included in the logs (i.e., known confounder assumption). We show that state-of-the-art algorithms for process discovery may fail to genuinely capture the inter-task causal execution relationships, and thus assess the viability to employ the LiNGAM method [14] to annotate the result of the PD algorithm according to the CBP as a new view.
3 Our Method for Deriving Causal BP Overlays
We first formulate our definition of a causal execution dependence between tasks in a business process:
Definition 1
(\(A \xrightarrow {c} B\)) A causal execution dependence implies that the time task B executes is determined by the time task A executes in a given process.
While the dependence defined above is intrinsically a manifestation of some materialistic interaction between the inner workings of the two activities [19], our definition is external, based on the directed relationship (asymmetric) between the execution time property of any two tasks in the process. Such dependence implies that the execution time of task B can only be determined based on the execution time of task A and not vice versa. The relationship is directed such that \(A \xrightarrow {c} B\) does not imply that \(B \xrightarrow {c} A\). That is, changing the execution time of A affects the execution time of B, but not vice versa.
In experimental settings, altering different execution times for A while fixing the effects of all other potentially affecting variables (a.k.a. an ‘intervention’) should present different distributions in the execution time of B as a sign of a causal execution dependence of B on A. Practically, we employ LiNGAM to determine all causal execution dependencies from a given event log. We distinguish between two possible modalities for the calculation of execution time, absolute and relative. The former relates to the time since process execution was initiated, while the latter relates to the time since the preceding activity finished its execution (detailed in Step II(a) below).
There could be other dependencies among tasks, corresponding to other task attributes (e.g., number of allocated resources). These other dependencies could be complementary to the CBP and can be denoted as follows: \(A \xrightarrow [attribute]{c} B\). Such dependencies do not affect the conventional results of process mining, which are meant to cater to occurrence ordering among the tasks. In this work, our focus is strictly on the execution dependence that manifests itself via the timing of the tasks.
Definition 2
(CBP) A causal business process is a process model in which inter-task relations among any of its tasks have a causal execution dependence.
It is important to stress that the directly-follows relation (i.e., ordering on time) is fundamentally different from the causal execution dependence in the sense that \(A \xrightarrow {c} B \Rightarrow A \rightarrow B\), but \(A \rightarrow B \nRightarrow A \xrightarrow {c} B\). Furthermore, \(A \rightarrow B \Rightarrow \lnot (B \xrightarrow {c} A)\), such that provided it is known that B directly-follows A, the possible causal execution relation \(B \xrightarrow {c} A\) is precluded. Thus, while the CBP view is complementary to the discovered BP view and cannot be inferred from it, we rely on its discovered directly-follows relations to narrow down the exploration space of the CBP.
Contemporary tools for CD, such as the LiNGAM algorithm, allow for a data-driven approach to reveal such causal execution dependencies under the aforementioned assumptions. Our particular focus in this work is on highlighting the discrepancies between the BP and the CBP with respect to the three causal patterns. Concretely, with respect to Definition 1 we further define the three patterns as follows:
Definition 3
(\(C \xleftarrow {c} A \xrightarrow {c} B\)) A confounding execution dependence on A is such that both the time task B executes, and the time task C executes, depend on the time the confounder task A executes.
Definition 4
(\(A \xrightarrow {c} B \xrightarrow {c} C\)) A mediating execution dependence on B is such that the time task C executes depends on the time task A executes through the time task B executes, with B transmitting the effect of A to C.
Definition 5
(\(A \xrightarrow {c} B \xleftarrow {c} C\)) A colliding execution dependence on B is such that the time the collider task B executes depends on both the time task A executes and the time task C executes.
Our hypothesis is that given a timestamped process log as an input, we can faithfully generate a CBP model with all causal execution dependencies and highlight its discrepancies compared to the process mining algorithm output.
As our validation shows, the highlighting of the inconsistencies can help with not only gaining fundamental insights about the process, but can also serve as a basis for better process improvement and interpretation of its execution outcomes.
Our method to validate the above hypothesis caters to the following steps (see Fig. 4):

I. Apply any state-of-the-art process mining algorithm for BP model discovery.

II. Apply the LiNGAM algorithm for CBP model discovery over the occurrence times of the activities.

III. Compare the CBP and the BP models with respect to the three patterns.

IV. Construct a new view that combines the results of step III to highlight all the discrepancies.
In the implementation of step I, we ensure that our method is decoupled from the selection of any specific process discovery technique. Our underlying assumption is that the discovered process model is correct, implying that our method is sound regardless of the particular PD algorithm used. Furthermore, the view constructed in step IV is relative to its input, regardless of its correctness.
We note that in step II, our concrete choice of the LiNGAM algorithm may be relaxed to the use of other plausible CD algorithms in the future, as long as the underlying assumptions of any such algorithm conform to the assumptions of conventional modeling of business processes (e.g., distribution of activity times).
To ensure the soundness of the resulting view with respect to the highlighting of causal execution dependencies, the application of the above steps should conform to the following requirements:
R1: For any case where a directly-follows relation \(A \rightarrow B\) in the BP does not coincide with a causal execution dependence \(A \xrightarrow {c} B\) in the CBP (i.e., the execution of B is caused by some other activity \(A'\) in the CBP), the highlighting denotes both the directly-follows relation as not being causal, and also the relation \(A' \xrightarrow {c} B\).
R2: For any case where a directly-follows relation \(A \rightarrow B\) in the BP coincides with a causal execution dependence \(A \xrightarrow {c} B\) in the CBP, the highlighting denotes the directly-follows relation as being also a causal execution dependence.
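Requirements R1 and R2 amount to a comparison between the edge sets of the two models. A minimal sketch, assuming both models are given simply as sets of directed activity pairs (the function name and representation are illustrative, not part of the method's formal definition):

```python
def annotate_overlay(bp_edges, cbp_edges):
    """Annotate directly-follows edges against causal execution edges.

    Per R2, a BP edge that coincides with a CBP edge is marked as
    causal; per R1, a BP edge absent from the CBP is marked as not
    causal, and CBP edges missing from the BP are reported separately
    so they can be added to the highlighted view.
    """
    annotations = {
        edge: ("causal" if edge in cbp_edges else "not causal")
        for edge in bp_edges
    }
    extra_causal = cbp_edges - bp_edges  # causal relations the BP misses
    return annotations, extra_causal

# Illustrative loan example: the mined BP chains Email before Archive,
# while the CBP has both depending directly on Accept.
bp = {("Accept", "Email"), ("Email", "Archive")}
cbp = {("Accept", "Email"), ("Accept", "Archive")}
ann, extra = annotate_overlay(bp, cbp)
print(ann)    # Email->Archive is flagged as not causal
print(extra)  # {('Accept', 'Archive')} is added from the CBP
```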
Next, we detail each of the steps in our method.
3.1 Step I: Apply Any State-of-the-Art PD Algorithm for BP Model Discovery
Process discovery is employed as a first step, to guide the overall discovery of causal execution dependencies. This is because while causal execution dependencies are not the same as temporal relations as mentioned above, time-precedence is an essential condition (i.e., necessary but insufficient) in the occurrence of a causal one. That is, the absence of time-precedence also prohibits a causal execution dependence. As such, our method traverses over the discovered process model structure as a means to discover all causal execution dependencies. Thus, this step takes as an input the process event log and yields the discovered BP model in the form of a directly-follows graph as an output.
The result of PD caters to multiple process variants. A “process variant” refers to a specific sequence or path of activities in the process. Determining all process variants directly from an event log is essentially enumerating all unique traces in the log. For each group (trace), the sequence of activity names (event types) is extracted, and any duplicates (i.e., traces that represent the same sequence of activities) are identified and consolidated. This step yields all the unique traces, namely the process variants. For various reasons, there can be many unique process variants. Some may be more frequent in the log, while others may capture infrequent or even noisy behaviors. The CBP construction in the next step is typically based on a unification of (user-selected) variants, while trading off variant completeness against the model’s ability to generalize to cases that may have not yet been observed in the log. For flexibility of the method we keep the selection of particular (one or more) variants of interest optional, with the default of including all variants in the discovered BP. Such a variant selection is typically affected by the segment of interest in the BP subject to intervention. Hence, we consider the remaining steps as being associated with the selection of a concrete set of one or more process variants as an additional output of step I. Particularly, given a discovered process model \(M\) as the union of all variants in \(T\):
$$\begin{aligned} M = \bigcup _{v \in T} v \end{aligned}$$
A single process variant \(v\) is one specific sequence (or list) of events from \(E\). Formally:
$$\begin{aligned} v = (e_1, e_2, \ldots , e_n) \end{aligned}$$
where \(e_i \in E\) for all \(i = 1, \ldots , n\), and \(n\) is the length of variant \(v\).
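The variant enumeration described above reduces to grouping identical traces. A minimal sketch, again assuming a simplified case-id-to-trace log format rather than a standard event-log schema:

```python
from collections import Counter

def process_variants(log):
    """Enumerate unique traces (process variants) with frequencies.

    `log` maps each case-id to its time-ordered list of activity
    names; identical sequences are consolidated into one variant.
    """
    return Counter(tuple(trace) for trace in log.values())

log = {
    "c1": ["Accept", "Email", "Archive"],
    "c2": ["Accept", "Archive", "Email"],
    "c3": ["Accept", "Email", "Archive"],
}
print(process_variants(log))
```

The resulting frequencies support the user-driven selection of variants of interest, e.g., keeping only variants above a frequency threshold.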
3.2 Step II: Apply the LiNGAM Algorithm for CBP Model Discovery Over the Occurrence Times of the Activities
The application of the LiNGAM algorithm involves two sub-steps. The first corresponds to the pre-processing of the data with respect to each pair of activities to be examined. The second sub-step prescribes a pair-wise order for LiNGAM invocations. It is important to note that a prerequisite to running LiNGAM is to test adherence to the non-Gaussian assumption. In general, LiNGAM allows this requirement to be relaxed to having at most one activity duration drawn from a Normal distribution. In the (unlikely) case where more than one activity duration is normally distributed, the activity pair could be skipped in the second step.
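The prerequisite non-Gaussianity screen could be approximated as follows. This is a dependency-free sketch based on moment statistics; in practice a formal normality test (e.g., Shapiro-Wilk) would be preferable, and the tolerance value here is an illustrative assumption:

```python
import numpy as np

def looks_gaussian(sample, tol=0.5):
    """Crude normality screen via sample skewness and excess kurtosis.

    Both moments are near zero for a Gaussian sample, while skewed or
    heavy-tailed duration distributions (e.g., exponential) fail the
    screen and are thus suitable for LiNGAM.
    """
    z = (np.asarray(sample) - np.mean(sample)) / np.std(sample)
    skew = np.mean(z ** 3)
    excess_kurtosis = np.mean(z ** 4) - 3.0
    return abs(skew) < tol and abs(excess_kurtosis) < tol

rng = np.random.default_rng(2)
gaussian_like = [
    name
    for name, sample in {
        "Email": rng.exponential(0.2, 2000),    # non-Gaussian duration
        "Archive": rng.normal(1.0, 0.1, 2000),  # Gaussian duration
    }.items()
    if looks_gaussian(sample)
]
print(gaussian_like)  # more than one entry would violate the assumption
```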
3.3 Step II(a): Pre-processing
Let \(E\) be the set of all event types, and \(C\) be the set of all cases. Let \(A\) be the set of all possible attributes. We define the initial input for the process as an event log \(L\), being a collection of tuples \((e, c, t, N)\), where \(e \in E\) is an event type, \(c \in C\) is a case ID, \(t\) is a timestamp, and \(N\) is a set of name-value attribute pairs over \(A\).
The tabular form \(T\) is a collection of rows, where each row \(r \in T\) corresponds to a unique case \(c \in C\), and an ordered list of events, ordered by their timestamps, originally belonging to that case, where:
\(c \in C\) is the case ID,
\([(e_1, t_1, N_1), (e_2, t_2, N_2), \ldots ]\) is an ordered list of events, where \(e_1, e_2, \ldots\) are the event types,
\(t_1, t_2, \ldots\) are the timestamps of the events, and
\(N_1, N_2, \ldots\) are the sets of name-value attribute pairs associated with each event.
We select all cases in \(T\) whose order of events matches the order of events in each variant of interest \(v\).
Given: \(v = (e_1, \ldots , e_k)\) as an ordered set of events,
$$\begin{aligned} \begin{aligned} T^{v}=\text {Select}_{v} (T) = \{&r \in T \mid r \text { is of form } (c, s) \\&s = [(e'_1, t'_1, N'_1), \ldots , (e'_k, t'_k, N'_k)] \\&\text {such that } e'_j = e_j \\&\text {for all } 1 \le j \le k \} \end{aligned} \end{aligned}$$
For any given set \(vs\) of variants selected for the analysis, this step is repeated for each of the individual variants and then all sets are unified:

$$\begin{aligned} T^{vs} = \bigcup _{v \in vs} T^{v} \end{aligned}$$
For any pair of events that is determined for comparison by the application of LiNGAM (see Step II(b)), we extract a tuple of timestamps \((t_i, t_j)\). That is, we form a new set \(D\) consisting of tuples \((t_i, t_j)\), where \(t_i\) and \(t_j\) are the timestamps of the two extracted events in each case.
Next, we adjust the times extracted in each row in \(D\) with respect to some anchoring modality. We refer to an anchor as the point in time from which the activity execution duration is determined. Fundamentally, we consider one of two modalities:
1. Absolute: Anchoring of all timestamps with respect to the initial execution time \(t_0\) of each case. The adjustment can be represented as \(t'_i = t_i - t_0\).
2. Relative: Anchoring of the timestamps in each trace with respect to the timestamp of the event that precedes \(e_i\) in each case (i.e., \(t_{i-1}\)). The adjustment can be represented as \(t'_i = t_i - t_{i-1}\).
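The two anchoring modalities can be sketched as follows. The trace encoding is hypothetical; the adjustments mirror the definitions above (absolute: subtract the case start time \(t_0\); relative: subtract the preceding event's timestamp).

```python
def adjust_timestamps(trace, modality="absolute"):
    """Adjust raw event timestamps in one trace.

    trace: list of (event_type, timestamp) pairs ordered by time.
    absolute: anchor every timestamp to the case start time t0.
    relative: anchor each timestamp to the preceding event's timestamp.
    """
    times = [t for _, t in trace]
    if modality == "absolute":
        t0 = times[0]
        adjusted = [t - t0 for t in times]
    elif modality == "relative":
        adjusted = [0] + [times[i] - times[i - 1] for i in range(1, len(times))]
    else:
        raise ValueError(f"unknown modality: {modality}")
    return [(e, a) for (e, _), a in zip(trace, adjusted)]

trace = [("Accept", 100), ("Email", 107), ("Archive", 111)]
print(adjust_timestamps(trace, "absolute"))   # [('Accept', 0), ('Email', 7), ('Archive', 11)]
print(adjust_timestamps(trace, "relative"))   # [('Accept', 0), ('Email', 7), ('Archive', 4)]
```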
Using \(D'\) as an input, the adjusted timestamps are treated as variables and we apply the LiNGAM algorithm to detect causal relationships between \(e_i\) and \(e_j\) based on their corresponding adjusted timestamps \(t_i\) and \(t_j\).
3.4 Step II(b): Pair-Wise Application of LiNGAM
With respect to any given set of variants vs discovered by the PD algorithm, we apply the CD algorithm (LiNGAM) in a pairwise manner over all ordered event pairs that preserve the original order in the set.
That is, given a variant \(v\) represented as a sequence of events:
$$\begin{aligned} v = (e_1, e_2, \ldots , e_n) \end{aligned}$$
where \(n\) is the length of the variant, we denote the set of all ordered pairs in it as:
$$\begin{aligned} P(v) = \{ (e_i, e_j) | 1 \le i < j \le n \} \end{aligned}$$
The above is repeated for each \(v \in vs\). We combine all event pairs into a single unified set and apply the LiNGAM algorithm to each pair of events in P(vs):
The application also includes a choice of anchoring \(\text {modality}\) which is either “absolute” or “relative”. For a given set of variants to be examined, the same modality should be consistently employed across all LiNGAM applications. We then apply the LiNGAM algorithm to each individual element in \(P(vs)\) as follows:
$$\begin{aligned}&\text {setModality}\,(\text {``absolute'' or ``relative''}) \\ &\forall p \in P(vs): \; G_{L} \leftarrow G_{L} \cup \text {LiNGAM}(p) \end{aligned}$$
where \(G_{L}\) is the graph that denotes the discovered causal business process (CBP) corresponding to the set of variants vs in the discovered BP.
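The paper uses the LiNGAM family of algorithms (e.g., DirectLiNGAM) for this step. As a self-contained stand-in, the toy estimator below enumerates \(P(v)\) with `itertools.combinations` and decides the direction of a single pair in the LiNGAM spirit: it regresses each adjusted timestamp on the other and prefers the direction whose residual looks independent of the regressor, crudely measured via the correlation of squared values. This is an illustrative sketch under simplified assumptions, not the actual algorithm.

```python
from itertools import combinations

import numpy as np

def ordered_pairs(variant):
    """P(v): all event pairs (e_i, e_j) with 1 <= i < j <= n."""
    return list(combinations(variant, 2))

def causal_direction(u, v):
    """Toy bivariate direction decision in the spirit of LiNGAM.

    In the true causal direction, the regression residual is
    (approximately) independent of the regressor; in the wrong
    direction it is not, which the squared-correlation proxy detects
    for non-Gaussian noise.
    """
    def dep(cause, effect):
        b = np.cov(cause, effect)[0, 1] / np.var(cause)
        resid = effect - b * cause
        return abs(np.corrcoef(cause**2, resid**2)[0, 1])
    return "->" if dep(u, v) < dep(v, u) else "<-"

# Simulated adjusted timestamps for one pair: e_j's timing depends on
# e_i's plus independent non-Gaussian (uniform) noise.
rng = np.random.default_rng(1)
t_i = rng.uniform(-1, 1, 20000)
t_j = t_i + rng.uniform(-1, 1, 20000)

print(ordered_pairs(("Accept", "Email", "Archive")))
print(causal_direction(t_i, t_j))    # the estimator should report '->'
```

In practice, each detected directed edge would be added to \(G_L\) with its regression coefficient as the edge weight.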
In our illustrative example, the output of step II.b is the graph \(G_L:(V_L,E_L)\):
This reflects a \(G_L\) result in which the Accept task is a confounder as denoted in Fig. 7.
3.5 Step III: Compare the CBP and the Discovered BP Models with Respect to the Three Patterns
We compare the discovered BP model and the CBP model with respect to the three causal patterns: confounder, collider, and mediator. The comparison highlights the differences by marking the directly-follows relations in the discovered BP as either causal or non-causal, and by adding causal execution dependencies missing from the discovered BP. All comparison results are merged into a new process perspective graph (\(G_{result}\)) that constitutes a causal discrepancy view.
Algorithm 1 applies to the case where the LiNGAM output DAG shows a confounder structure (i.e., with A as the confounder) while the PD algorithm discovers this structure as the sequence \(A \rightarrow B \rightarrow C\). In accordance with the aforementioned requirements, the generation of a new view that highlights the discrepancies between the two models for the confounder pattern should conform to: R1 - highlighting the relationship \(B \rightarrow C\) as not being causal and adding \(A \xrightarrow {c} C\), and R2 - highlighting the relationship \(A \rightarrow B\) as also being causal. We formally elaborate the realization of step #3 in our method (see Sect. 3) for the confounder pattern in Algorithm 1.
Algorithm 1
Given a discovered BP and a corresponding CBP that may consist of a CONFOUNDER pattern
We note that we also check for \(A {\not \rightarrow } C\), since such a sequence gives rise to the possibility of B not being executed at all in the discovered BP; provided A is a confounder, a variant in which only C executes after A is not feasible. In this case, the algorithm sets the resulting graph \(G_{result}\) to initially match the sequential structure in the PD, adds the missing causal edge \(A \xrightarrow {c} C\), and explicitly annotates the directly-follows edge \(A \rightarrow B\) with a 'c' symbol to indicate that it is causal. Lastly, if B is also not a cause for the execution of C, the directly-follows relationship \(B \rightarrow C\) is denoted as non-causal.
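A compact sketch of Algorithm 1's edge annotations follows. The set/dict edge representation is our assumption for illustration; the paper specifies the algorithm in pseudocode.

```python
def annotate_confounder(bp_edges, cbp_edges, A, B, C):
    """Sketch of Algorithm 1 (assumed data shapes, not the authors' code).

    bp_edges: directly-follows edges of the discovered BP, e.g. {(A, B), (B, C)}.
    cbp_edges: causal edges of the CBP, e.g. {(A, B), (A, C)} when A confounds.
    Returns G_result edges annotated 'c' (causal), 'nc' (non-causal),
    or None (undetermined).
    """
    result = {e: None for e in bp_edges}          # start from the PD structure
    if (A, B) in bp_edges and (B, C) in bp_edges and (A, C) not in bp_edges:
        if (A, C) in cbp_edges:
            result[(A, C)] = "c"                  # R1: add the missing causal edge
        if (A, B) in cbp_edges:
            result[(A, B)] = "c"                  # R2: A -> B is causal
        if (B, C) not in cbp_edges:
            result[(B, C)] = "nc"                 # R1: B -> C is not causal
    return result

bp = {("Accept", "Email"), ("Email", "Archive")}
cbp = {("Accept", "Email"), ("Accept", "Archive")}
print(annotate_confounder(bp, cbp, "Accept", "Email", "Archive"))
```

Algorithms 2 and 3 follow the same shape with the collider and mediator conditions swapped in.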
Algorithm 2 relates to a collider structure (i.e., with C as the collider) in the LiNGAM output DAG, where the PD algorithm discovers the sequence \(A \rightarrow B \rightarrow C\) as in Algorithm 1. Similarly, for the collider pattern, the generation of a new view to highlight the discrepancies between the two models should follow: R1 - marking the relationship \(A \rightarrow B\) as non-causal and adding \(A \xrightarrow {c} C\), and R2 - highlighting that the relationship \(B \rightarrow C\) is also causal. The implementation of step #3 in our method (Sect. 3) for the collider pattern is formally described in Algorithm 2.
Algorithm 2
Given a discovered BP and a corresponding CBP that may consist of a COLLIDER pattern
Similarly, we check for \(A {\not \rightarrow } C\) in the collider scenario, as this sequence suggests the possibility of B being skipped in the discovered BP. If C acts as a collider in this case, a variant in which only A occurs before C is deemed unfeasible. Here, the algorithm adjusts the graph \(G_{result}\) to initially align with the sequential pattern in the PD, includes the missing causal edge \(A \xrightarrow {c} C\), and labels the directly-follows edge \(B \rightarrow C\) with a ‘c’ to reflect its causal significance. Finally, if A is not a causal factor for B’s execution, the directly-follows relationship \(A \rightarrow B\) is also marked as non-causal.
Algorithm 3 is relevant when the LiNGAM output DAG reveals a mediator structure (with M as the mediator), whereas the PD algorithm identifies this structure as the sequence \(A \rightarrow M \rightarrow B\). To reconcile this discrepancy for the mediator pattern, the generated view should adhere to: R1 - marking the relationship \(A \xrightarrow {c} B\) as causal, and R2 - confirming that both \(A \rightarrow M\) and \(M \rightarrow B\) are causal. Step #3 for the mediator pattern is elaborated in Algorithm 3 (see Sect. 3).
Algorithm 3
Given a discovered BP and a corresponding CBP that may consist of a MEDIATOR pattern
For cases involving a mediator, we also check for \(A {\not \rightarrow } B\), since a direct \(A \rightarrow B\) edge in the discovered BP would give rise to the possibility of M not being executed at all; when M functions as a mediator, any variant omitting M is not feasible. Here, the algorithm sets \(G_{result}\) to match the PD's sequential layout, adds the missing causal edge \(A \xrightarrow {c} B\), and marks the directly-follows edges \(A \rightarrow M\) and \(M \rightarrow B\) with 'c' symbols to denote their causal roles.
3.6 Step IV: Construct a New View that Combines the Results in Step 3 to Highlight All the Discrepancies
To accommodate any situation where the BP model is consistent with the CBP structure, we also annotate all remaining directly-follows edges in the BP model that are also determined to be causal execution dependencies in the CBP. Thus, we annotate as causal any remaining edge \(a \rightarrow b\) in \(G_{BP}\) for which \(a \xrightarrow {c} b\) appears in \(G_L\). Lastly, we unify the results of all three algorithms, jointly with these annotations, into a single \(G_{result}\) graph.
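Step IV can be sketched as a merge of the pattern annotations with a causal marking of the remaining agreeing edges. The edge-set representation is an assumption for illustration, not the authors' implementation.

```python
def build_result_view(bp_edges, cbp_edges, pattern_annotations):
    """Sketch of Step IV: merge pattern results and annotate leftover edges.

    pattern_annotations: dict of edge -> 'c'/'nc' produced by the
    confounder/collider/mediator comparisons (Algorithms 1-3).
    Any remaining directly-follows edge that also appears in the CBP
    is annotated as causal.
    """
    g_result = dict(pattern_annotations)
    for edge in bp_edges:
        if g_result.get(edge) is None:
            g_result[edge] = "c" if edge in cbp_edges else None
    return g_result

bp = {("Accept", "Email"), ("Email", "Archive"), ("Archive", "Close")}
cbp = {("Accept", "Email"), ("Accept", "Archive"), ("Archive", "Close")}
annotations = {("Email", "Archive"): "nc", ("Accept", "Archive"): "c"}
print(build_result_view(bp, cbp, annotations))
```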
4 Soundness and Applicability Evaluation of Our Method
In order to corroborate our hypothesis, i.e., the ability to form a faithful CBP model and highlight potential discrepancies w.r.t. the BP model counterpart, we first carried out several experiments related to our loan example in Fig. 1, using simulated data and two PD algorithms: the alpha miner and the heuristic miner, as detailed in the following subsections. The synthetic datasets can be accessed here.2
The soundness of our method was assessed with respect to the two aforementioned requirements R1 and R2. For the sake of applicability evaluation, we also complemented the synthetic evaluation with two real-world benchmark datasets: Sepsis [20] and Request-for-payment [21]. As proof of applicability, two concrete types of method applications are presented: discrepancy analysis between the discovered BP and the CBP, and comparative analysis of CBPs among different process variants. In this work, focused on establishing the feasibility of our method, we acknowledge that it remains susceptible to other qualitative evaluation aspects, including usability and performance, which will be empirically assessed in the future.
With regard to the theoretical performance of our method, two specific implementations of LiNGAM are available3: DirectLiNGAM and ICALiNGAM. Similar to existing discovery techniques, the performance of both is directly related to the number of variables (i.e., tasks) whose interrelationships are to be examined. Specifically, DirectLiNGAM uses a direct regression approach with a computational complexity that is linear in the number of tasks. For example, for a variant with 35 traces and 3 tasks in the Sepsis benchmark dataset, the runtime was 0.7 s, and for a variant with 350 traces and 5 tasks in the Request-for-Payment dataset, the runtime was 1.8 s.
However, DirectLiNGAM strictly requires non-Gaussian distributions. ICALiNGAM is less strict in this regard, but it is more computationally intensive: it employs an independent component analysis (ICA) based optimization step, which is polynomial (quadratic or cubic) in the number of tasks, and a matrix permutation step with factorial complexity, giving it an overall exponential complexity. Thus, as long as the non-Gaussian distribution assumption holds, DirectLiNGAM is preferable. When ICALiNGAM is necessary, the computation time ranges from seconds to minutes for 10 to 100 tasks, but increases exponentially as the number of tasks grows. Since business processes rarely contain hundreds of activities, practical applicability is retained. Beyond the LiNGAM implementation, the complexity of our method is also quadratic in the number of tasks due to the pairwise comparisons in step II. Similarly, the Alpha and Heuristic algorithms generally exhibit quadratic complexity (\(O(n^2)\)) in the number of process activities, determined by analyzing directly-follows relationships.
To test our hypothesis, we ran each of the datasets with DirectLiNGAM. In all cases, timestamps were adjusted according to the absolute modality with respect to the process start time. For comparison between the two modalities, we also employed the relative modality to the Request-for-payment data.
4.1 Soundness and Applicability Evaluation with Synthetic Data
In the following subsections, we describe our evaluation for each of the patterns using a series of generated event logs corresponding to different fragments of the BP model depicted in Fig. 1. In all three patterns, data generation adhered to a set of pattern-specific equations, reflecting the assumptions of linear relationships and error independence. Each log consisted of 9999 cases, generated with the BIMP open-source log simulation tool,4 this being the maximum number of cases the BIMP simulator can generate. The generated data included activity times drawn from either uniform or exponential distributions, with case start times every 5 and 30 min, respectively. Specific configurations for each log distribution are elaborated further under each pattern.
4.1.1 Confounder
We conducted a total of 10 experiments for the confounder pattern, with five logs in each experiment. Data generation adhered to the case where the Accept task is a confounder, following the equations:
We created two datasets conforming to the configuration shown in Table 1. In the first, Email always completes before Archive starts, dictating full time precedence between the two activities. In the second, we introduced an exponential distribution for both activities, ensuring that Email completes before Archive in most of the cases. Intervals between start times (relative to \(Accept_{time}\) as in Eq. (1)) were extended in the exponential case to ensure a small number of swapped orderings between Email and Archive. Swapping means that the execution order of the two activities alternates between different traces. Data generation conformed to the assumptions of a known confounder as stated in Sect. 2.3.
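The exponential configuration can be mimicked with NumPy to see how swapping arises. The linear generation below (unit coefficients, independent noise) is our assumption about the shape of Eq. (1), and the durations only loosely follow Table 1, interpreting the Exp parameter as a scale.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 9999                                     # cases per simulated log

# Assumed generation: Accept's completion time confounds the start
# times of both Email and Archive (coefficient 1, independent noise).
accept = rng.exponential(2, n)               # Accept duration
email = accept + rng.exponential(2, n)       # Email completion offset
archive = accept + rng.exponential(60, n)    # Archive completion offset

swapped = int(np.sum(archive < email))       # traces where Archive precedes Email
print(f"swapped traces: {swapped} of {n}")
```

With these scales, Archive usually completes well after Email, so only a small fraction of traces swap, mirroring the effect the extended intervals were designed to achieve.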
Table 1
Configuration of random duration variables (in seconds) for each event type

Event name | Uniform duration | Exponential duration
Accept | \(D_1\sim \mathcal {U}([2,4])\) | \(D_1\sim\)Exp(2)
Email | \(D_2\sim \mathcal {U}([7,9])\) | \(D_2\sim\)Exp(2)
Archive | \(D_3\sim \mathcal {U}([10,12])\) | \(D_3\sim\)Exp(60)
4.1.2 Collider
We conducted two experiments for the collider pattern. Specifically, data generation adhered to the case where the CloseApplication task is a collider, following the equations:
We created two datasets conforming to the configuration shown in Table 2. In the first (column #2), Email always completes before Archive starts, as the durations are drawn from non-overlapping distributions, dictating full time precedence between the two activities. In the second (column #3), both activities are also drawn from uniform distributions, but with overlapping supports, so that Email completes before Archive in only about half of the cases.
Table 2
Configuration of random duration variables (in seconds) for each event type

Event name | Uniform duration | Uniform duration
Email | \(D_1\sim \mathcal {U}([5,7])\) | \(D_1\sim \mathcal {U}([5,9])\)
Archive | \(D_2\sim \mathcal {U}([9,11])\) | \(D_2\sim \mathcal {U}([7,11])\)
CloseApplication | \(D_3\sim \mathcal {U}([2,4])\) | \(D_3\sim \mathcal {U}([2,4])\)
4.1.3 Mediator
We conducted an additional experiment for the mediator pattern. Data generation adhered to the case where the CloseApplication task is a partial mediator of the effect that the Archive activity has on the PaperDisposal activity, following the equations:
In the created dataset, all activity durations were drawn from a uniform distribution as shown in Table 3.
Table 3
Configuration of random duration variables (in seconds) for each event type

Event name | Uniform duration
Archive | \(D_1\sim \mathcal {U}([7,11])\)
CloseApplication | \(D_2\sim \mathcal {U}([5,7])\)
PaperDisposal | \(D_3\sim \mathcal {U}([2,4])\)
4.2 Soundness and Applicability Evaluation with Benchmark Data
We explored two benchmark datasets: Sepsis and Request-for-payment. The former shows an example of the difference between the discovered BP and CBP models as arises from real-world datasets, and the latter shows the benefit of comparing causal perspectives associated with different process variants when trying to explain certain outcomes.
4.2.1 Discrepancy Between BP and CBP Models
To complement the synthetic evaluation, we investigated the hospital Sepsis benchmark dataset. This open dataset holds records of 1050 patients experiencing sepsis symptoms, as captured by the ERP system of the hospital, from their arrival in the emergency ward to their admission to the hospital. Overall, the event log consists of a total of 15,000 events that were recorded for 16 different activities. Our particular focus using this dataset has been on identifying a confounder-related situation that presents inconsistency compared to the corresponding process mining structure for further investigation of the discrepancy and possible insight derivation.
Given the inherent complexity and variability in health care processes, in the case of the Sepsis dataset, we employed the heuristic miner algorithm, and the IBM Process Mining tool.5 We did not use the \(\alpha\) algorithm due to its limitation when applying it to noisy or complex data [22]. As with the synthetic data, LiNGAM was then employed to elicit a corresponding CBP perspective.
4.2.2 Comparative Analysis Between Variants
We explored a part of the BPI2020 benchmark to elaborate on the value of the causal execution view for comparative analysis of different variants. Overall, this dataset contains events related to two years of travel expense claims for domestic and international trips by university employees from 2017 to 2018. Concretely, we inspected the part of the dataset containing events related to Requests for Payment, which includes 6886 cases and 36,796 events. We chose to investigate the condition where the payment request is rejected by the administrator, covering 755 cases overall. Following such a rejection, the process unfolds in one of three prominent variants.
For the sake of comparative analysis between variants, we compared the IBM Process Mining tool and LiNGAM results.
5 Results
5.1 Synthetic Data Results
Our findings are consistent with our hypothesis. Across all three patterns, for the cases with no swapping, the results of the applied PD algorithms did not match the CBP structure.
5.1.1 Confounder
We first show the results corresponding to the synthetic dataset configurations in Table 1, split by distribution. For the uniform case, results were consistent across all five logs for both algorithms. The respective process mining algorithm results for each of the log runs are illustrated in Fig. 2. For the exponential case, the results are shown in Fig. 5.
Note that in the exponential case, each log contained several execution traces in which the order of the Email and the Archive tasks is reversed (ranging between 3 and 15). As a result, the \(\alpha\)-algorithm always concluded a split after the execution of the Accept task, with no deterministic order between the Email and Archive tasks. Structure-wise, this result is, in fact, equivalent to the CBP model. In contrast, being more robust to order swapping, the Heuristic algorithm yielded the same sequential flow structure as in the uniform case, where the numbers on the edges denote the number of instances that conform to the execution order (i.e., excluding swapped occurrences). Specifically, Fig. 5b shows the result for a log where the number of swapped cases equals three.
Fig. 5
Process discovery results for the exp. case with no removal of swapped cases in the confounder pattern
To test for the sensitivity of the Heuristic algorithm to the number of swapped cases, we ran two additional tests, one in which we filtered out the swapped cases, and a second in which we generated a log with the uniform distribution configuration, but altered the duration for the Archive task to be drawn from \(\mathcal {U}[8,10]\), to ensure it significantly overlaps with the Email task. The corresponding results for these two tests are presented in Fig. 6. Running a similar test with the \(\alpha\)-algorithm was skipped, given its rigid sensitivity to the number of swapped cases that consistently leads to the result in Fig. 5a.
Fig. 6
Sensitivity of the Heuristic algorithm to swapped cases in the confounder pattern
As with the case of having only a few swapped cases, their elimination retained a sequential structure, as illustrated in Fig. 6a. However, for the second test, we observed that the significant number of swaps also affected the result of the Heuristic algorithm, yielding a structure illustrated in Fig. 6b that corresponds to the CBP as illustrated in Fig. 7.
Given the above results, we tested the LiNGAM algorithm and assessed its results in comparison to each of the cases generated by the PD algorithms, as summarized in Fig. 7.
Figure 7a relates to the cases where there were no swaps at all (as in Fig. 2a, b), and also to the case where the few swaps were removed from the event logs as in Fig. 6a.
Figure 7b relates to the cases where there was a limited number of swapped cases with no removal, as in Fig. 5a, b.
Figure 7c relates to the case with significant swapping cases as in Fig. 6b.
Fig. 7
LiNGAM algorithm results for the confounder pattern
For the cases with a few swapped instances, the \(\alpha\) algorithm generated a flow structure that matches the CBP, whereas the Heuristic algorithm generated a sequential flow structure. This stems from the difference in the thresholds used by the two algorithms, with \(\alpha\) having no tolerance for the presence of swapped cases. Once the number of swapped cases exceeded the threshold, the Heuristic algorithm also generated a flow structure that matched the confounder pattern. In all these cases, LiNGAM consistently identified the confounder pattern. In particular, we can conclude that LiNGAM correctly recognized the cases in which the PD algorithms generated the sequential pattern \(A \rightarrow B \rightarrow C\), and that the regression coefficients, as denoted on the edges of the LiNGAM outputs, tend towards the true value of 1 as in the original data generation equations (see Eq. (1)).
In cases when the CBP model does not match the BP model, the application of Algorithm 1 to the discovered BP as in Fig. 2a yields the highlighted result view (\(G_{result}\)) as shown in Fig. 8.
Fig. 8
Highlighted view (\(G_{result}\)) for the confounder pattern. Respective to the discovered BP in Fig. 2a, the \(Accept \rightarrow Email\) relation is annotated as causal (denoted by ‘C’ inside the connecting place), the \(Email \rightarrow Archive\) relation is annotated as non-causal, and the \(Accept \rightarrow Archive\) relation is new and annotated as causal
Complementing the robustness tests employed for the confounder pattern, we pursued the assessment of the collider pattern with two datasets whose activity durations were drawn from uniform distributions, one with and one without overlapping activities. The results corresponding to the dataset configuration in Table 2 are as follows. For the first uniform distribution, with no overlapping cases between Email and Archive, the process mining results for both the \(\alpha\) and heuristic algorithms are illustrated in Fig. 9. For the second uniform distribution, with some overlapping cases between Email and Archive, the process mining results for the two algorithms are illustrated in Fig. 10.
As with the confounder case, it is apparent that when there are no swapped cases, the \(\alpha\)-algorithm and the heuristic algorithm both yield a sequential structure that is inconsistent with the collider pattern structure. Having only a few swapped cases, or passing the threshold of the heuristic algorithm, yields a structure that is consistent with the collider pattern.
Fig. 9
Process discovery results for the uniform distribution with no overlapping cases in the collider pattern
Running the LiNGAM algorithm on the two distributions resulted in the same CBP structure for both, as illustrated in Fig. 11. As presented, in both distributions LiNGAM was able to identify the collider pattern.
In cases when the CBP model does not match the BP model, the application of Algorithm 2 to the discovered BP as in Fig. 9a yields the highlighted result view (\(G_{result}\)) as shown in Fig. 12.
Fig. 12
Highlighted view (\(G_{result}\)) for the collider pattern. Respective to the discovered BP in Fig. 9a, the \(Email \rightarrow Archive\) relation is annotated as non-causal (denoted by ‘NC’ inside the connecting place), the \(Archive \rightarrow CloseApplication\) relation is annotated as causal, and the \(Email \rightarrow CloseApplication\) relation is new and annotated as causal
We pursued the assessment for the mediator pattern with the uniform distribution dataset that corresponds to the configuration in Table 3. Process mining results for the \(\alpha\) and heuristic mining algorithms are illustrated in Fig. 13.
Fig. 13
Process discovery results for the uniform distribution cases in the mediator pattern
As aforementioned, in the dataset that was used, the CloseApplication activity acted as a partial mediator to the execution dependency between the Archive and the PaperDisposal activities, retaining also a direct execution dependence between the two. However, this direct effect was not apparent in any of the process mining results.
Running the LiNGAM algorithm for the same dataset yielded a CBP structure as illustrated in Fig. 14. As presented, LiNGAM did construct the pattern properly, capturing the mediator pattern with part of the effect being mediated and part being direct.
Application of Algorithm 3 to the discovered BP as in Fig. 13a yields the highlighted result view (\(G_{result}\)) as shown in Fig. 15.
Fig. 15
Highlighted view (\(G_{result}\)) for the mediator pattern. Respective to the discovered BP in Fig. 13a, the \(Archive \rightarrow CloseApplication\) relation is annotated as causal (denoted by ‘C’ inside the connecting place), the \(CloseApplication \rightarrow PaperDisposal\) relation is annotated as causal, and the \(Archive \rightarrow PaperDisposal\) relation is new and annotated as causal
In the second step of the evaluation, examining the hospital Sepsis dataset, we also searched for a situation where CD uncovers a confounder execution pattern while conventional process mining yields a different structure. Both the Heuristic mining algorithm and the IBM Process Mining tool revealed a sequential process structure related to the first three activities in the process: ER Registration, ER Triage, and ER Sepsis Triage as shown in Fig. 16.
The Sepsis dataset also consists of a series of data attributes captured along the events. Among these attributes, the data about each ER Registration event also includes attributes reflective of SIRS (Systemic Inflammatory Response Syndrome) screening criteria. For Sepsis patients, screening is likely to trigger both the ER Triage and the ER Sepsis triage activities, where the completion of the former also triggers the execution of the latter. Hence, a sequential time-precedence relationship among the three activities as observed in the process mining result tells only a part of the story and requires further examination in order to determine if indeed the execution of both triage activities is in fact already determined by the SIRS screening during registration.
Before running LiNGAM, we checked for distribution conformance. A Shapiro-Wilk test showed that the distributions of the activities ER Sepsis Triage (W = 0.36, p-value <.01) and ER Triage (W = 0.06, p-value <.01) departed significantly from normality, thus allowing the use of DirectLiNGAM.
Elicitation of the CBP model for the same three activities yielded a confounder pattern, as shown in Fig. 17. According to this result, we obtained confirmation that the execution of both triage activities is determined by the execution of the ER Registration activity in the process. However, the CBP did not show a causal execution dependence between the basic triage and the sepsis triage. To investigate this lack of causal execution dependence further, we generated pair-wise XY scatter plots to observe the nature of the correlations among the three activities. The plot results are illustrated in Fig. 18. For the two causal execution dependencies denoted by the arrows in the CBP, the majority of points were scattered along the diagonal of the axes, as shown in plots (a) and (b), in accordance with the causal execution dependencies in Fig. 17. However, the lack of causal execution dependence between the two triage activities manifested itself in an interesting 'V'-like shape, prominently depicted in plot (c), with a significant number of points clustered along the Y-axis. This cluster reflects process execution traces in which the execution of the sepsis triage lagged significantly behind the execution of the basic triage for cases where the latter finished relatively quickly. Such a delay is likely indicative of a lack of resources in the process, that is, limited availability of physicians who can attend to patients in the ER soon after the basic triage is concluded in order to perform the sepsis triage. To a lesser extent, the plots also reveal a similar 'V' pattern in which, for very short Registration task executions, ER Sepsis Triage task durations vary in length. Aside from a resource issue, this could also be attributed to instances where the SIRS screening was not done, or not done properly, during registration and had to be repeated during the sepsis triage.
Such insight derivation demonstrates the value of detecting and further delving into an investigation of discrepancies between the BP and the CBP views.
Fig. 17
CBP for the Sepsis data corresponding to the first three activities in the process
In this section, we further demonstrate the value of a comparative investigation of different variants associated with a given (intermediate) process outcome. Using the Request-for-Payment BPI benchmark dataset, we specifically analyzed the part of the process that follows the rejection of a payment request by an administrator. Our analysis begins at the point where the payment request is resubmitted by the employee for approval and continues to the point where the request is handled.
Employing the IBM Process Mining tool for process discovery on this dataset yields the BP model in Fig. 19. We highlighted in the BP model the activities that are part of the aforementioned segment of interest. According to this model, following payment rejection by an Admin, either the employee rejects the submission and withdraws from the process (reaching the process end), or the request is resubmitted by the employee. In the latter case, a sequence of approvals takes place, splitting between two major variants. One variant consists of 136 cases, in which the chain of approvals includes the process Admin, Budget Owner, and Supervisor. A second variant, consisting of 241 cases, skips the approval step by a Budget Owner when the Budget Owner is in fact the same person as the Supervisor. Following the approval, the payment request is repeated and subsequently handled. In such circumstances, should any delays occur with the re-submission of requests, the immediate activity one may choose to investigate according to the BP model would be the approval-by-the-supervisor activity that precedes it.
Fig. 19
IBM process mining BP model for the Request-for-payment dataset
The elicitation of the CBP for the same two major variants may yield slightly different interpretations. The results of running LiNGAM for these two variants are illustrated in Fig. 20. While in the variant with only approvals by an admin and by a supervisor, the latter activity is also the sole cause for the execution of the eventual payment request, this is not the case when there is also an approval by a budget owner. In all such cases, the eventual RequestPayment activity is revealed to be a collider, with its execution timing also partially dependent on the initial approval by the admin. This implies that some of the delays in such requests may inherently be caused by an action made by the admin, and are not necessarily the sole responsibility of the supervisor. Such a conclusion can only be drawn from observing the CBP.
Fig. 20
CBP comparable variants for the Request-for-payment data - Absolute Modality
Figure 21 depicts the CBP results when running the relative modality on the same variants. Normalizing each activity's timestamp on its immediate parent in the process model effectively "cuts off" chains of three or more dependencies, revealing causal relations only between directly-following activity pairs. This eliminates any collider or confounder pattern from the results.
Fig. 21
CBP comparable variants for the Request-for-payment data - Relative Modality
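The relative-modality normalization can be sketched as a simple preprocessing step. This is a sketch under stated assumptions; the function, activity names, and dictionary layout are hypothetical, not the paper's implementation.

```python
# Relative modality: re-express each activity's timestamp as a delta from its
# immediate parent in the mined process model. Names are illustrative.
def to_relative(cases, parent):
    """cases: list of {activity: timestamp}; parent: {activity: parent activity}."""
    out = []
    for case in cases:
        rel = {}
        for act, t in case.items():
            p = parent.get(act)
            # Root activities keep their absolute time; others become offsets.
            rel[act] = t - case[p] if p in case else t
        out.append(rel)
    return out

parent = {"Approve": "Submit", "Pay": "Approve"}   # Submit -> Approve -> Pay
cases = [{"Submit": 0.0, "Approve": 2.0, "Pay": 5.0}]
print(to_relative(cases, parent))  # [{'Submit': 0.0, 'Approve': 2.0, 'Pay': 3.0}]
```

After this transform, Pay's value no longer carries Submit's timing at all, which is why chains of three or more dependencies, and with them collider and confounder patterns, vanish from the discovered graph.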
Our novelty resides in applying CD techniques to find causal execution dependencies in event logs, so our research is positioned at the intersection between PD and CD. While, to the best of our knowledge, there is no related work in this specific space, we discuss some notable works that associate CD and causal inference with business process management. Most PD approaches apply some threshold over time precedence (counting ‘directly follows’ or ‘eventually follows’ relationships), and “causal” relations are sometimes used to express the frequencies of these time-precedence dependencies. One such example is [23], where the representational bias of the hybrid miner algorithm is reduced by exploiting causal graph metrics to mine for long-term dependencies. Our core meaning of causal relationships is different: we follow an approach based on a bidirectional (asymmetric) observation of the relationship between the time properties of any two activities, applying CD techniques (e.g., LiNGAM) to determine the causal execution dependency between any two activities solely by observing a series of task execution times.
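The asymmetry underlying this pairwise view can be illustrated with synthetic data. The following is a simplified residual-dependence check, not the full LiNGAM procedure: with non-Gaussian noise, regressing the effect on the cause leaves a residual independent of the regressor, while the reverse regression does not.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# x causes y, with non-Gaussian (uniform) noise as LiNGAM assumes.
x = rng.uniform(-1, 1, n)
e = 0.5 * rng.uniform(-1, 1, n)
y = x + e

def residual(cause, effect):
    """OLS residual of effect ~ cause (both variables are zero-mean here)."""
    b = np.dot(cause, effect) / np.dot(cause, cause)
    return effect - b * cause

def dep(a, r):
    """Crude higher-order dependence score: |corr(a^2, r^2)|.
    Near zero when a and r are independent."""
    return abs(np.corrcoef(a**2, r**2)[0, 1])

forward = dep(x, residual(x, y))   # correct direction: residual ~ noise, score ~0
reverse = dep(y, residual(y, x))   # wrong direction: residual mixes x and e
print(forward < reverse)  # True: the timing asymmetry reveals the direction
```

With Gaussian noise both residuals would be fully independent of their regressors and the two directions indistinguishable; the non-Gaussianity assumption is precisely what lets LiNGAM pick a unique direction from execution times alone.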
Causal-nets [2, 24], introduced in 2002 by van der Aalst, are a form of process representation (i.e., a notation) that is expressively richer than a variety of modeling languages (e.g., Petri-nets, BPMN, and EPC), and could serve as a viable alternative representation of CBPs. However, being merely a notation, causal-nets do not include any mechanism to infer the causal structure among activity executions, and the knowledge about the causal structure should be either provided by domain experts or by using some discovery algorithm as tackled in our work.
As mentioned in the background section, methods for causal inference and discovery are split between techniques that assess the quantitative extent of the impact of an intervention and techniques that determine whether qualitative causal relationships exist with respect to a given dataset, with no further intervention. A technique such as the Conditional Average Treatment Effect (CATE) [25] provides a measurement to assess the magnitude of an intervention. For example, the work in [26] used CATE to create an uplift tree that maximizes the CATE in the splits, to determine the best action to be taken. That work assumed the confounding set is given, while we discover it using LiNGAM. In [27, 28], CATE is used for prescriptive process monitoring, to prevent interventions from being triggered unnecessarily when the level of confidence is insufficient. Here as well, the causal model is given, while we discover it by applying LiNGAM to the execution times of the activities in the process.
In [29], the authors used causality to answer what-if questions for process improvement relying on a given process model, while we focus on its discovery out of the event logs. In [30], the authors propose a technique that generates a graph of causal factors explaining process performance. To detect causal relations, they test for Granger causality, a statistical test widely used for causal analysis of time series. The idea is that values of performance indicators are seen as time series while we use CD to determine dependencies among task execution times. In other words, in [30] the relations between variables are determined based on correspondences among timestamped values (of the KPIs) rather than solely between the timestamps (of the tasks) themselves as in our method.
In [31, 32], an alternative way to discover a business process model from a double-timestamp event log (i.e., one in which each activity has a start time and a complete time) is proposed. The algorithm is based on “temporal causal relations”, which are flow patterns reflective of different flow structures based on types of time-precedence relations among activities in the event log (e.g., same start time, same end time, contains, overlaps). These patterns can reveal sequential and parallel AND, OR, and XOR relations.
That work uses a modified version of the Alpha Miner, called the Modified Time-based Alpha Miner, which incorporates timing information to better handle real-world complexities in process mining. As in the core comparisons presented with our synthetic examples, the fundamental difference remains: the process model is constructed based on the frequency and order of activities, not on a causal statistical relationship between the timing of those activities. Thus, even with timestamps, LiNGAM would not inherently construct process models or reveal the kind of directly-follows relationships typical in process mining; rather, it attempts to infer causal directions from the statistical properties of the timing, specifically assuming linear dependencies and a non-Gaussian distribution of the errors, which is conceptually and methodologically distinct from the goals of process mining.
Another work that proposes an alternative to conventional process mining is [33]. Here, the authors propose the ARE algorithm, which identifies action-response-effect patterns in order to understand relevant aspects of process execution beyond the control-flow perspective. To achieve this, they rely on statistical relations discovered from the event logs through statistical tests, using time adjacency between events as a signal for cause-effect dependence. While the ARE algorithm is essentially a frequency-based discovery of directly-follows relations with cause-effect pairs already given in the log, our method relies merely on the functional (linear) relation among the event timestamps, without assuming any prior knowledge about the relationships among the tasks.
In [34], causal event models are introduced to correctly capture the n : m relations among events stored in a relational database. Causal relations are identified by the use of foreign key relationships. More recently, the work in [35] employs relational databases to tackle the problem of spurious relations. It demonstrates that directly-follows miners produce numerous spurious relationships that can be reduced. Our approach differs in some aspects. First, our notion of causality is based on causal execution dependencies and not on time precedence. Second, we base our CD solely on the input event log. Third, our aim is to produce a new view that emphasizes the causal execution dependencies rather than altering the process model discovery itself.
With respect to PD algorithms, we employed the \(\alpha\) and Heuristic miners on the synthetic data. We acknowledge that there are more contemporary algorithms, such as the Flexible Heuristics Miner (FHM) [36], the Inductive Miner [37], and the Split Miner [38], which are more robust to noisy and incomplete data. For the sake of establishing the core hypothesis, the two selected algorithms are representative, while for the Sepsis analysis we employed a commercial tool.
While we used the LiNGAM algorithm for CD, there are other alternatives recently developed and applied in the context of causal analysis in business processes. For example, the work in [39] shows how explanations (about the effects of changes) can be facilitated by causal structures describing a variety of features of the process execution. Concretely, that work demonstrated the viability of the approach using the Greedy Fast Causal Inference (GFCI) [40] algorithm. Under the assumptions of a large sample size, linear dependencies among features, and additive noise, this algorithm can yield faithful results.
Another example is the work in [41], which infers causal relationships among decision points (i.e., splits) in the process, aiming to determine causal dependency chains along process decisions and to estimate the probability that a decision affects the process outcome. It employs the Missing Value PC (MVPC) CD algorithm [42], an extension of PC [8] that starts from a fully connected undirected graph over all variables and gradually removes edges based on statistical tests of causal independence.
The main underlying assumption in [41] is that the process model is given and correct. In contrast, we apply process discovery techniques to mine the process model, apply CD techniques to mine the causal process model from the same event log, and use the latter as a reference to test whether each relation in the discovered process model is causal from an execution-time perspective.
Both the MVPC and GFCI algorithms belong to a family of constraint-based algorithms, yielding a set of causal graphs that are consistent with the data in the form of a Partial Ancestral Graph (PAG). While both algorithms may yield equivalence classes of possible DAGs, the LiNGAM algorithm infers a unique DAG for a given dataset, making it more appropriate for the purpose of discrepancy analysis with the BP model.
7 Conclusions and Future Work
Traditionally, process discovery techniques have been associational rather than causal, limiting the ability to reason about potential process changes and thus hampering process improvement capabilities.
The ability to reason about the causes leading to certain consequences in a business process largely depends on the ability to infer the correct chain of event executions that affect these consequences. Current discovery techniques use time precedence as the primary guiding rationale for determining task ordering in the discovered process model. In this paper, we show that relying merely on time precedence can sometimes deviate from the real causal dependencies in the business process model and, as a result, lead to inadequate execution interpretations. We provide a method to reveal the causal execution dependencies present in an event log by applying the LiNGAM algorithm. We also provide a method for constructing a new view that overlays the discovered process model with causal relationships, and which is agnostic to the concrete PD algorithm employed.
To the best of our knowledge, our work is a first attempt at applying state-of-the-art CD algorithms to the timing of activities in order to detect, and respectively highlight, causal execution dependencies among tasks over the output of process discovery techniques.
We envision different directions as future work. First, we could relax the assumption of known common causes (e.g., observed confounders) to test for latent causal tasks.
Second, this work calls for an empirical user study to test the interpretability and usefulness of the proposed discrepancy view.
Third, one may leverage CBPs for the sake of process outcome explainability. We have previously shown [43] that XAI frameworks such as LIME can be extended for better business process explainability; a similar approach was also employed with SHAP [44]. We could improve this naïve approach, which bundled all attributes into the input vector to the explainer, by stratifying the attributes based on the true causal dependencies as discovered in the underlying causal model.
Last, the discovery of the actual causal process execution can yield a significant means to derive more logically compact explanations (i.e., sufficient reasons [45]). Imposing causal dependency constraints over the overall space of explanations can help remove explanations that are logically subsumed by other explanations that are shorter and do not embed redundant reasons, and that are, as a result, also likely to be more comprehensible.
Causal execution dependencies play a vital role as an instrument for a deeper understanding of process executions, and a vehicle for process improvement. Our ultimate goal is to enrich current PM tools with a CBP-view overlaying feature.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
1.
Pearl J (2000) Causality: models, reasoning, and inference. Cambridge University Press, New York
2.
van der Aalst W (2016) Process mining: data science in action, vol 2. Springer, Berlin
3.
Dumas M et al (2023) AI-augmented business process management systems: a research manifesto. ACM Trans Manage Inf Syst 14(1):1–19
4.
van der Aalst W et al. (2011) Process mining manifesto. In: International conference on business process management. pp. 169–194. Springer
5.
van der Aalst W, Weijters T, Maruster L (2004) Workflow mining: discovering process models from event logs. IEEE Trans Knowl Data Eng 16(9):1128–1142
6.
Peters J, Janzing D, Schölkopf B (2017) Elements of causal inference: foundations and learning algorithms. The MIT Press, Cambridge
7.
Pearl J (2009) Causality. Cambridge University Press, Cambridge
8.
Spirtes P, Glymour CN, Scheines R, Heckerman D (2000) Causation, prediction, and search. MIT Press, Cambridge
Parente C, Costa CJ (2022) Comparing process mining tools and algorithms. In: 2022 17th Iberian Conference on Information Systems and Technologies. pp. 1–7
23.
Kourani H, Di Francescomarino C, Ghidini C, van der Aalst W, van Zelst S (2023) Mining for long-term dependencies in causal graphs. In: Business Process Management Workshops. pp. 117–131. Springer International Publishing, Cham
24.
van der Aalst W, Adriansyah A, van Dongen B (2011) Causal nets: A modeling language tailored towards process discovery. In: CONCUR 2011 - Concurrency Theory. pp. 28–42. Springer
25.
Künzel SR, Sekhon JS, Bickel PJ, Yu B (2019) Metalearners for estimating heterogeneous treatment effects using machine learning. Proc Natl Acad Sci U S A 116(10):4156–4165
26.
Bozorgi ZD, et al. (2020) Process mining meets causal machine learning: Discovering causal rules from event logs. In: 2nd ICPM’2020. pp. 129–136. IEEE
27.
Shoush M, Dumas M (2022) When to intervene? prescriptive process monitoring under uncertainty and resource constraints. In: Business Process Management Forum. pp. 207–223. Springer International Publishing, Cham
28.
Shoush M, Dumas M (2023) Intervening with confidence: Conformal prescriptive monitoring of business processes. arXiv:2212.03710
29.
Narendra T, Agarwal P, Gupta M, Dechu S (2019) Counterfactual reasoning for process optimization using structural causal models. In: International Conference on Business Process Management, pp. 91–106. Springer
30.
Hompes BFA et al (2017) Discovering causal factors explaining business process performance variation. In: Dubois E, Pohl K (eds) Advanced Information systems engineering. Springer, Cham, pp 177–192
34.
Waibel P, Novak C, Bala S, Revoredo K, Mendling J (2021) Analysis of business process batching using causal event models. In: Leemans S, Leopold H (eds) Process mining workshops. Springer, Cham, pp 17–29
35.
Waibel P, Pfahlsberger L, Revoredo K, Mendling J (2022) Causal process mining from relational databases with domain knowledge. arXiv:2202.08314
36.
Weijters A, Ribeiro J (2011) Flexible heuristics miner (fhm). In: 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM). pp. 310–317
37.
Leemans S (2017) Robust process mining with guarantees. PhD thesis, Mathematics and Computer Science, Eindhoven University of Technology
38.
Augusto A, Conforti R, Dumas M, Rosa ML, Polyvyanyy A (2019) Split miner: automated discovery of accurate and simple business process models from event logs. Knowl Inf Syst 59:251–284. https://doi.org/10.1007/s10115-018-1214-x
39.
Qafari MS, van der Aalst W (2020) Root cause analysis in process mining using structural equation models. In: Business Process Management Workshops. pp. 155–167. Springer International Publishing, Cham
40.
Ogarrio JM, Spirtes P, Ramsey J (2016) A hybrid causal search algorithm for latent variable models. In: Proceedings of the Eighth International Conference on Probabilistic Graphical Models. vol. 52, pp. 368–379. Lugano, Switzerland
41.
Leemans SJJ, Tax N (2022) Causal reasoning over control-flow decisions in process models. In: CAiSE 2022, Leuven, Belgium, June 6–10, 2022, Proceedings. LNCS, vol. 13295, pp. 183–200. Springer
42.
Tu R., et al (2019) Causal discovery in the presence of missing data. In: Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics. vol. 89, pp. 1762–1770. PMLR
43.
Amit G, Fournier F, Gur S, Limonad L (2022) Model-informed LIME Extension for Business Process Explainability. In: PMAI@IJCAI’22. CEUR, Vienna
44.
Heskes T, Sijben E, Bucur IG, Claassen T (2020) Causal Shapley values: exploiting causal knowledge to explain individual predictions of complex models. Adv Neural Inf Process Syst 33:4778–4789
45.
Gorji N, Rubin S (2022) Sufficient reasons for classifier decisions in the presence of domain constraints. Proc AAAI 36(5):5660–5667