1 Introduction
-
In the online shop scenario, the responsible employees are causing the delays.
-
In the second scenario, we may conclude that over time the employees get more and more reckless, and consequently the rate of deviations increases.
-
In the IT company, we may conclude that the more people working on a project, the more time is spent on team management and communication, which prolongs the project unnecessarily.
2 Motivating example
-
“Priority”, a feature of the activity “Business case development” that indicates how urgent the software is for the customer,
-
“Team size”, a feature of the activity “Team charter” that specifies the number of resources working on a project,
-
“Duration”, a feature of the activity “Product backlog” that indicates how long this activity took.
3 Related work
-
In [10], the authors propose an approach for discovering causal relationships between a range of business process characteristics and process performance indicators based on time-series analysis. The idea is to generate a set of time series from the values of the performance indicators and then apply the Granger causality test to them to discover their causal relationships. The Granger test is a statistical hypothesis test that detects predictive causality; consequently, the relationships discovered with this approach might not be true cause-and-effect relationships.
-
In [11], the authors use the event log and the BPMN model of a process to discover the structural causal model of the process features. They first apply loop unfolding on the BPMN model of the process and generate a partial order of features. They then use the generated partial order to guide the search algorithm. This work assumes that the BPMN model is an accurate model of the process, which is not always the case.
4 Overview of the method
4.1 Data extraction
Case ID | Activity name | Timestamp | Priority | Team size | Duration | Responsible | Implementation phase duration |
---|---|---|---|---|---|---|---|
1 | Business case development | 20.10.2018 | 2 | | | Alice | 324 |
1 | Feasibility study | 15.1.2019 | | | 87 | Alice | 324 |
1 | Product backlog | 19.2.2019 | | | 35 | Alice | 324 |
1 | Team charter | 19.3.2019 | | 21 | 28 | Alice | 324 |
1 | Development | 19.11.2019 | | | 245 | Alice | 324 |
1 | Test | 6.2.2020 | | | 79 | Alice | 324 |
1 | Release | 8.2.2020 | | | 2 | Alice | 324 |
2 | Business case development | 20.2.2019 | 1 | | | Alex | 807 |
2 | Feasibility study | 22.2.2019 | | | 33 | Alex | 807 |
2 | Product backlog | 26.4.2019 | | | 63 | Alex | 807 |
2 | Team charter | 3.5.2019 | | 33 | 7 | Alex | 807 |
2 | Development | 3.2.2020 | | | 276 | Alex | 807 |
2 | Test | 17.4.2020 | | | 74 | Alex | 807 |
2 | Release | 25.4.2020 | | | 8 | Alex | 807 |
2 | Development | 31.3.2021 | | | 340 | Alex | 807 |
2 | Test | 26.7.2021 | | | 117 | Alex | 807 |
2 | Release | 29.7.2021 | | | 3 | Alex | 807 |
-
Trace situation, when the class feature is one of the trace features, e.g., trace delay, and each situation is a trace.
-
Event situation, when the class feature is one of the event features, e.g., the duration of activity “Test” (in the context of IT company in Sect. 2), and each situation is a prefix of a trace and its trace-level attributes. In this example, each situation includes a prefix of a trace in the IT company event log ending with an event with the activity name “Test” and the trace-level attributes of that trace.
-
the duration of the trace,
-
the timestamp of events with activity name “Test”,
-
the duration of the events with activity name “Development”, or
-
the resource of the events that took longer than 80 days.
-
the duration of the trace is 807 days,
-
the timestamp of the event with activity name “Test” is 26.7.2021,
-
the duration of the event with activity name “Development” is 340 days, and
-
the resource of the events that took longer than 80 days is Alex, which is the resource of the event with activity name “Test” and a duration of 117 days.
-
a trace if the second term of the class situation feature is empty. In other words, the first element of the class situation feature is a trace-level attribute name.
-
a prefix of a trace and its trace-level attributes in the event log ending with an event that belongs to the event group specified by the second term of the class situation feature. In this case, the class situation feature is an event-level situation feature.
4.2 Feature recommendation
4.3 Causal inference
5 Preliminaries
5.1 Process mining
-
\(\mathcal {U}_{att}\) is the universe of attribute names, where \(\{ actName, timestamp, caseID\} \subseteq \mathcal {U}_{att}\). actName indicates the activity name, timestamp indicates the timestamp of an event, and caseID is an identifier indicating the trace (process instance) that the event belongs to.
-
\(\mathcal {U}_{val}\) is the universe of values.
-
\(\textit{values}\in \mathcal {U}_{att} \mapsto {\mathbb {P}}(\mathcal {U}_{val})\) is a function that returns the set of all possible values of a given attribute name.
-
\(\mathcal {U}_{map}=\{ m \in \mathcal {U}_{att} \not \mapsto \mathcal {U}_{val}\mid \forall at \in dom(m):m(at) \in \textit{values}(at) \}\) is the universe of all mappings from a set of attribute names to attribute values of the correct type.
\(e_1{:}{=} \{(caseID,1), (Responsible, Alice), (actName,\text {``Business case development''}), (timestamp, 20.10.2018), (Priority, 2)\}\)
\(e_2{:}{=} \{(caseID,1),(actName,\text {``Feasibility study''}),(timestamp, 15.1.2019)\}\)
\(e_3{:}{=} \{(caseID,1),(Responsible, Alice), (actName,\text {``Product backlog''}),(timestamp, 19.2.2019), (Duration,35)\}\)
\(e_4{:}{=} \{(caseID,1),(Responsible, Alice), (actName,\text {``Team charter''}),(timestamp, 19.3.2019), (Team\ size,21)\}\)
\(e_5{:}{=} \{(caseID,1),(Responsible, Alice),(actName,\text {``Development''}),(timestamp, 19.11.2019), (Duration,245)\}\)
\(e_6{:}{=} \{(caseID,1),(Responsible, Alice), (actName,\text {``Test''}),(timestamp, 6.2.2020), (Duration,79)\}\)
\(e_7{:}{=} \{(caseID,1),(Responsible, Alice), (actName,\text {``Release''}),(timestamp, 8.2.2020)\}\)
\(e_8{:}{=} \{(caseID,2),(Responsible, Alex), (actName,\text {``Business case development''}),(timestamp, 20.2.2019), (Priority, 1)\}\)
\(e_9{:}{=} \{(caseID,2),(Responsible, Alex), (actName,\text {``Feasibility study''}),(timestamp, 22.2.2019)\}\)
\(e_{10}{:}{=} \{(caseID,2),(Responsible, Alex), (actName,\text {``Product backlog''}),(timestamp, 26.4.2019), (Duration,63)\}\)
\(e_{11}{:}{=} \{(caseID,2),(Responsible, Alex), (actName,\text {``Team charter''}),(timestamp, 3.5.2019), (Team\ size,33)\}\)
\(e_{12}{:}{=} \{(caseID,2),(Responsible, Alex), (actName,\text {``Development''}),(timestamp, 3.2.2020), (Duration,276)\}\)
\(e_{13}{:}{=} \{(caseID,2),(Responsible, Alex), (actName,\text {``Test''}),(timestamp, 17.4.2020), (Duration,74)\}\)
\(e_{14}{:}{=} \{(caseID,2),(Responsible, Alex), (actName,\text {``Release''}),(timestamp, 25.4.2020)\}\)
\(e_{15}{:}{=} \{(caseID,2),(Responsible, Alex), (actName,\text {``Development''}),(timestamp, 31.3.2021), (Duration,340)\}\)
\(e_{16}{:}{=} \{(caseID,2),(Responsible, Alex), (actName,\text {``Test''}),(timestamp, 26.7.2021), (Duration,117)\}\)
\(e_{17}{:}{=} \{(caseID,2),(Responsible, Alex), (actName,\text {``Release''}),(timestamp, 29.7.2021)\}\)
-
the set of events with specific activity names,
-
the set of events which are done by specific resources,
-
the set of events that start in a specific time interval during the day, or,
-
the set of events with a specific duration.
-
G-based situation subset of L as \(S_{L,G}= \{ (\langle e_1, \dots , e_n \rangle , m) \in S_L \mid e_n \in G\}\), and
-
trace-based situation subset of L as \(S_{L,\bot } =L\).
-
if \(G=\bot \), then \( \# _{(at,G)} ((\sigma ,m)) = m(at) \) and
-
if \(G \in \mathcal {G}\), then \( \# _{(at,G)} ((\sigma ,m)) = e(at) \) where \(e = \operatorname*{arg\,max}_{e' \in G \cap \{ e'' \in \sigma \}} e'(timestamp)\), i.e., e is the latest event of \(\sigma \) that belongs to G.
\(\textit{sf}_1 = (Team\ size, G_3)\) | \(\textit{sf}_2 = (Duration, G_2)\) | \(\textit{sf}_3 = (Priority, G_1)\) | \(\textit{sf}_4 = (Duration, G_4)\) |
---|---|---|---|
21 | 35 | 2 | 245 |
33 | 63 | 1 | 276 |
33 | 63 | 1 | 340 |
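The extraction rule above can be sketched in a few lines of Python. The sketch below rebuilds the three rows of the table from the events \(e_1, \dots , e_{17}\). It assumes that \(G_1\), \(G_2\), \(G_3\), and \(G_4\) are the groups of events with activity names “Business case development”, “Product backlog”, “Team charter”, and “Development”, respectively, which is inferred from the table values rather than stated here. The helper names (`event`, `in_group`, `event_situations`, `extract`) are hypothetical, and trace-level attributes are omitted for brevity.

```python
from datetime import date


def event(case_id, act_name, timestamp, extra=None):
    """Build an event as a mapping from attribute names to values."""
    e = {"caseID": case_id, "actName": act_name, "timestamp": timestamp}
    e.update(extra or {})
    return e


# Traces of the running example (events e_1..e_17), keyed by case ID.
TRACES = {
    1: [
        event(1, "Business case development", date(2018, 10, 20), {"Priority": 2}),
        event(1, "Feasibility study", date(2019, 1, 15)),
        event(1, "Product backlog", date(2019, 2, 19), {"Duration": 35}),
        event(1, "Team charter", date(2019, 3, 19), {"Team size": 21}),
        event(1, "Development", date(2019, 11, 19), {"Duration": 245}),
        event(1, "Test", date(2020, 2, 6), {"Duration": 79}),
        event(1, "Release", date(2020, 2, 8)),
    ],
    2: [
        event(2, "Business case development", date(2019, 2, 20), {"Priority": 1}),
        event(2, "Feasibility study", date(2019, 2, 22)),
        event(2, "Product backlog", date(2019, 4, 26), {"Duration": 63}),
        event(2, "Team charter", date(2019, 5, 3), {"Team size": 33}),
        event(2, "Development", date(2020, 2, 3), {"Duration": 276}),
        event(2, "Test", date(2020, 4, 17), {"Duration": 74}),
        event(2, "Release", date(2020, 4, 25)),
        event(2, "Development", date(2021, 3, 31), {"Duration": 340}),
        event(2, "Test", date(2021, 7, 26), {"Duration": 117}),
        event(2, "Release", date(2021, 7, 29)),
    ],
}


def in_group(activity):
    """Event group containing all events with the given activity name (assumed)."""
    return lambda e: e["actName"] == activity


def event_situations(traces, group):
    """All prefixes <e_1, ..., e_n> whose last event e_n belongs to the group."""
    return [trace[: i + 1]
            for trace in traces.values()
            for i, e in enumerate(trace) if group(e)]


def extract(attribute, group, prefix):
    """Value of `attribute` on the latest event of the prefix that is in the group."""
    candidates = [e for e in prefix if group(e)]
    if not candidates:
        return None
    latest = max(candidates, key=lambda e: e["timestamp"])
    return latest.get(attribute)


# Situation features sf_1..sf_4 as (attribute name, activity defining the group).
PLAN = [("Team size", "Team charter"), ("Duration", "Product backlog"),
        ("Priority", "Business case development"), ("Duration", "Development")]

TABLE = [[extract(at, in_group(act), prefix) for at, act in PLAN]
         for prefix in event_situations(TRACES, in_group("Development"))]

for row in TABLE:
    print(row)
```

Trace 2 contributes two rows because it contains two “Development” events, and the second row takes its duration from the later one.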
5.2 Structural equation model
\((Priority, G_1) = N_{(Priority, G_1)}\) | \(N_{(Priority, G_1)}\sim Uniform (1,3)\) |
\((Team\ size, G_3) =10(Priority, G_1) + N_{(Team\ size, G_3)}\) | \(N_{(Team\ size, G_3)} \sim Uniform(1,15)\) |
\((Duration, G_2) =2 (Team\ size, G_3) + N_{(Duration, G_2)}\) | \(N_{(Duration, G_2)} \sim Uniform(-5,5)\) |
\((Duration, G_4) =5(Duration, G_2) +10(Priority, G_1) + (Team\ size, G_3) +N_{(Duration, G_4)}\) | \(N_{(Duration, G_4)} \sim Uniform(-100,100)\) |

\((Priority, G_1) = N_{(Priority, G_1)}\) | \(N_{(Priority, G_1)}\sim Uniform (1,3)\) |
\((Team\ size, G_3) =13\) | |
\((Duration, G_2) =2 (Team\ size, G_3) + N_{(Duration, G_2)}\) | \(N_{(Duration, G_2)} \sim Uniform(-5,5)\) |
\((Duration, G_4) =5(Duration, G_2) + 10 (Priority, G_1) + (Team\ size, G_3) +N_{(Duration, G_4)}\) | \(N_{(Duration, G_4)} \sim Uniform(-100,100)\) |
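To make the difference between the two SEMs concrete, the following sketch samples instances from both. It is only an illustration under stated assumptions: the noise terms are drawn as continuous uniform values, the variable names are hypothetical, and the second SEM is obtained from the first by replacing the equation of \((Team\ size, G_3)\) with the constant 13.

```python
import random


def sample_sem(rng, team_size_intervention=None):
    """Draw one instance from the example SEM.

    If `team_size_intervention` is given, the equation of (Team size, G_3)
    is replaced by that constant, as in the second SEM above.
    """
    priority = rng.uniform(1, 3)
    if team_size_intervention is None:
        team_size = 10 * priority + rng.uniform(1, 15)
    else:
        team_size = team_size_intervention
    backlog_duration = 2 * team_size + rng.uniform(-5, 5)
    development_duration = (5 * backlog_duration + 10 * priority
                            + team_size + rng.uniform(-100, 100))
    return {
        "priority": priority,
        "team_size": team_size,
        "backlog_duration": backlog_duration,
        "development_duration": development_duration,
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
observed = [sample_sem(rng) for _ in range(1000)]
intervened = [sample_sem(rng, team_size_intervention=13) for _ in range(1000)]

mean = lambda rows, key: sum(r[key] for r in rows) / len(rows)
print("mean development duration, observed:  ", mean(observed, "development_duration"))
print("mean development duration, intervened:", mean(intervened, "development_duration"))
```

Under the intervention, the priority no longer influences the team size, so every downstream value depends on the team size only through the fixed value 13.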
6 Approach
6.1 Automated situation feature recommendation
6.2 SEM inference
-
The first step is causal structure discovery, which involves discovering the causal structure of the situation feature table. This causal structure encodes the existence and the direction of the causal relationships among the situation features in the situation extraction plan of the given situation feature table.
-
The second step is causal strength estimation, which involves estimating a set of equations describing how each situation feature is influenced by its immediate causes. Using this information we can generate the SEM of the given situation feature table.
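As a minimal illustration of the second step, the sketch below estimates the strength of a single causal relationship by ordinary least squares. It assumes a linear equation with additive noise and uses data simulated from \((Duration, G_2) = 2 (Team\ size, G_3) + N_{(Duration, G_2)}\), the form used in the example of Sect. 5.2. The function `fit_line` is a hypothetical helper, and least squares is one possible estimator, not necessarily the one used in the implementation.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Simulate the cause (team size) and the effect (backlog duration).
team_size = [10 * rng.uniform(1, 3) + rng.uniform(1, 15) for _ in range(2000)]
duration = [2 * ts + rng.uniform(-5, 5) for ts in team_size]

slope, intercept = fit_line(team_size, duration)
print(f"estimated equation: duration = {slope:.2f} * team_size + {intercept:.2f}")
```

The estimated slope should be close to the coefficient 2 used to generate the data, and the residuals approximate the noise term.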
6.2.1 Causal structure discovery
-
\(\textit{sf}_1 \rightarrow \textit{sf}_2\) indicates that \(\textit{sf}_1\) is a direct cause of \(\textit{sf}_2\).
-
\(\textit{sf}_1 \leftrightarrow \textit{sf}_2\) means that neither \(\textit{sf}_1\) nor \(\textit{sf}_2\) is an ancestor of the other one, even though they are probabilistically dependent (i.e., \(\textit{sf}_1\) and \(\textit{sf}_2\) are both caused by one or more hidden confounders).
-
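\(\textit{sf}_1 \circ \!\!\rightarrow \textit{sf}_2\) means \(\textit{sf}_2\) is not a direct cause of \(\textit{sf}_1\).
-
\(\textit{sf}_1 \circ \!\!-\!\!\circ \textit{sf}_2\) indicates that there is a relationship between \(\textit{sf}_1\) and \(\textit{sf}_2\), but nothing is known about its direction.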
-
If \((\textit{sf}_1, \textit{sf}_2) \in D_{req}\), then we have \(\textit{sf}_1 \rightarrow \textit{sf}_2\) or \(\textit{sf}_1 \circ \!\!\rightarrow \textit{sf}_2\) in the output PAG.
-
If \((\textit{sf}_1, \textit{sf}_2) \in D_{frb}\), then in the discovered PAG it should not be the case that \(\textit{sf}_1 \rightarrow ~\textit{sf}_2\).
6.2.2 Causal strength estimation
7 Experimental results
7.1 Implementation notes
-
Time perspective: timestamp, activity duration, trace duration, trace delay, sub-model duration.
-
Control-flow perspective: next activity, previous activity.
-
Conformance perspective: deviation, number of log moves, number of model moves.
-
Resource organization perspective: resource, role, group.
-
Aggregated features (regarding a given time window):
-
Process perspective: the number of waiting cases, process workload.
-
Trace perspective: average service time, average waiting time.
-
Event perspective: number of active events with a specific activity name, number of waiting events with a specific activity name.
-
Resource perspective: average service time, average waiting time
-
The instances in the situation feature table are independent and identically distributed.
-
The causal Markov condition, which is a form of local causality [22]. This condition states that a situation feature is independent of all other situation features except its descendants, given its direct causes (parents).
-
The causal faithfulness condition [22]. This condition states that all the independence relationships among the measured situation features are implied by the causal Markov condition.
-
No selection bias, which implies that the presence of each instance in the situation feature table is independent of the values of its measured situation features.
-
There is no feedback cycle among the measured situation features.
7.2 Synthetic event log
\((Complexity, \bot )= N_{(Complexity, \bot )}\) | \(N_{(Complexity, \bot )} \sim Uniform(1,10)\) |
\((Priority, G_1) = N_{(Priority, G_1)}\) | \(N_{(Priority, G_1)}\sim Uniform (1,3)\) |
\((Duration, G_2) =10 (Complexity, \bot ) + N_{(Duration, G_2)}\) | \(N_{(Duration, G_2)} \sim Uniform(-2,4)\) |
\((Team\ size, G_3) =5(Complexity, \bot ) + 3(Priority, G_1) +N_{(Team\ size, G_3)}\) | \(N_{(Team\ size, G_3)} \sim Uniform(-1,2)\) |
\((Implementation\ phase\ duration, \bot ) =50(Complexity, \bot ) + 5(Team\ size, G_3) +N_{(Implementation\ phase\ duration, \bot )}\) | \(N_{(Implementation\ phase\ duration, \bot )} \sim Uniform(10,20)\) |
 | \(N_{(Complexity, \bot )}\) | \(N_{(Priority, G_1)}\) | \(N_{(Duration, G_2)}\) | \(N_{(Team\ size, G_3)}\) | \(N_{(Implementation\ phase\ duration, \bot )}\) |
---|---|---|---|---|---|
\(\mathcal {EQ}_1\) | \(Uniform(1,10)\) | \(Uniform(1,3)\) | \(Uniform(-2,4)\) | \(Uniform(-1,2)\) | \(Uniform(10,20)\) |
\(\mathcal {EQ}_2\) | \(Uniform(1,10)\) | \(Uniform(1,3)\) | \(Uniform(-2,58)\) | \(Uniform(-1,29)\) | \(Uniform(10,210)\) |
\(\mathcal {EQ}_3\) | \(Uniform(1,10)\) | \(Uniform(1,3)\) | \(Uniform(-2,118)\) | \(Uniform(-1,59)\) | \(Uniform(10,310)\) |
\(\mathcal {EQ}_4\) | \(Uniform(1,10)\) | \(Uniform(1,3)\) | \(Uniform(-2,178)\) | \(Uniform(-1,89)\) | \(Uniform(10,410)\) |
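The four noise settings can be compared by sampling from the equations of this section. The sketch below is a rough illustration rather than the generator used in the experiments: it assumes continuous uniform noise, draws one row of situation feature values per instance instead of a full event log, and uses hypothetical names (`NOISE`, `sample_instance`). It reports how the variance of the implementation phase duration grows from \(\mathcal {EQ}_1\) to \(\mathcal {EQ}_4\).

```python
import random
import statistics

# Noise ranges per setting, copied from the table above. Complexity and
# priority use Uniform(1, 10) and Uniform(1, 3) in every setting.
NOISE = {
    "EQ_1": {"duration": (-2, 4), "team_size": (-1, 2), "implementation": (10, 20)},
    "EQ_2": {"duration": (-2, 58), "team_size": (-1, 29), "implementation": (10, 210)},
    "EQ_3": {"duration": (-2, 118), "team_size": (-1, 59), "implementation": (10, 310)},
    "EQ_4": {"duration": (-2, 178), "team_size": (-1, 89), "implementation": (10, 410)},
}


def sample_instance(rng, noise):
    """Draw one row of situation feature values under the given noise setting."""
    complexity = rng.uniform(1, 10)
    priority = rng.uniform(1, 3)
    duration = 10 * complexity + rng.uniform(*noise["duration"])
    team_size = 5 * complexity + 3 * priority + rng.uniform(*noise["team_size"])
    implementation = (50 * complexity + 5 * team_size
                      + rng.uniform(*noise["implementation"]))
    return {"complexity": complexity, "priority": priority, "duration": duration,
            "team_size": team_size, "implementation": implementation}


VARIANCE = {}
for name, noise in NOISE.items():
    rng = random.Random(7)  # same seed for every setting, for comparability
    values = [sample_instance(rng, noise)["implementation"] for _ in range(20000)]
    VARIANCE[name] = statistics.pvariance(values)

for name, value in VARIANCE.items():
    print(f"{name}: variance of implementation phase duration = {value:.0f}")
```

The coefficients stay the same across the settings, so only the share of the variance explained by the causes shrinks as the noise ranges widen.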
7.3 Time and quality evaluation
-
Situation Feature Selection using Random Forest (SFSRF) and
-
Situation Feature Selection Based on Correlation (SFSBC).
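For intuition, a generic correlation-based selection can be sketched as follows. This is not the definition of SFSBC, whose details are not given here. It is a plain filter that keeps the situation features whose absolute Pearson correlation with the class situation feature reaches a threshold. The threshold of 0.5, the toy values, and the feature name “Weekday” are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for constant columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_by_correlation(table, class_feature, threshold=0.5):
    """Names of features with |correlation| >= threshold, strongest first."""
    target = table[class_feature]
    scores = {name: abs(pearson(column, target))
              for name, column in table.items() if name != class_feature}
    kept = [name for name, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda name: -scores[name])


# Toy situation feature table (column-wise); the values are made up.
table = {
    "Team size": [21, 33, 30, 12, 25],
    "Priority": [2, 1, 1, 3, 2],
    "Weekday": [3, 1, 5, 3, 3],
    "Duration": [245, 340, 310, 150, 270],
}

selected = select_by_correlation(table, "Duration")
print(selected)
```

A filter of this kind ranks features by association only, so a selected feature is a candidate for the causal analysis and not yet a cause.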
-
The receipt phase of an environmental permit application process (WABO) from the CoSeLoG project (receipt log for short), which has 1434 traces [32].
-
A subset of business process intelligence (BPI) challenge 2017 event log that includes traces of length at least 20 but at most 30. This event log has 11044 traces [33].
-
A subset of BPI challenge 2019 event log that includes traces of length at least 8 but at most 10. This event log has 12574 traces [34].
Event log | True causal structure: number of descriptive situation features | True causal structure: number of edges | True causal structure: number of edges of effective causal structure | True causal structure: number of parents | True causal structure: number of ancestors | SFSRF: number of recommended situation features | SFSRF: number of edges | SFSRF: parent recall | SFSRF: parent precision | SFSRF: causal relationship recall | SFSRF: causal relationship precision |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 19 | 41 | 7 | 2 | 6 | 2 | 3 | 0.5 | 0.33 | 0.28 | 0.67 |
2 | 19 | 40 | 11 | 5 | 6 | 5 | 6 | 0.6 | 1 | 0.27 | 0.5 |
3 | 19 | 32 | 12 | 6 | 8 | 6 | 6 | 0.33 | 0.67 | 0.41 | 0.83 |
4 | 19 | 34 | 15 | 4 | 8 | 3 | 4 | 0.2 | 0.5 | 0.13 | 0.5 |
5 | 19 | 30 | 11 | 4 | 5 | 3 | 2 | 0.25 | 0.5 | 0.09 | 0.5 |
6 | 19 | 34 | 6 | 2 | 4 | 3 | 3 | 1 | 1 | 0.5 | 1 |
7 | 19 | 37 | 5 | 3 | 4 | 4 | 6 | 1 | 0.5 | 0.6 | 0.5 |
8 | 19 | 38 | 11 | 5 | 6 | 4 | 6 | 0.5 | 0.5 | 0.36 | 0.67 |
9 | 19 | 39 | 9 | 3 | 6 | 4 | 3 | 0.33 | 0.33 | 0.11 | 0.33 |
10 | 19 | 35 | 6 | 2 | 5 | 3 | 6 | 0.33 | 0.33 | 0.17 | 0.17 |
Event log | SFSBC: number of recommended situation features | SFSBC: number of edges | SFSBC: parent recall | SFSBC: parent precision | SFSBC: causal relationship recall | SFSBC: causal relationship precision | SFVPR: number of recommended situation features | SFVPR: number of edges | SFVPR: parent recall | SFVPR: parent precision | SFVPR: causal relationship recall | SFVPR: causal relationship precision |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 5 | 6 | 1 | 0.67 | 0.28 | 0.33 | 4 | 5 | 1 | 1 | 0.71 | 1 |
2 | 5 | 6 | 0.6 | 1 | 0.27 | 0.5 | 7 | 8 | 0.4 | 0.67 | 0.18 | 0.25 |
3 | 6 | 6 | 0.33 | 0.67 | 0.41 | 0.83 | 8 | 10 | 0.33 | 0.67 | 0.58 | 0.7 |
4 | 3 | 4 | 0.2 | 0.5 | 0.5 | 0.5 | 8 | 14 | 0.4 | 0.5 | 0.47 | 0.5 |
5 | 5 | 4 | 0.5 | 0.5 | 0.5 | 0.75 | 4 | 4 | 0.25 | 0.5 | 0.18 | 0.5 |
6 | 3 | 3 | 1 | 1 | 1 | 1 | 7 | 10 | 1 | 0.67 | 0.83 | 0.5 |
7 | 5 | 7 | 1 | 1 | 0.5 | 0.57 | 7 | 10 | 1 | 1 | 0.6 | 0.3 |
8 | 4 | 6 | 0.2 | 0.5 | 0.36 | 0.67 | 10 | 12 | 0.8 | 1 | 0.72 | 0.67 |
9 | 4 | 3 | 0.33 | 0.33 | 0.11 | 0.33 | 6 | 10 | 0.33 | 0.5 | 0.56 | 0.5 |
10 | 4 | 9 | 0.5 | 0.33 | 0.33 | 0.22 | 6 | 12 | 1 | 0.67 | 0.67 | 0.36 |
-
rsf as the set of selected situation features.
-
ptcs as the set of parents of the class situation feature in the true causal structure.
-
pfcs as the set of potential parents of class situation feature in the PAG discovered using the trimmed situation feature table.
-
etcs as the set of causal relationships (edges) in the effective causal structure.
-
efcs as the set of potential causal relationships (edges regardless of their type) in the causal structure discovered using the trimmed situation feature table.
-
\(parent\ recall\): the portion of the parents of the class situation feature in the true causal structure that are also potential parents in the PAG discovered using the trimmed situation feature table.
-
\(parent\ precision\): the portion of the potential parents of the class situation feature in the PAG discovered using the trimmed situation feature table that are also parents of the class situation feature in the true causal structure.
-
\(causal\ relationship \ recall\): the portion of the causal relationships in the effective causal structure that are detected in the PAG discovered using the trimmed situation feature table.
-
\(causal\ relationship\ precision\): the portion of the potential causal relationships in the PAG discovered using the trimmed situation feature table that are also causal relationships in the effective causal structure.
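Read as set ratios, the four metrics can be computed as in the sketch below. It makes two assumptions that go beyond the definitions above: edges are compared as unordered pairs, following the phrase “edges regardless of their type”, and a ratio with an empty denominator is reported as 0. The function names and the toy sets are hypothetical.

```python
def ratio(numerator, denominator):
    """|numerator| / |denominator|, or 0.0 if the denominator is empty (assumed)."""
    return len(numerator) / len(denominator) if denominator else 0.0


def undirected(edges):
    """Drop edge direction and type by turning each pair into a frozenset."""
    return {frozenset(edge) for edge in edges}


def quality_metrics(ptcs, pfcs, etcs, efcs):
    """Compute the four metrics from the sets ptcs, pfcs, etcs, and efcs."""
    true_edges, found_edges = undirected(etcs), undirected(efcs)
    return {
        "parent recall": ratio(ptcs & pfcs, ptcs),
        "parent precision": ratio(ptcs & pfcs, pfcs),
        "causal relationship recall": ratio(true_edges & found_edges, true_edges),
        "causal relationship precision": ratio(true_edges & found_edges, found_edges),
    }


# Toy example with class situation feature Y; all names are made up.
metrics = quality_metrics(
    ptcs={"A", "B"},                                       # true parents of Y
    pfcs={"A", "C"},                                       # potential parents in the PAG
    etcs={("A", "Y"), ("B", "Y"), ("C", "A"), ("D", "B")},  # effective structure
    efcs={("A", "Y"), ("Y", "C"), ("A", "C")},              # edges found in the PAG
)

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this toy example, the edge between A and C counts as detected even though its direction differs, which is the consequence of ignoring edge types.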
-
In general, except for causal relationship precision, SFVPR achieved better results than SFSBC and SFSRF. Note, however, that no method achieved the best results in every experiment.
-
Considering causal relationship precision, SFVPR achieved weaker results than SFSBC and SFSRF. One explanation is that SFVPR recommends more situation features than the other two methods, which usually leads to more potential causal relationships in the discovered PAG. In addition, causal relationship precision is the portion of the potential causal relationships in the PAG discovered on the trimmed situation feature table that are also causal relationships in the effective causal structure. The effective causal structure includes only a subset of the causal relationships of the connected component of the true causal structure that includes the class situation feature. Many of the potential causal relationships in the PAG discovered on the situation feature table trimmed by SFVPR correspond to causal relationships that are in this connected component but not in the effective causal structure.
-
In none of the experiments does the set of situation features recommended by SFVPR include a situation feature outside the connected component of the true causal structure that includes the class situation feature. However, in three experiments, both SFSBC and SFSRF recommend situation features that do not belong to this connected component.