Fast detection of exact clones in business process model repositories
Introduction
Organizations engaged in long-term business process management programs need to deal with repositories of hundreds or even thousands of process models, with sizes ranging from dozens to hundreds of elements per model [1], [2]. For example, the SAP reference model contains over 600 business process models, while Suncorp, a large Australian insurer, maintains a repository of over 3000 process models [3]. Tool vendors nowadays distribute reference model repositories (e.g. the IT Infrastructure Library—ITIL) with over a thousand process models each. Such models are used to document and to communicate internal procedures, to guide the development of IT systems, or to support business improvement projects, among other uses.
While highly valuable, such collections of process models raise a maintenance challenge [4]. This challenge is amplified by the fact that process models are generally maintained and used by stakeholders with varying skills, responsibilities and goals, sometimes distributed across independent organizational units. One problem that arises as repositories grow is that of managing overlap across models. In particular, process model repositories tend to accumulate duplicate fragments over time, as new process models are created by copying and merging fragments from other models. Experiments conducted during this study have put into evidence a large number of clones in three process model repositories from industrial practice. This situation is akin to that observed in source code repositories, where significant amounts of duplicate code fragments – known as code clones – are accumulated over time [5].
Cloned fragments in process models raise several issues. Firstly, clones make individual process models larger than they need to be, thus affecting their comprehensibility. Secondly, clones are modified independently, sometimes by different stakeholders, leading to unwanted inconsistencies across models. Finally, process model clones hide potential efficiency gains. Indeed, by factoring out cloned fragments into separate subprocesses, and exposing these subprocesses as shared services, companies may reap the benefits of larger resource pools.
Detecting clones by comparing process models in a pairwise manner – using sub-graph isomorphism – would be inefficient in the context of repositories with hundreds of process models. Accordingly, indexes are needed to speed up the clone discovery process.
In this setting, this paper addresses the problem of retrieving all clones in a process model repository that can be refactored into shared subprocesses. Specifically, the contribution of the paper is an index structure, namely the RPSDAG, that provides operations for inserting and deleting models, as well as an operation for retrieving all clones in a repository that meet the following requirements:
- •
All retrieved clones must be single-entry, single-exit (SESE) fragments, since subprocesses are invoked according to a call-and-return semantics.
- •
All retrieved clones must be exact clones so that every occurrence can be replaced by an invocation to a single (shared) subprocess. While identifying approximate clones could be useful in some scenarios, approximate clones cannot be directly refactored into shared subprocesses, and thus fall outside the scope of this study.
- •
All retrieved clones are maximal, in the sense that each of them has multiple non-identical enclosing SESE regions. This “maximality” criterion is desirable because once we have identified a clone, every SESE fragment strictly contained in this clone is also a clone, but we do not wish to return all such non-maximal sub-clones.
- •
Retrieved clones must have at least two nodes (no “trivial” clones).
Identifying clones in a process model repository boils down to identifying fragments of a process model that are isomorphic to other fragments in the same or in another model. Hence, we need a method for decomposing a process model into fragments and a method for testing isomorphism between these fragments. Accordingly, the RPSDAG is built on top of two pillars: (i) a method for decomposing a process model into SESE fragments, namely the Refined Process Structure Tree (RPST) decomposition; and (ii) a method for calculating canonical codes for labeled graphs. These canonical codes reduce the problem of testing for graph isomorphism between a pair of graphs, to a string equality check. These two techniques however need some adaptations in order to fit the requirements of the RPSDAG. Firstly the RPST does not retrieve all possible SESE regions in a model. Some SESE regions are hidden inside others, like for example sub-sequences hidden inside larger sequences. Secondly, naive methods for calculating canonical codes for labeled graphs have problems scaling up, especially when there are multiple nodes with duplicate labels (or unlabeled nodes). Process models contain “gateways” and gateways are generally not labeled. In this paper, we propose several optimizations that take advantage of the specific structure of process models in order to scale up the calculation of canonical codes.
As a byproduct, we show how the RPSDAG can also be used to efficiently answer “fragment queries”, that is, queries aimed at retrieving all occurrences of a given model fragment in a repository.
The rest of the paper is structured as follows. Section 2 introduces the concepts of RPST and canonical code and discusses how they are used to address the problem at hand. Next, Section 3 describes the RPSDAG, including its insertion and deletion algorithms. This section also shows how the RPSDAG can be used to efficiently retrieve all occurrences of a given process model fragment in a repository. Section 4 presents an empirical evaluation of the RPSDAG. Finally, Section 5 discusses related work while Section 6 draws conclusions.
Section snippets
Background
This section introduces the two basic ingredients of the proposed technique: the Refined Process Structure Tree (RPST) and the code-based graph indexing.
Clone detection method
This section introduces the RPSDAG, including its underlying data structure, insertion and deletion algorithms and its support for “all-clone queries” and “fragment queries”.
Evaluation
This section reports on a series of tests to evaluate the performance of the RPSDAG as well as the usefulness of clone detection in practical settings.
Related work
This paper is an extended version of a previous conference paper [16]. The extensions include the ability to detect sub-bond clones and sibling clones, as well as the application of the RPSDAG to handle fragment queries. The experimental evaluation was extended to assess the performance of the extended RPSDAG and of the query processing, and to assess the usefulness of sub-bond and sibling clone detection. Below, we review related work along the following areas: (i) clone detection in software
Conclusion
We presented a technique to index process models in order to identify duplicate SESE fragments (clones) that can be refactored into shared subprocesses. The proposed index, namely the RPSDAG, combines a method for decomposing process models into SESE fragments (the RPST decomposition) with a method for generating a unique string to identify a labeled graph (canonical codes). Canonical codes are used to determine whether an SESE fragment in a model appears elsewhere in the same or in another
Acknowledgments
This research is funded by the Estonian Science Foundation and ERDF via the Estonian Centre of Excellence in Computer Science (Uba, Dumas and García-Bañuelos), and by ARC Linkage Grant “LP110100252” and NICTA Queensland Lab (La Rosa).
References (31)
- et al.
Improved model management with aggregated business process models
Data and Knowledge Engineering
(2009) - et al.
Managing large collections of business process models—current techniques and challenges
Computers in Industry
(2012) - et al.
The refined process structure tree
Data and Knowledge Engineering
(2009) - et al.
APROMORE: an advanced process model repository
Expert Systems with Applications
(2011) - et al.
Refactoring large process model repositories
Computers in Industry
(2011) - et al.
Identifying refactoring opportunities in process model repositories
Information & Software Technology
(2011) - et al.
Querying business processes with BP-QL
Information Systems
(2008) Potential pitfalls of process modeling: part A
Business Process Management Journal
(2006)- M. La Rosa, M. Dumas, R. Dijkman, Business process model merging: an approach to business process consolidation, ACM...
- R. Koschke, Identifying and removing software clones, in: Software Evolution, Springer, 2008, pp....