Elsevier

Information Systems

Volume 38, Issue 4, June 2013, Pages 619-633
Information Systems

Fast detection of exact clones in business process model repositories

https://doi.org/10.1016/j.is.2012.07.002Get rights and content

Abstract

As organizations reach higher levels of business process management maturity, they often find themselves maintaining very large process model repositories, representing valuable knowledge about their operations. A common practice within these repositories is to create new process models, or extend existing ones, by copying and merging fragments from other models. We contend that if these duplicate fragments, a.k.a. exact clones, can be identified and factored out as shared subprocesses, the repository's maintainability can be greatly improved. With this purpose in mind, we propose an indexing structure to support fast detection of clones in process model repositories. Moreover, we show how this index can be used to efficiently query a process model repository for fragments. This index, called RPSDAG, is based on a novel combination of a method for process model decomposition (namely the Refined Process Structure Tree), with established graph canonization and string matching techniques. We evaluated the RPSDAG with large process model repositories from industrial practice. The experiments show that a significant number of non-trivial clones can be efficiently found in such repositories, and that fragment queries can be handled efficiently.

Introduction

Organizations engaged in long-term business process management programs need to deal with repositories of hundreds or even thousands of process models, with sizes ranging from dozens to hundreds of elements per model [1], [2]. For example, the SAP reference model contains over 600 business process models, while Suncorp, a large Australian insurer, maintains a repository of over 3000 process models [3]. Tool vendors nowadays distribute reference model repositories (e.g. the IT Infrastructure Library—ITIL) with over a thousand process models each. Such models are used to document and to communicate internal procedures, to guide the development of IT systems, or to support business improvement projects, among other uses.

While highly valuable, such collections of process models raise a maintenance challenge [4]. This challenge is amplified by the fact that process models are generally maintained and used by stakeholders with varying skills, responsibilities and goals, sometimes distributed across independent organizational units. One problem that arises as repositories grow is that of managing overlap across models. In particular, process model repositories tend to accumulate duplicate fragments over time, as new process models are created by copying and merging fragments from other models. Experiments conducted during this study have put into evidence a large number of clones in three process model repositories from industrial practice. This situation is akin to that observed in source code repositories, where significant amounts of duplicate code fragments – known as code clones – are accumulated over time [5].

Cloned fragments in process models raise several issues. Firstly, clones make individual process models larger than they need to be, thus affecting their comprehensibility. Secondly, clones are modified independently, sometimes by different stakeholders, leading to unwanted inconsistencies across models. Finally, process model clones hide potential efficiency gains. Indeed, by factoring out cloned fragments into separate subprocesses, and exposing these subprocesses as shared services, companies may reap the benefits of larger resource pools.

Detecting clones by comparing process models in a pairwise manner – using sub-graph isomorphism – would be inefficient in the context of repositories with hundreds of process models. Accordingly, indexes are needed to speed up the clone discovery process.

In this setting, this paper addresses the problem of retrieving all clones in a process model repository that can be refactored into shared subprocesses. Specifically, the contribution of the paper is an index structure, namely the RPSDAG, that provides operations for inserting and deleting models, as well as an operation for retrieving all clones in a repository that meet the following requirements:

  • All retrieved clones must be single-entry, single-exit (SESE) fragments, since subprocesses are invoked according to a call-and-return semantics.

  • All retrieved clones must be exact clones so that every occurrence can be replaced by an invocation to a single (shared) subprocess. While identifying approximate clones could be useful in some scenarios, approximate clones cannot be directly refactored into shared subprocesses, and thus fall outside the scope of this study.

  • All retrieved clones are maximal, in the sense that each of them has multiple non-identical enclosing SESE regions. This “maximality” criterion is desirable because once we have identified a clone, every SESE fragment strictly contained in this clone is also a clone, but we do not wish to return all such non-maximal sub-clones.

  • Retrieved clones must have at least two nodes (no “trivial” clones).

Identifying clones in a process model repository boils down to identifying fragments of a process model that are isomorphic to other fragments in the same or in another model. Hence, we need a method for decomposing a process model into fragments and a method for testing isomorphism between these fragments. Accordingly, the RPSDAG is built on top of two pillars: (i) a method for decomposing a process model into SESE fragments, namely the Refined Process Structure Tree (RPST) decomposition; and (ii) a method for calculating canonical codes for labeled graphs. These canonical codes reduce the problem of testing for graph isomorphism between a pair of graphs, to a string equality check. These two techniques however need some adaptations in order to fit the requirements of the RPSDAG. Firstly the RPST does not retrieve all possible SESE regions in a model. Some SESE regions are hidden inside others, like for example sub-sequences hidden inside larger sequences. Secondly, naive methods for calculating canonical codes for labeled graphs have problems scaling up, especially when there are multiple nodes with duplicate labels (or unlabeled nodes). Process models contain “gateways” and gateways are generally not labeled. In this paper, we propose several optimizations that take advantage of the specific structure of process models in order to scale up the calculation of canonical codes.

As a byproduct, we show how the RPSDAG can also be used to efficiently answer “fragment queries”, that is, queries aimed at retrieving all occurrences of a given model fragment in a repository.

The rest of the paper is structured as follows. Section 2 introduces the concepts of RPST and canonical code and discusses how they are used to address the problem at hand. Next, Section 3 describes the RPSDAG, including its insertion and deletion algorithms. This section also shows how the RPSDAG can be used to efficiently retrieve all occurrences of a given process model fragment in a repository. Section 4 presents an empirical evaluation of the RPSDAG. Finally, Section 5 discusses related work while Section 6 draws conclusions.

Section snippets

Background

This section introduces the two basic ingredients of the proposed technique: the Refined Process Structure Tree (RPST) and the code-based graph indexing.

Clone detection method

This section introduces the RPSDAG, including its underlying data structure, insertion and deletion algorithms and its support for “all-clone queries” and “fragment queries”.

Evaluation

This section reports on a series of tests to evaluate the performance of the RPSDAG as well as the usefulness of clone detection in practical settings.

Related work

This paper is an extended version of a previous conference paper [16]. The extensions include the ability to detect sub-bond clones and sibling clones, as well as the application of the RPSDAG to handle fragment queries. The experimental evaluation was extended to assess the performance of the extended RPSDAG and of the query processing, and to assess the usefulness of sub-bond and sibling clone detection. Below, we review related work along the following areas: (i) clone detection in software

Conclusion

We presented a technique to index process models in order to identify duplicate SESE fragments (clones) that can be refactored into shared subprocesses. The proposed index, namely the RPSDAG, combines a method for decomposing process models into SESE fragments (the RPST decomposition) with a method for generating a unique string to identify a labeled graph (canonical codes). Canonical codes are used to determine whether an SESE fragment in a model appears elsewhere in the same or in another

Acknowledgments

This research is funded by the Estonian Science Foundation and ERDF via the Estonian Centre of Excellence in Computer Science (Uba, Dumas and García-Bañuelos), and by ARC Linkage GrantLP110100252” and NICTA Queensland Lab (La Rosa).

References (31)

  • A. Polyvyanyy, J. Vanhatalo, H. Völzer, Simplified computation and generalization of the refined process structure...
  • D.W. Williams, J. Huan, W. Wang, Graph database indexing using structured graph decomposition, in: Proceedings of the...
  • E. Ukkonen

    On-line construction of suffix trees

    Algorithmica

    (1995)
  • D. Gusfield

    Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology

    (1997)
  • L. Babai, Monte-Carlo Algorithms in Graph Isomorphism Testing, Technical Report, D.M.S. No. 79-10, Université de...
  • Cited by (0)

    View full text