Abstract

Business data has become a research frontier, exhibiting big data characteristics such as high volume, high velocity, and high privacy requirements. Most corporations view their business data as a valuable asset and invest in developing and making optimal use of it. Unfortunately, current data management technology lags behind the requirements of the business big data era. Based on existing business process knowledge, we model the lifecycle of business data to achieve a consistent description of data and processes. On this basis, we propose a business data partition method based on user interest, which aims to minimize the number of interferential tuples. Then, to balance data privacy against data transmission cost, we execute SQL queries over encrypted business data, split the query computation between the server and the client, and optimize the queries with a syntax tree. Finally, a case study is provided to verify the usefulness and feasibility of the proposed method.

1. Introduction

With the advent of Big Data, organizations in all sectors are increasingly focused on exploiting the data under their control to generate profit. Against this background, data resources are widely regarded as equal in status and value to mineral resources. In an enterprise-led dataspace, the data generated by business processes is the most significant factor affecting the performance of process execution. Because business processes are closely tied to an enterprise’s business strategy and market competitiveness, research on business data helps enterprises cope with the challenges brought by Big Data, predict and respond to potential business risks in a timely way, and identify business opportunities. Data management in business processes has recently become a research hotspot.

Business process execution usually involves large data transfers, which fall within the scope of Big Data. For example, China Unicom currently stores more than 2 trillion records per month, with a data volume of over 525 TB and a peak of 5 PB [1]. China UnionPay handles more than 60 billion transactions daily, so the generated data is exceptionally large. Google supports a great many services, processing over 20 petabytes (1 petabyte = 10^15 bytes) of data and monitoring 7.2 billion pages per day [2]. Since 2005, the NTDB (National Trauma Data Bank) has tracked and stored the records of more than half a million trauma patients, and many service retailers collect data from multiple sales channels, catalogs, stores, and online interactions, such as Client-Side Click-to-Action [3]. Hence, Big Data is ubiquitous (business process data arises in enterprises large and small) and grows exponentially, which poses huge challenges for data management. To address them, the first priority is to build an adaptive data model, which provides the basis and direction for efficient data acquisition. The next major issue is to devise a suitable query strategy for business data, which is a prerequisite for data processing and analysis.

Data modeling is the foundation of dataspace construction. Early research focused on modeling personal dataspaces [4, 5]. iDM (iMeMex data model) [6] is the first model able to represent all heterogeneous personal information in a single model. It takes a database approach and is therefore easy to understand, but it introduces a new query language, iQL, which is somewhat hard for ordinary users to learn. UDM (unified data model) [7] uses an integrated IR-DB approach; it can represent partial sections of a file but cannot support relational data queries. The triple model [8] represents heterogeneous data in triple form, a simple and flexible solution that nevertheless does not support path expression queries, uncertainty, or lineage queries. PDM (probabilistic semantic model) [9] supports top-k query answering, but reliable probability functions are difficult to obtain. All of the methods above target personal dataspaces. Unfortunately, there is little research on data modeling for today’s enterprise-led dataspace scenarios.

Query capability is the basis for exploiting the value of Big Data. The query language iQL [6] realizes rule-based query optimization but ignores the evaluation of optimization cost. UDM [7] introduces a new query language that extends SQL with some core operations, called TALZBRA operations. The triple model [8] supports a subject-predicate-object (SPO) query language that can be enhanced by an RDF-based query language. DSSP (dataspace support platforms) provides useful dataspace services, helps recognize correlations among the sources of a dataspace, and provides a basic query schema over these data sources. In an enterprise-led dataspace, business process data is the key element in data modeling and is characterized by large volume, strong temporal correlation, and a stable lifecycle. These characteristics pose a serious challenge for current query schemes.

Business data faithfully records the whole execution of a single task, including execution status, resource status and real-time usage, and correlation with other business process instances. Executing a business process generates additional data for a variety of reasons, such as monitoring for performance or business concerns, auditing, and compliance checking. Even business process schemas and enactments can be viewed as data so that they can be managed, queried, mined for process schemas, and analyzed [10]. An artifact is a widely recognized kind of business process data that represents a key business entity. The artifact-centric approach [11] is the representative method in data-centric business process management and has been applied in various client engagements, including finance [12], supply chain, retail [13], banking, pharmaceutical research [14], and cooperative work [15]. In this paper we first adopt the artifact as the basic element, analyze its evolution, and model business data through the corresponding artifact lifecycle. Second, we devise a safe and fast query strategy that takes into account the privacy and distributed storage of artifacts.

The rest of the paper is organized as follows. Section 2 introduces the concept of an artifact from the field of workflow management and models business data with its lifecycle from the process perspective. Section 3 proposes a business data partition method based on user interest, on top of which we present a cryptograph query for data stored off-site. Section 4 gives a detailed instance to verify the proposed method. Section 5 concludes the paper.

2. Business Data Modeling

As noted above, artifacts describe business-relevant data together with its lifecycle, which is an important property of business data: it describes the whole dynamic process that the data goes through and carries specific time information. To take advantage of these characteristics, we model business data with its lifecycle, aiming at a complete description of dynamic business data. In this section, we introduce artifact-related notions and adopt an artifact-centric process description method to model business data with the artifact lifecycle. We then use a business process logic model to illustrate the lifecycle of business data and measure the quality of that model.

2.1. Basic Definition

Definition 1. An artifact [16] is an objective data entity that records the business process. An artifact comprises both a unique immutable identity and self-describing mutable content.

Definition 2. An artifact lifecycle captures the end-to-end process of a specific artifact, from creation to completion and archiving.

Definition 3. Artiflow model (artifact logical flow) [17] is 5-tuple , where is the name of model, is a finite set of services, is a finite set of repositories, is a finite set of transport channels, and Ru is a finite set of business rules.

Definition 4. The states of artifact are a set, (conjunction expression), where is a mapping function that assigns a Boolean value to each single attribute in attribute set , ,  , and is the number of attributes in artifact. If the attribute is defined and has value, it will return 1; else it will return 0.
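As a minimal illustration of Definition 4, the sketch below treats an artifact as a dictionary of attributes and computes its state as one Boolean (0/1) per attribute. The function names `attribute_defined` and `artifact_state`, and the use of `None` to mean "undefined", are assumptions made for this example only; the attribute names are borrowed from the case study in Section 4.

```python
from typing import Any, Dict, Tuple


def attribute_defined(artifact: Dict[str, Any], attr: str) -> int:
    """Return 1 if the attribute is defined and has a value, else 0."""
    return 1 if artifact.get(attr) is not None else 0


def artifact_state(artifact: Dict[str, Any], attrs: Tuple[str, ...]) -> Tuple[int, ...]:
    """Map every attribute in `attrs` to 0/1; the resulting tuple is the state."""
    return tuple(attribute_defined(artifact, a) for a in attrs)


# Hypothetical artifact instance: the auditing comment has not been filled in yet.
ATTRS = ("EquipmentName", "PurchaseAmount", "AuditingComment")
epa = {"EquipmentName": "depthsounder", "PurchaseAmount": 2, "AuditingComment": None}
state = artifact_state(epa, ATTRS)
```

Here `state` marks the first two attributes as defined and the last one as undefined, which is the kind of information a service precondition or a repository reading condition can test.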

Definition 5. Service is 5-tuple (, , , , ), where is the name of a certain service, ; , are the finite set of artifact classes, where is a set of artifacts which the service is about to read and is a set of artifacts which the service is about to rewrite; is the description of artifact states inputted by ; is the description of activities on .

Definition 6. Repository is 4-tuple , where re is the name of repository; , are the set of stored and read artifacts, respectively; is the reading condition for .

Definition 7. A transport channel is a 2-tuple (Cn, Cs), where Cn is the name of the channel and Cs is a 3-tuple (prior service/repository name, rear service/repository name, channel type). , where , are the finite set of repository elements and service elements in Artiflow, respectively, and the set of transport channel types is described as {Read, ReadOnly, Write}.

2.2. Data Modeling with Artifact Lifecycle

As suggested in Definitions 2 and 3, Artiflow is a logical model that records the artifact lifecycle, in which repositories, services, artifact types, and transport channels are abstracted to represent a real business process. Artiflow views a business process as a graph whose nodes are either “service” or “repository.” We formalize Artiflow to facilitate data analysis and illustrate it graphically to facilitate process analysis. Figure 1 illustrates a quality inspection process instance of a certain enterprise, where the main artifact is the “monitoring information sheet.” The artifact captures the evolution of the inspected product from creation to archiving and includes all the business-relevant data in this process. The whole process comprises detection task registration, task assignment, task inspection, task auditing, and so forth. Note that within its lifecycle the artifact “monitoring information sheet” must coordinate with other artifacts such as “product standard” and “method & standard.” When “monitoring information sheet” completes its lifecycle, it serves as a reference for forming a new artifact, the “detection information sheet” (DIS, for short).

In this figure, there are nine services (“task assignment,” “auditing,” etc.), seven repositories (“assignment task library,” etc.), and a series of transport channels between these repositories and services.

2.3. Model Quality Evaluation

In practice, one business objective can be achieved by implementing different business processes, and each business process corresponds to a different Artiflow model. We measure an Artiflow on two factors: the number of services, which determines the flexibility of the model, and the number of repositories, through which artifacts are read and updated. In this context we define the following theorem to measure the quality of artifact models.

Theorem 8. Given an Artiflow , it has Artifacts where the number of attributes in any is . Suppose and represent the service amount and repository amount of corresponding , respectively; formula (1) is defined to calculate Artiflow’s web service granularity and repository service proportion, so as to measure the quality of models:where , , and are known.

Theorem Proving. For a given Artiflow (, , , , Ru), each comprises both a service sequence and a repository sequence, marked as , where service sequence is described as and repository sequence is described as . Each also contains attributes.

Suppose and represent the service amount and repository amount, respectively, then represents the granularity of services when dividing the whole lifecycle of artifact by its attribute number . A larger value indicates that the lifecycle is divided into more blocks and the granularity is finer, which contributes to building a more flexible model.

In Artiflow, each artifact is normally followed by a repository that stores its intermediate state, with the exception that some services can communicate with each other directly and need no intermediate repository. Therefore, for the same artifact, the fewer the repository elements are, the less redundancy there is. represents the proportion of repository elements among all service and repository elements within the corresponding artifact lifecycle. The smaller the value is, the better the designed lifecycle is.

The quality of is computed by the following formula:where and are predefined constants, which are used to balance the different magnitudes of the terms before and after the plus sign.

Each Artiflow comprises multiple artifacts, so the quality measurement formula for the whole Artiflow is , where is the number of artifacts and ; represents the importance of . Optimizing key artifacts has a great impact on the whole model, whereas optimizing less valuable artifacts contributes little to model efficiency. Note that can be either given by the user or obtained by data analysis.

By integrating both repository element redundancy and service element granularity, can be deduced and taken to measure the model quality.
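The two indicators described in the proof can be sketched as follows. This is only an illustration under stated assumptions: the text describes service granularity as the service count divided by the attribute count, and the repository proportion as the repository count over all service and repository elements, but the exact combined formula is not reproduced here. The function names, the example numbers, and the normalized weighted sum used to aggregate per-artifact scores are therefore assumptions, not the paper's formula.

```python
from typing import Sequence


def service_granularity(num_services: int, num_attributes: int) -> float:
    """Services per attribute; a larger value means a finer-grained lifecycle."""
    if num_attributes <= 0:
        raise ValueError("an artifact needs at least one attribute")
    return num_services / num_attributes


def repository_proportion(num_services: int, num_repositories: int) -> float:
    """Share of repositories among all service and repository elements."""
    total = num_services + num_repositories
    if total <= 0:
        raise ValueError("the lifecycle has no elements")
    return num_repositories / total


def weighted_model_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Assumed aggregation: importance-weighted average of per-artifact scores."""
    if len(scores) != len(weights) or not scores:
        raise ValueError("need one weight per artifact score")
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Hypothetical lifecycle: 9 services, 7 repositories, 12 attributes.
granularity = service_granularity(9, 12)
proportion = repository_proportion(9, 7)
```

Comparing `granularity` and `proportion` across alternative Artiflow designs for the same business objective gives a rough sense of which design is more flexible and which carries less repository redundancy.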

3. Business Data Querying

Enterprises such as Google and Amazon provide plenty of cloud services, which offer an open storage solution for data, including process data, all over the world. However, off-site storage, even in a public cloud, puts data privacy at risk. Such data therefore needs to be encrypted before it is stored in the database. It is hard to strike a balance between data security and query speed, because process data is frequently queried, modified, and transmitted. In this section we study how to partition encrypted artifacts and how to derive a query plan for cryptograph queries that minimizes execution cost.

3.1. Business Data Partition

To keep business processes efficient, a good data partition is needed. With the Bucket partition method, the result of a query on the cryptograph is actually a superset of the true result; it is produced by the relevant operators on the server and then filtered at the client after decryption. A good partition method therefore aims to minimize this extra work, for example by minimizing the number of interferential results.

3.1.1. Data Analysis

Definition 9 (Bucket [18]). Mapping the domain of attribute into another partitions set , where , , , each partition is named as a Bucket; is the Bucket number.

Definition 10 (the user interest on artifact). Querying on Artifact’s attribute of times, respectively, while represents any single result of queries that contains value , suppose is the frequency of occurring in trials, as increases, the frequency stabilizes at a certain value, which is expressed as . In other words, is the probability of artifact attribute emerged in query result dataset, called user interest.

Definition 11 (interferential artifact). An interferential artifact (Intf-Artifact) is an artifact that is not a correct result but belongs to the cryptograph query result , named as .
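Definitions 10 and 11 can be illustrated with a short sketch. It assumes that a query log is available as a flat list of the attribute values that appeared in query results, and it estimates user interest as the relative frequency of each value; the function names and the sample log are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Set


def estimate_user_interest(query_log: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Relative frequency of each value in the log, used as its user interest."""
    counts = Counter(query_log)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


def interferential_artifacts(returned: List[Hashable], true_result: Set[Hashable]) -> List[Hashable]:
    """Artifacts returned by the cryptograph query that are not true results."""
    return [artifact for artifact in returned if artifact not in true_result]


# Hypothetical log of eight queries and one bucket-level query result.
log = ["a1", "a2", "a1", "a3", "a1", "a2", "a1", "a1"]
interest = estimate_user_interest(log)
intf = interferential_artifacts(["a1", "a2", "a3"], {"a1"})
```

With a longer log the estimated frequencies settle toward stable values, which is the limiting behaviour that Definition 10 refers to.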

3.1.2. Min-Interference Partition

In a Bucket-based cryptograph partition, all artifacts in a Bucket correspond to the same index number. A cryptograph query returns all the encrypted artifacts in every Bucket that contains a true result. The remaining artifacts in such a Bucket are transmitted to the user as Intf-Artifacts and must then be deciphered and filtered out. Hence, the Bucket partition method determines the number of Intf-Artifacts, which in turn affects the query processing cost.

Suppose a cryptograph relation contains tuples and is a large integer; then we pose random queries. Totally, there are queries where their final query results are , and other tuples are returned as the provisional result. In this case the expectation of Intf-Artifact is .

There are tuples in the relation at all, and then the expectation of total Intf-Artifacts is

As for each Bucket containing different attribute values, its user interest is .

If the user interest on th artifact in a given Bucket is , (), then the number of Intf-Artifacts brought by above query is .

As for Bucket (), based on the user interest on artifact and the number of Intf-Artifacts in each Bucket, we can describe Bucket Intf-Artifact as follows:

From this we see that, for a fixed number of Buckets, the smaller the value of formula (5) is, the better the index is. A larger value incurs a heavy query cost and makes the Bucket partition inefficient. From a probabilistic point of view, a Bucket that holds an artifact with higher user interest should contain fewer artifacts. The user interest on each artifact should therefore be treated as a weight throughout the process. Moreover, when the index is being built, formula (5) is used to determine in which Bucket each artifact is stored, which helps to obtain an optimal partition.
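The idea of weighting the partition by user interest can be sketched with a small dynamic program. The cost function here is an assumption rather than formula (5) itself: each artifact is taken to be one tuple, a query for an artifact returns its whole Bucket, and so a Bucket of size k whose artifacts have total interest P is charged (k - 1) * P expected Intf-Artifacts. Buckets are assumed to be contiguous ranges over artifacts sorted by attribute value, and all names are hypothetical.

```python
from typing import List, Tuple


def min_interference_partition(interest: List[float], num_buckets: int) -> Tuple[float, List[List[int]]]:
    """Split positions 0..n-1 into contiguous buckets minimizing the assumed cost."""
    n = len(interest)
    if not 1 <= num_buckets <= n:
        raise ValueError("need between 1 and len(interest) buckets")

    # prefix[j] is the total interest of positions 0..j-1.
    prefix = [0.0]
    for p in interest:
        prefix.append(prefix[-1] + p)

    def cost(i: int, j: int) -> float:
        """Assumed expected Intf-Artifacts for a bucket holding positions i..j-1."""
        return (j - i - 1) * (prefix[j] - prefix[i])

    inf = float("inf")
    # best[k][j]: minimal cost of splitting the first j positions into k buckets.
    best = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    best[0][0] = 0.0
    for k in range(1, num_buckets + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = best[k - 1][i] + cost(i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    cut[k][j] = i

    # Walk the recorded cut points backwards to recover the buckets.
    buckets: List[List[int]] = []
    j = n
    for k in range(num_buckets, 0, -1):
        i = cut[k][j]
        buckets.append(list(range(i, j)))
        j = i
    buckets.reverse()
    return best[num_buckets][n], buckets


# Hypothetical interests for four artifacts sorted by attribute value.
total_cost, buckets = min_interference_partition([0.625, 0.125, 0.125, 0.125], 2)
```

In this example the artifact with the highest interest ends up alone in its Bucket, which matches the observation that Buckets holding high-interest artifacts should contain fewer artifacts.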

3.2. Business Data Query

The cloud service stores the encrypted artifact information and the corresponding index information, while other information, such as the partitioning of attributes and the mapping functions, is stored at the client. When a user issues a query request, the query is rewritten into its server-side cryptograph query , which is then executed on the cloud. The purpose of rewriting SQL queries is to split the query computation between the client and the cloud.
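The split between cloud and client can be sketched for a single range selection. Everything in this example is illustrative: the bucket boundaries and bucket ids are made up, the table layout is hypothetical, and base64 is used only as a stand-in for a real cipher (it provides no security).

```python
import base64
import json
from typing import Dict, List, Set, Tuple

# Hypothetical client-side index on `price`: bucket id -> half-open range [low, high).
PRICE_BUCKETS: Dict[int, Tuple[int, int]] = {7: (0, 500), 2: (500, 1000), 5: (1000, 1500)}


def toy_encrypt(row: dict) -> str:
    """Stand-in for real encryption; base64 is NOT secure."""
    return base64.b64encode(json.dumps(row).encode("utf-8")).decode("ascii")


def toy_decrypt(blob: str) -> dict:
    return json.loads(base64.b64decode(blob).decode("utf-8"))


def bucket_of(price: int) -> int:
    for bucket_id, (low, high) in PRICE_BUCKETS.items():
        if low <= price < high:
            return bucket_id
    raise ValueError(f"price {price} is outside the partitioned domain")


def store(rows: List[dict]) -> List[Tuple[str, int]]:
    """What the cloud holds: (ciphertext, bucket id) pairs only."""
    return [(toy_encrypt(row), bucket_of(row["price"])) for row in rows]


def buckets_below(value: int) -> Set[int]:
    """Client side: ids of the buckets that may contain a price < value."""
    return {b for b, (low, _high) in PRICE_BUCKETS.items() if low < value}


def server_select(table: List[Tuple[str, int]], bucket_ids: Set[int]) -> List[str]:
    """Cloud side: filter on the bucket index without seeing plaintext."""
    return [blob for blob, bucket_id in table if bucket_id in bucket_ids]


def client_select_less_than(blobs: List[str], value: int) -> List[dict]:
    """Client side: decrypt the superset and drop the interferential rows."""
    return [row for row in map(toy_decrypt, blobs) if row["price"] < value]


rows = [{"planeid": i, "price": p} for i, p in enumerate([300, 850, 950, 1200])]
cloud_table = store(rows)
candidates = server_select(cloud_table, buckets_below(900))
answer = client_select_less_than(candidates, 900)
```

The cloud returns three candidate rows for `price < 900`, and the client discards the one interferential row after decryption, so only the rewritten condition on bucket ids ever reaches the server.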

3.2.1. Basic Definitions

Definition 12. is a function which returns a set of all the Bucket ID where its right boundary value is not greater than when once partitioning Bucket; that is, .

Definition 13. is a function which returns a set of all the Bucket ID where its left boundary value is greater than when once partitioning Bucket; that is, .

Definition 14. is a function which returns a set of all the Bucket ID where its maximum artifact query probability is not greater than when twice partitioning Bucket; that is, .

Definition 15. is a function which returns a set of all the Bucket ID where its minimum artifact query probability is not less than when twice partitioning Bucket; that is, .

Definition 16. is a function that translates specific query conditions to encrypted ones.

Definition 17. Query rewriting function is described as , where is the original query and is the cryptograph query.

3.2.2. Query Rewriting Rules

In view of grammatical rules, query condition cond includes: , , , , where “:” is the operator, such as equal, less than, not greater than, greater than, and not less than. We list the rewrite formulas for various query conditions as shown in Formulas (6) to (8).

  :where both Map 2 and Map 4 are order preserving, and both Map3 and Map 5 are random.

  : where , , and .

When the condition is , in Map 1 is order preserving, while in Map 3 both and are order preserving. Meanwhile, in Map 2 is order preserving, and in Map 4 both and are random.

  :

For instance, suppose there are two artifact plaintext tables in cloud database, which are app (aid, aname, time, content, cid) check (cid, aid, result), respectively, where the range of attribute aid is divided into 6 partitions, including ; ; ; ; ; .

Given above partition results, we rewrite the following query conditions based on above formulas:

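$(\text{app.did} = \text{check.did}) \Rightarrow (\text{app.did} = 3 \wedge \text{check.did} = 2) \vee (\text{app.did} = 7 \wedge \text{check.did} = 2) \vee (\text{app.did} = 5 \wedge \text{check.did} = 6) \vee (\text{app.did} = 1 \wedge \text{check.did} = 6)$.

$(\text{app.did} < \text{check.did}) \Rightarrow (\text{app.did} = 3 \wedge \text{check.did} = 2) \vee (\text{app.did} = 3 \wedge \text{check.did} = 6) \vee (\text{app.did} = 7 \wedge \text{check.did} = 2) \vee (\text{app.did} = 7 \wedge \text{check.did} = 6) \vee (\text{app.did} = 5 \wedge \text{check.did} = 6) \vee (\text{app.did} = 1 \wedge \text{check.did} = 6)$.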
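The way such join conditions expand into pairs of Bucket ids can be sketched as follows. The partition boundaries below are hypothetical, chosen over an integer domain only so that the sketch yields the same id pairs as the two rewritten conditions above; the function names are likewise assumptions.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # half-open integer range [low, high)

# Hypothetical partitions: bucket id -> range of the join attribute.
APP_DID: Dict[int, Range] = {3: (0, 100), 7: (100, 200), 5: (200, 300), 1: (300, 400)}
CHECK_DID: Dict[int, Range] = {2: (0, 200), 6: (200, 400)}


def rewrite_equal(left: Dict[int, Range], right: Dict[int, Range]) -> List[Tuple[int, int]]:
    """Id pairs whose ranges overlap, so equal values may exist in both buckets."""
    return [
        (l_id, r_id)
        for l_id, (l_low, l_high) in left.items()
        for r_id, (r_low, r_high) in right.items()
        if max(l_low, r_low) < min(l_high, r_high)
    ]


def rewrite_less(left: Dict[int, Range], right: Dict[int, Range]) -> List[Tuple[int, int]]:
    """Id pairs where the smallest left value is below the largest right value."""
    return [
        (l_id, r_id)
        for l_id, (l_low, _l_high) in left.items()
        for r_id, (_r_low, r_high) in right.items()
        if l_low < r_high - 1
    ]


def to_condition(pairs: List[Tuple[int, int]], l_attr: str, r_attr: str) -> str:
    """Render the id pairs as a disjunction of conjunctions."""
    return " OR ".join(f"({l_attr} = {l} AND {r_attr} = {r})" for l, r in pairs)


equal_pairs = rewrite_equal(APP_DID, CHECK_DID)
less_pairs = rewrite_less(APP_DID, CHECK_DID)
server_condition = to_condition(equal_pairs, "app.did", "check.did")
```

Because the rewritten condition only mentions Bucket ids, the cloud can evaluate the join on the cryptograph, and the client re-checks the original comparison after decryption.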

3.2.3. Query Optimization Principles

Because the data is encrypted and stored in different places, we should run as many operations as possible on the cloud service to reduce transmission cost and improve query efficiency, so that the client can compute the final answers with little effort.

For clarity, the operation procedure is expressed as a syntax tree. The decryption operation splits the tree into cryptograph operations and plaintext operations. Because every operation on the original tree ends with a selection after decryption, the principle of query optimization on the syntax tree is to iteratively pull up the selections.

For example, given the query “SELECT chairman FROM Airway, Price, Plane WHERE price < 900 AND begin = “shanghai” AND end = “beijing” AND Price.planeid = Plane.planeid AND Plane.airway = Airway.airway”, we use a query tree to illustrate how this query is optimized step by step.

In Figure 2 the SQL statement is converted into an initial syntax tree. If the enterprise uses cloud services or other off-site storage platforms, the naive plan is to first decrypt the cryptograph and then query the data at the client, as shown in Figure 3, where the cryptograph database on the cloud is bounded by the dotted line. The query objects (Price, Plane, and Airway) are converted to their cryptograph counterparts in the cloud database.

Operations on the syntax tree are performed from the bottom up. In Figure 3 the first step is to execute the selections; each selection condition is rewritten and converted into a selection on the cryptograph in the cloud database, and the result is then decrypted and further filtered at the client. The resulting syntax tree is shown in Figure 4.

According to the optimization principle described above, we iteratively pull up selections. By exchanging the positions of the selection operations (price < 900, begin = “shanghai”, and end = “beijing”) and the join operation and then combining the corresponding conditions, we obtain a new syntax tree, as shown in Figure 5.

Moreover, based on the operation rewriting rules, the join operation in Figure 5 is converted into two parts: a join on the cryptograph in the cloud database and a selection on the decrypted provisional results, as shown in Figure 6. We repeat the above steps, rewriting each kind of operation and exchanging the positions of selection operations and other operations, until no selection can be pulled up any further. As a result, we get the final syntax tree shown in Figure 7. The operations within the dotted line are executed on the cloud service, whereas the user only needs to execute the last selection. The method thus takes full advantage of the cloud service to reduce the cost of transmission and postprocessing and to improve the efficiency of artifact querying in business processes.
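The effect of pulling up selections can be sketched on a tiny plan representation. This is a simplified illustration, not the paper's procedure: plans are nested tuples, `map_cond` is a placeholder that only marks where a rewritten condition would go, and the assignment of `price` to the Price table and of `begin`/`end` to the Plane table is an assumption about the schema.

```python
from typing import List, Tuple


def map_cond(cond: str) -> str:
    """Placeholder for the condition rewriting; a real one would emit bucket ids."""
    return f"Map({cond})"


def split(plan: tuple) -> Tuple[tuple, List[str]]:
    """Return the cloud-side plan plus the plaintext conditions left for the client."""
    kind = plan[0]
    if kind == "table":
        return ("enc_table", plan[1]), []
    if kind == "select":
        server, pending = split(plan[2])
        return ("select", map_cond(plan[1]), server), pending + [plan[1]]
    if kind == "join":
        left, left_pending = split(plan[2])
        right, right_pending = split(plan[3])
        server = ("join", map_cond(plan[1]), left, right)
        return server, left_pending + right_pending + [plan[1]]
    raise ValueError(f"unsupported operator: {kind}")


def optimize(plan: tuple) -> tuple:
    """Run everything on the cryptograph, then one selection after decryption."""
    attrs = None
    if plan[0] == "project":
        attrs, plan = plan[1], plan[2]
    server, pending = split(plan)
    client = ("select", " AND ".join(pending), ("decrypt", server))
    return ("project", attrs, client) if attrs is not None else client


# The example query as an initial plan (schema assignment is assumed).
price_sel = ("select", "price < 900", ("table", "Price"))
plane_sel = ("select", "begin = 'shanghai' AND end = 'beijing'", ("table", "Plane"))
join1 = ("join", "Price.planeid = Plane.planeid", price_sel, plane_sel)
join2 = ("join", "Plane.airway = Airway.airway", join1, ("table", "Airway"))
optimized = optimize(("project", ["chairman"], join2))
```

In `optimized`, every selection and join below the `decrypt` node carries a rewritten condition and runs on the cloud, while the client keeps a single combined selection followed by the projection.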

4. Case Study

In this section, we introduce a business instance from a certain enterprise. Using the method of Section 2, we model the data of a given process instance with the artifact lifecycle, and we illustrate query processing with the query tree described in Section 3.

An enterprise’s equipment purchase/scrap process involves the following steps. First, the equipment division fills out an equipment purchase/scrap application and submits it to the department managers and the company’s leadership for approval. If the application is approved, it is archived; otherwise it is withdrawn. The purchasing department makes the purchase according to a copy of the application, and when the purchase is completed, the documents are archived. The equipment division scraps equipment according to specific methods and standards and then archives the processing results. The assets department regularly verifies the company’s assets against the purchased/scrapped equipment information. The archive department has permission to query all the archived information.

This process involves multiple departments and multiple sets of information. If we manage the data alone, the business data is complicated, and even a single attribute takes different values in different events, so it is difficult to manage. If we manage the process alone, only the department activities are covered and the business data in the process is ignored. We therefore analyze this instance with respect to both data and process and describe it with an Artiflow (, , , , Ru), where: “EP/”;: {FilloutEPA, Audit1, Audit2, Query, Purchase, Asset Verification…};: {NewEPA, PrimaryEPA, FinalEPA, Unaproved EPA, PO, FAL…};: {FtoN, NtoA,…};Ru: {constraint (EPA) = (FilloutEPA, Audit1, Audit2)…}.

The model contains multiple artifacts. “EPA” describes the equipment purchase application; its lifecycle runs from filling and auditing to archiving. Once archived, it supports asset verification and query processing. “ESA” describes the equipment scrap application; its lifecycle also runs from filling and auditing to archiving, and it is associated with another artifact called “method & standard.” The lifecycle of “PO”/“SL” captures the process from purchase/scrap to application archiving. The whole process is shown in Figure 8.

Artifact Example. Artifact: (, , , , , ) = “EPA”;: {EquipmentName, PurchaseAmount, UnitPrice, ApplicationDate, Applicant, AuditingComment, AuditingDate};: {empty table (initial state ), basic information filling, delivery auditing, auditing completion, audited application archiving (terminate state )};: EquipmentName: Varchar; PurchaseAmount: Int; UnitPrice: Int; ApplicationDate: Date; Applicant: Varchar; AuditingComment: Varchar; AuditingDate: Date.

Service Example. service = (, , , , ), where: = “Audit2”;: {EPA};: {EPA};: DEFINED (EquipmentName) ∧ DEFINED (PurchaseAmount) ∧ DEFINED (UnitPrice) ∧ DEFINED (ApplicationDate) ∧ DEFINED (Applicant) ∧ DEFINED (AuditingComment) ∧ DEFINED (AuditingDate).: DEFINED (EquipmentName) ∧ DEFINED (PurchaseAmount) ∧ DEFINED (UnitPrice) ∧ DEFINED (ApplicationDate) ∧ DEFINED (Applicant) ∧ DEFINED (AuditingComment) ∧ DEFINED (AuditingDate).

Repository Example. re: “FinalEPA”;: {EPA};: {EPA};: IsDefine(AuditingComment).

There is a repository named “FinalEPA,” which reads and stores artifact “EPA” only if “AuditingComment” has been assigned.

Given a query “SELECT app_name FROM FAL, PO, FinalEPA WHERE FAL.name = ‘depthsounder’ AND FAL.POid = PO.Id AND PO.EPAid = FinalEPA.Id AND FinalEPA.Audit2 = ‘Doctor Li’”, it can be converted into a syntax tree as shown in Figure 9, which can be further converted into the optimized syntax tree shown in Figure 10. Queries are then issued according to this syntax tree.

5. Conclusion

There is no doubt that ever larger datasets will be produced during business process execution, and that this business data is extremely valuable. We therefore modeled business data through its lifecycle from the process perspective, which preserves the integrity of dynamic business data. Furthermore, we introduced the notion of user interest on business data, which is used to minimize the number of interferential tuples during data partition and thereby lowers the postprocessing cost that partitioning incurs. Considering that business data is now mostly stored on the cloud, we proposed a query rewriting strategy for off-site encrypted data that reduces postprocessing cost. There is currently little research on business data modeling and querying in the true sense, and our work lays a foundation for the application of business data in enterprises. This is an initial step toward a business data management architecture; in future work we will study business data analysis based on its lifecycle in order to fully exploit the value of business data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by National Natural Science Foundation of China (61272098) and Science and Technology Development Foundation of Shanghai Ocean University.