
Journal of Cloud Computing


Open Access 01.12.2014 | Research

Clustering-based fragmentation and data replication for flexible query answering in distributed databases

Author: Lena Wiese

Published in: Journal of Cloud Computing | Issue 1/2014


Abstract

One feature of cloud storage systems is data fragmentation (or sharding) so that data can be distributed over multiple servers and subqueries can be run in parallel on the fragments. On the other hand, flexible query answering can enable a database system to find related information for a user whose original query cannot be answered exactly. Query generalization is a way to implement flexible query answering on the syntax level. In this paper we study a clustering-based fragmentation for the generalization operator Anti-Instantiation with which related information can be found in distributed data. We use a standard clustering algorithm to derive a semantic fragmentation of data in the database. The database system uses the derived fragments to support an intelligent flexible query answering mechanism that avoids overgeneralization but supports data replication in a distributed database system. We show that the data replication problem can be expressed as a special Bin Packing Problem and can hence be solved by an off-the-shelf solver for integer linear programs. We present a prototype system that makes use of a medical taxonomy to determine similarities between medical expressions.


Electronic supplementary material

The online version of this article (doi:10.1186/s13677-014-0018-0) contains supplementary material, which is available to authorized users.

Competing interests

The author declares that she has no competing interests.

Introduction

In the era of “big data” huge data sets usually cannot be stored on a single server any longer. Cloud storage (where data are stored in a cloud infrastructure) offers the advantage of flexibly adapting the amount of used storage based on the growing or shrinking storage demands of the data owners. In a cloud storage system, a distributed database management system (DDBMS) can be used to manage the data in a network of servers. This allows for load balancing (data can be distributed according to the capacities of servers) and higher availability (servers can process user requests in parallel). In particular, when data are distributed over a wider area in different data centers, it is important that only few servers have to be contacted to answer user queries in order to reduce network delays; in the ideal case, these servers are also geographically close to the user.

Depending on the data structure used in the DDBMS a variety of distribution models are possible. For relational data, the theory of fragmentation has a long history (see for example [1]) and several procedures have been analyzed for splitting tabular data into fragments and subsequently assigning fragments to servers. Other database systems with key-based access (like key-value stores, document databases, or column family stores) use range-based partitioning or consistent hashing to distribute data.

On the other hand, flexible query answering offers mechanisms to intelligently answer user queries going beyond conventional exact query answering. If a database system is not able to find an exactly matching answer, the query is said to be a failing query. Conventional database systems usually return an empty answer to a failing query. In most cases, this is an undesirable situation for the user, because he has to revise his query and send the revised query to the database system in order to get some information from the database. In contrast, flexible query answering systems internally revise failing user queries themselves and - by evaluating the revised query - return answers to the user that are more informative for the user than just an empty answer. Query generalization is one way to implement flexible query answering.

This paper revises and extends the previous results presented in [2]. In this paper we make the following additional contributions:

We study how a standard clustering heuristic on a single relaxation attribute (that is, table column) can induce a horizontal fragmentation of a database table; in [2] a taxonomy-based fragmentation was used instead of a clustering-based fragmentation.
We formally study the data replication problem for these fragments by representing it as a variant of the bin packing problem and solve it using an integer linear programming solver. This was not discussed in [2].
We present a detailed query rewriting and query redirecting method that allows access to the distributed fragments. This was discussed in [2] only briefly.

The paper is organized as follows. Section Background provides background on data fragmentation, query generalization (in particular anti-instantiation) and data replication. Section Clustering-based fragmentation presents the main contribution on clustering-based fragmentation and its management with a lookup table; whereas Section Query rewriting talks about how to decompose a query to be distributed among the servers. Section Improving data locality with derived fragmentations extends the basic approach by allowing derived fragmentation in order to facilitate joins over multiple tables. Section Implementation and example presents the components of our prototype implementation. Section Related work surveys related work and Section Discussion and conclusion concludes the paper.

Background

In the following subsections we present prior work on data fragmentation, flexible query answering (with a focus on anti-instantiation) and data replication. These three techniques will be combined to obtain an intelligent distributed database system that can autonomously configure its replication mechanism while at the same time support users in finding relevant information by flexible query answering.

Data fragmentation

As the basic data model, we consider the case of data stored in relational tables. The relational data model is still widely applied today although alternatives exist (like tree- or graph-structured data or data stored in a simple key-value format).

Example 1.

As a running example, we consider a hospital information system that stores illnesses and treatments of patients as well as their personal information (like address and age) in the following three database tables:

https://static-content.springer.com/image/art%3A10.1186%2Fs13677-014-0018-0/MediaObjects/13677_2014_Article_18_Equf_HTML.gif

In relational database theory, several alternatives of splitting tables into fragments have been discussed (see for example [1]), for example:

Vertical fragmentation: Subsets of attributes (that is, columns) form the fragments. Rows of the fragments that correspond to each other have to be linked by a tuple identifier. A vertical fragmentation corresponds to projection operations on the table.
Horizontal fragmentation: Subsets of tuples (that is, rows) form the fragments. A horizontal fragmentation can be expressed by a selection condition on the table.
Derived fragmentation: A given horizontal fragmentation on a primary table (the primary fragmentation) induces a horizontal fragmentation of another table based on the semijoin with the primary table. In this case, the primary and derived fragments with matching values for the join attributes can be stored on the same server; this improves efficiency of a join on the primary and the derived fragments.

The following three properties are considered the important correctness properties of a fragmentation:

Completeness: No data should be lost during fragmentation. For vertical fragmentation, each column can be found in some fragment; in horizontal fragmentation each row can be found in a fragment.
Reconstructability: Data from the fragments can be recombined to result in the original data set. For vertical fragmentation, the join operator is used on the tuple identifier to link the columns from the fragments; in horizontal fragmentation, the union operator is used on the rows coming from the fragments.
Non-redundancy: To avoid duplicate storage of data, data should be uniquely assigned to one fragment. In vertical fragmentation, each column is contained in only one fragment (except for the tuple identifier that links the fragments); in horizontal fragmentation, each row is contained in only one fragment.

In this paper we will compute semantically-guided horizontal fragmentations of a primary table. Each of these fragmentations will be based on clustering an attribute for which values should be relaxed to allow for flexible query answering. In contrast to the conventional applications of fragmentation, the clustering-based fragmentations will support flexible query answering in an efficient manner.

For other tables (those that can be joined with the primary table) a derived fragmentation will be computed that allows for data locality in a distributed database system.

Anti-instantiation

In this paper we focus on flexible query answering for conjunctive queries expressed as logical formulas. That is, we assume a logical language L consisting of a finite set of predicate symbols (denoting the table names; for example, Ill, Treat or P), a possibly infinite set dom of constant symbols (denoting the values in table cells; for example, Mary or a), and an infinite set of variables (x or y). A term is either a constant or a variable. The capital letter X denotes a vector of variables; if the order of variables in X does not matter, we identify X with the set of its variables and apply set operators - for example we write y ∈ X. We use the standard logical connectors conjunction ∧, disjunction ∨, negation ¬ and material implication →, as well as universal ∀ and existential ∃ quantifiers. An atom is a formula consisting of a single predicate symbol only; a literal is an atom (a “positive literal”) or a negation of an atom (a “negative literal”); a clause is a disjunction of atoms; a ground formula is one that contains no variables; the existential (universal) closure of a formula ϕ is written as ∃ϕ (∀ϕ) and denotes the closed formula obtained by binding all free variables of ϕ with the respective quantifier.

A query formula Q is a conjunction of literals with some variables X occurring freely (that is, not bound by quantifiers); that is, Q(X) = L_{i_1} ∧ … ∧ L_{i_n}. By abuse of notation, we will also write L_{i_j} ∈ Q when L_{i_j} is a conjunct in formula Q. A query Q(X) is sent to a knowledge base Σ (a set of logical formulas) and then evaluated in Σ by a function ans that returns a set of answers containing instantiations of the free variables (in other words, a set of formulas that are logically implied by Σ); as we focus on the generalization of queries, we assume the ans function and an appropriate notion of logical truth given. A special case of a knowledge base can be a relational database with database tables as in Example 1.

Example 2.

Query Q(x₁,x₂,x₃) = Ill(x₁,Flu) ∧ Ill(x₁,Cough) ∧ Info(x₁,x₂,x₃) asks for all the patient IDs x₁ as well as names x₂ and addresses x₃ of patients that suffer from both flu and cough. This query fails with the given database tables as there is no patient with both flu and cough. However, the querying user might instead be interested in the patient called Mary who is ill with both flu and asthma. Query generalization will enable an intelligent database system to find this informative answer.

As in [3] we apply a notion of generalization based on a model operator ⊨.

Definition 1 (Deductive generalization [3]).

Let Σ be a knowledge base, ϕ(X) be a formula with a tuple X of free variables, and ψ(X,Y) be a formula with an additional tuple Y of free variables disjoint from X. The formula ψ(X,Y) is a deductive generalization of ϕ(X), if it holds in Σ that the less general ϕ implies the more general ψ where for the free variables X (the ones that occur in ϕ and possibly in ψ) the universal closure and for the free variables Y (the ones that occur in ψ only) the existential closure is taken:

$$\Sigma \models \forall X\, \exists Y\, (\phi(X) \rightarrow \psi(X, Y))$$

The CoopQA system [4] applies three generalization operators to a conjunctive query (which - among others - can already be found in the seminal paper of Michalski [5]): Dropping Condition (DC) removes one conjunct from a query; Anti-Instantiation (AI) replaces a constant (or a variable occurring at least twice) in Q with a new variable y; Goal Replacement (GR) takes a rule from Σ, finds a substitution θ that maps the rule’s body to some conjuncts in the query and replaces these conjuncts by the head (with θ applied). In this paper we focus only on the AI operator.

Example 3.

For query Q(x₁,x₂,x₃) = Ill(x₁,Flu) ∧ Ill(x₁,Cough) ∧ Info(x₁,x₂,x₃) an example generalization with AI is Q^{AI}(x₁,x₂,x₃,y) = Ill(x₁,Flu) ∧ Ill(x₁,y) ∧ Info(x₁,x₂,x₃). A non-empty (and hence informative) answer Ill(2748,Flu) ∧ Ill(2748,Asthma) ∧ Info(2748,Mary,'New Str 3, Newtown') is returned, saying that Mary suffers from flu and asthma at the same time. However, another obtained answer is Ill(2748,Flu) ∧ Ill(2748,brokenLeg) ∧ Info(2748,Mary,'New Str 3, Newtown'), saying that Mary suffers from flu and a broken leg.

AI applies to constants and to variables and covers these special cases:

turning constants into variables: P(a) is converted to P(x) (see [5])
breaking joins: P(x) ∧ S(x) is converted to P(x) ∧ S(y) (introduced in [3])
naming apart variables inside atoms: P(x,x) is converted to P(x,y)

For each constant a all occurrences can be anti-instantiated one after the other; the same applies to variables x - however, with the exception that if x only occurs twice, one occurrence of x need not be anti-instantiated due to equivalence. For logical queries, anti-instantiation can be implemented as shown in Listing 1.

https://static-content.springer.com/image/art%3A10.1186%2Fs13677-014-0018-0/MediaObjects/13677_2014_Article_18_Equg_HTML.gif
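The constant-to-variable case of AI can be illustrated with a minimal Python sketch. It is not a transcription of Listing 1; the representation of queries as lists of (predicate, arguments) pairs, the class `Var`, the function name `anti_instantiate_constants` and the fresh variable name `y` are assumptions made only for this illustration. Joins and repeated variables are deliberately not handled.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple, Union


@dataclass(frozen=True)
class Var:
    """A logical variable; plain strings stand for constants (assumed encoding)."""
    name: str


Term = Union[Var, str]
Atom = Tuple[str, Tuple[Term, ...]]  # (predicate symbol, arguments)


def anti_instantiate_constants(query: List[Atom], fresh: str = "y") -> Iterator[List[Atom]]:
    """Yield one generalized query per constant occurrence in `query`.

    `fresh` is assumed not to occur in the query already.
    """
    for i, (pred, args) in enumerate(query):
        for j, term in enumerate(args):
            if isinstance(term, Var):
                continue  # only the constant-to-variable case is covered here
            new_args = args[:j] + (Var(fresh),) + args[j + 1:]
            yield query[:i] + [(pred, new_args)] + query[i + 1:]


# The query of Example 3: Ill(x1, Flu) AND Ill(x1, Cough) AND Info(x1, x2, x3)
x1, x2, x3 = Var("x1"), Var("x2"), Var("x3")
Q = [("Ill", (x1, "Flu")), ("Ill", (x1, "Cough")), ("Info", (x1, x2, x3))]
generalized = list(anti_instantiate_constants(Q))
```

Under these assumptions the second element of `generalized` corresponds to the generalization of Example 3, in which the constant Cough is replaced by the new variable y.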

In this paper, we focus on the first application of anti-instantiation: turning constants into variables. In the following section, we present an approach that identifies those tuples in a relational table that are good candidates for answers to such an anti-instantiated query; these candidates are put into one fragment for storage in a distributed database system.

Data replication

To achieve fault tolerance, reliability and high availability, data in a distributed database system should be copied (that is, replicated) to different servers. Whenever one of the database servers fails, is overloaded, or is geographically too far away from the requesting user, a data copy (that is, a replica) can be retrieved from one of the other servers.

The data replication problem (DRP; see [6]) is a formal description of the task of distributing copies of data records (that is, database fragments) among a set of servers in a distributed database system. The data replication problem is basically a Bin Packing Problem (BPP) in the following sense:

K servers correspond to K bins
bins have a maximum capacity W
n fragments correspond to n objects
each object has a weight (a capacity consumption) w_i ≤ W
objects have to be placed into a minimum number of bins without exceeding the maximum capacity

This BPP can be written as an integer linear program (ILP) as follows - where x_ik is a binary variable that denotes whether fragment/object i is placed in server/bin k; and y_k denotes that server/bin k is used (that is, is non-empty):

$$\text{minimize} \quad \sum_{k=1}^{K} y_k \tag{1}$$

$$\text{s.t.} \quad \sum_{k=1}^{K} x_{ik} = 1, \qquad i = 1, \dots, n \tag{2}$$

$$\sum_{i=1}^{n} w_i x_{ik} \leq W y_k, \qquad k = 1, \dots, K \tag{3}$$

$$y_k \in \{0, 1\}, \qquad k = 1, \dots, K \tag{4}$$

$$x_{ik} \in \{0, 1\}, \qquad k = 1, \dots, K,\; i = 1, \dots, n \tag{5}$$

To explain, Equation 1 means that we want to minimize the number of servers/bins used; Equation 2 means that each object is assigned to exactly one bin; Equation 3 means that the capacity of each server is not exceeded; and the last two equations denote that the variables are binary - that is, the ILP is a so-called 0-1 linear program.

An extension of the basic BPP will be used to ensure that replicas will be placed on distinct servers: the Bin Packing with Conflicts (BPPC; [7]-[9]) problem allows constraints to be expressed on pairs of objects that should not be placed in the same bin. That is, one adds a conflict graph G=(V,E) where the node set V={1,…,n} corresponds to the set of objects. A binary edge e=(i,j) exists whenever the two incident nodes i and j must not be placed in the same bin; note that (i,j) is meant to be undirected and hence identical to (j,i). In the ILP representation, a further constraint is added to avoid conflicts in the placements.

$$\text{minimize} \quad \sum_{k=1}^{K} y_k \tag{6}$$

$$\text{s.t.} \quad \sum_{k=1}^{K} x_{ik} = 1, \qquad i = 1, \dots, n \tag{7}$$

$$\sum_{i=1}^{n} w_i x_{ik} \leq W y_k, \qquad k = 1, \dots, K \tag{8}$$

$$x_{ik} + x_{jk} \leq y_k, \qquad (i, j) \in E,\; k = 1, \dots, K \tag{9}$$

$$y_k \in \{0, 1\}, \qquad k = 1, \dots, K \tag{10}$$

$$x_{ik} \in \{0, 1\}, \qquad k = 1, \dots, K,\; i = 1, \dots, n \tag{11}$$

Equation 9 ensures that no conflicting objects i and j are placed in the same bin k because otherwise the sum of the two x-variables x_ik and x_jk would be 2 and hence exceed y_k which is 1.
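To make the constraints concrete, the following Python sketch enumerates all placements of a tiny, hypothetical instance and keeps one that uses the fewest bins. It is an exhaustive search rather than an ILP solver, so it only serves to illustrate what Equations 6 to 9 require; the function name `solve_bppc` and the instance data (weights, capacity, conflict edge) are assumptions made for this example.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple


def solve_bppc(
    weights: Sequence[int],
    capacity: int,
    conflicts: Sequence[Tuple[int, int]],
    num_bins: int,
) -> Optional[Tuple[int, Tuple[int, ...]]]:
    """Return (bins used, assignment) with a minimal number of bins, or None.

    assignment[i] = k plays the role of x_ik = 1. Exponential running time:
    usable only for very small instances.
    """
    best = None
    # Every tuple assigns each object to exactly one bin (Equation 7).
    for assignment in product(range(num_bins), repeat=len(weights)):
        loads: List[int] = [0] * num_bins
        for i, k in enumerate(assignment):
            loads[k] += weights[i]
        if any(load > capacity for load in loads):
            continue  # violates the capacity constraint (Equation 8)
        if any(assignment[i] == assignment[j] for i, j in conflicts):
            continue  # violates a conflict constraint (Equation 9)
        used = len(set(assignment))  # objective value (Equation 6)
        if best is None or used < best[0]:
            best = (used, assignment)
    return best


# Hypothetical instance: four objects, capacity W = 5, objects 0 and 1 in conflict.
result = solve_bppc(weights=[3, 3, 2, 2], capacity=5, conflicts=[(0, 1)], num_bins=4)
```

For realistic numbers of fragments and servers the enumeration is infeasible, which is why an ILP solver is the appropriate tool for the formulation above.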

In this paper, we will extend the BPPC to ensure that a certain replication factor m for each fragment of the relational table is obeyed; that is, for each fragment stored at one server there are at least m - 1 other servers storing a copy of this fragment, too.

Clustering-based fragmentation

We now present our intelligent fragmentation and replication procedure that will support flexible query answering with anti-instantiation.

The anti-instantiation operator as stated above is a purely syntactic operator. For the application of turning constants into variables, any constant can be inserted in the answer. This syntactic operator is oblivious of whether the obtained answer is semantically close to the replaced constant in the original query or not. For example in Example 3, the two diseases cough and asthma are semantically closer to each other than the two diseases cough and broken leg. That is, the generalization operators can sometimes lead to overgeneralization where the generalized queries (and hence the obtained answers) are too far away from the user’s original query intention. To avoid this overgeneralization and the overabundance of answers, a semantic guidance has to be added to the process. This semantic guidance can for example be given by a taxonomy on constants.

As an extension to [2], we present a clustering heuristic for attributes on which anti-instantiation should be applied. We call such an attribute a relaxation attribute. The domain of an attribute is the set of values that the attribute may range over, whereas the active domain is the set of values actually occurring in a given table. For a given table instance F (a set of tuples ranging over the same attributes) and a relaxation attribute A, the active domain can be obtained by a projection π to A on F: π_A(F). In our example the relaxation attribute is the attribute Diagnosis in table Ill. From a semantic point of view, the domain of Diagnosis is the set of strings that denote a disease; the active domain is the set of terms {Cough, Flu, Asthma, brokenArm, brokenLeg}.

Wiese 2013 [2] assumes a tree-shaped taxonomy on the active domain of a relaxation attribute where the active domain values can be found in the leaf nodes, connected by some intermediary nodes serving as a classification of the values. As an alternative, in this paper we only rely on the specification of a similarity value sim(a,b) between any two values a and b in the active domain of a relaxation attribute. These similarity values, however, can indeed be calculated by using a taxonomy; we will briefly survey some such similarity measures below when describing the prototype. Based on this similarity specification, we derive a clustering of the active domain of each relaxation attribute A in a relation instance F. We rely on a very general definition of a clustering as being a set of subsets (the clusters) of a larger set of values. For a clustering to be reasonable, similarities of any two values inside one cluster should generally be larger than between any two values from different clusters. This will be ensured below by relying on so-called head elements in the clusters and on a threshold value α that restricts the minimal similarity allowed inside a cluster: if c_i is a cluster, then head_i ∈ c_i and for any other value a ∈ c_i (with a ≠ head_i) it holds that sim(a,head_i) ≥ α. The clustering of the active domain of A induces a horizontal fragmentation of F into fragments F_i ⊆ F such that the active domain of each fragment F_i coincides with one cluster; more formally, c_i = π_A(F_i). For the fragmentation to be complete, we also require the clustering C to be complete; that is, if π_A(F) is the active domain to be clustered, then the complete clustering C = {c₁,…,c_n} covers the whole active domain and no value is lost: c₁ ∪ … ∪ c_n = π_A(F). These requirements are summarized in the definition of a clustering-based fragmentation as follows.

Definition 2 (Clustering-based fragmentation).

Let A be a relaxation attribute; let F be a table instance (a set of tuples); let C = {c₁,…,c_n} be a complete clustering of the active domain π_A(F) of A in F; let head_i ∈ c_i; then, a set of fragments {F₁,…,F_n} (defined over the same attributes as F) is a clustering-based fragmentation if

Horizontal fragmentation: for every fragment F_i, F_i ⊆ F
Clustering: for every F_i there is a cluster c_i ∈ C such that c_i = π_A(F_i) (that is, the active domain of F_i on A is equal to a cluster in C)
Threshold: for every a ∈ c_i (with a ≠ head_i) it holds that sim(a,head_i) ≥ α
Completeness: For every tuple t in F there is an F_i in which t is contained
Reconstructability: F = F₁ ∪ … ∪ F_n
Non-redundancy: for any i ≠ j, F_i ∩ F_j = ∅ (or in other words c_i ∩ c_j = ∅)

Approximation algorithm for clustering

We use and adapt an established approximation algorithm for clustering originally presented by Gonzalez [10]. Its original presentation relies on a notion of distance between any two values. It has a running time of O(k f) for clustering a set of k objects into f clusters. Each cluster is represented by one or more so-called head values; and each value is assigned to the cluster represented by a head with minimal distance to the value. In case the distance measure is metric (in particular, satisfies the triangle inequality), Gonzalez showed that the number of heads obtained by his algorithm is at most twice as much as the optimal number of heads (in other words, it is a 2-approximation of the optimal solution).

Rieck et al. [11] apply this algorithm to malware detection. Instead of fixing the number f of clusters, they use a threshold for the distances of values inside a cluster to the cluster head; hence the number of obtained clusters can differ. This functionality is also needed in our application. We however rely on the notion of similarity between two values (instead of distance) and provide a reformulation of the clustering algorithm here based on [10],[11]. The algorithm starts by assigning all values of the active domain to an initial cluster c₁, choosing an arbitrary element of it as head₁ and then step by step choosing other head elements head₂,…,head_f that have lowest similarity to all other heads and moving other elements to the new clusters c₂,…,c_f; an element is moved to a new cluster when it has higher similarity to the new head element than to the old head element. The algorithm continues finding new heads until a threshold α is reached; α limits the minimum similarity that elements inside a cluster may have to their cluster heads. Listing 2 shows pseudocode for the clustering procedure.

https://static-content.springer.com/image/art%3A10.1186%2Fs13677-014-0018-0/MediaObjects/13677_2014_Article_18_Equh_HTML.gif

Note that the clustering obtained by this heuristic is always complete: any value of π_A(F) is assigned to some cluster c_i. In addition, clusters do not overlap: c_i ∩ c_j = ∅ for each i ≠ j.

Example 4.

In our example, we assume that the pairwise similarities for the values in the active domain of the relaxation attribute Diagnosis are given. We assume further that the pairwise similarities in the subset {Cough, Flu, Asthma} and in the subset {brokenArm, brokenLeg} are higher than any similarity in between these two subsets. In the first clustering step, we choose head₁ arbitrarily - let us assume Flu - and the entire active domain forms cluster c₁. Now as head₂ the value with the lowest similarity is chosen - let us assume brokenArm. Now, all values with higher similarity to brokenArm (than to Flu) are moved to cluster c₂ - which will then consist of {brokenArm, brokenLeg}. If we choose threshold α to lie in between the minimum intra-cluster (of both c₁ and c₂) similarity and the maximum inter-cluster similarity (between pairs of values from c₁ and c₂), we will stop after this second iteration.
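The clustering steps of Example 4 can be traced with a short Python sketch. It follows one reading of the prose description (the next head is the value with the lowest similarity to its current head) and is not a transcription of Listing 2; the function name `cluster`, the similarity values in `SIM`, the default cross-cluster similarity of 0.1 and the threshold 0.5 are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def cluster(values: Sequence[str], sim: Callable[[str, str], float], alpha: float) -> Dict[str, List[str]]:
    """Similarity-based head clustering; returns a mapping head -> cluster members.

    `values` is assumed to be non-empty.
    """
    # Initially all values form one cluster whose head is the first value.
    head_of = {v: values[0] for v in values}
    while True:
        non_heads = [v for v in values if head_of[v] != v]
        if not non_heads:
            break
        # Candidate for the next head: the value least similar to its current head.
        new_head = min(non_heads, key=lambda v: sim(v, head_of[v]))
        if sim(new_head, head_of[new_head]) >= alpha:
            break  # every value is at least alpha-similar to its head
        head_of[new_head] = new_head
        for v in values:
            # Move non-head values that are more similar to the new head.
            if head_of[v] != v and sim(v, new_head) > sim(v, head_of[v]):
                head_of[v] = new_head
    clusters: Dict[str, List[str]] = {}
    for v in values:
        clusters.setdefault(head_of[v], []).append(v)
    return clusters


# Hypothetical pairwise similarities; unlisted pairs default to 0.1.
SIM = {
    frozenset(("Flu", "Cough")): 0.8,
    frozenset(("Flu", "Asthma")): 0.7,
    frozenset(("Cough", "Asthma")): 0.75,
    frozenset(("brokenArm", "brokenLeg")): 0.9,
}


def sim(a: str, b: str) -> float:
    return 1.0 if a == b else SIM.get(frozenset((a, b)), 0.1)


clusters = cluster(["Flu", "Cough", "Asthma", "brokenArm", "brokenLeg"], sim, alpha=0.5)
```

With these assumed values the sketch stops after the second head has been chosen and yields the two clusters of Example 4, with heads Flu and brokenArm.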

Fragmentation and lookup table

When considering only a single relaxation attribute A, we obtain a fragmentation of the corresponding table: a set of fragments F_i - each corresponding to a cluster c_i. A relational algebra expression for each fragment can be stated as follows (using the selection operator σ and a disjunction of equality conditions on A for each value a contained in the cluster):

$$F_i = \sigma_{\mathit{condition}(c_i)}(F) \quad \text{where} \quad \mathit{condition}(c_i) = \bigvee_{a \in c_i} (A = a)$$

The selection operator results in a set of rows - hence a horizontal fragmentation is obtained. Because the clustering is complete, the fragmentation itself will also be complete; hence, in addition a reconstruction of the original instance F is possible by the union operator. Moreover, because clusters do not overlap, we also achieve non-redundancy in this fragmentation. Hence, all properties of Definition 2 will be ensured.
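As an illustration, the selection per cluster can be written as a simple filter over the rows of a table. The following Python sketch assumes rows are dictionaries; the function name `fragment`, the column name `PatientID` and all rows except those of patient 2748 are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def fragment(rows: List[Row], attribute: str, clusters: Dict[str, List[str]]) -> Dict[str, List[Row]]:
    """Compute one fragment per cluster: all rows whose `attribute` value lies in the cluster.

    The membership test plays the role of the disjunction of equality conditions.
    """
    return {
        head: [row for row in rows if row[attribute] in members]
        for head, members in clusters.items()
    }


# Hypothetical instance of the Ill table.
ill = [
    {"PatientID": 2748, "Diagnosis": "Flu"},
    {"PatientID": 2748, "Diagnosis": "Asthma"},
    {"PatientID": 2748, "Diagnosis": "brokenLeg"},
    {"PatientID": 8457, "Diagnosis": "Cough"},
    {"PatientID": 8765, "Diagnosis": "Asthma"},
    {"PatientID": 1055, "Diagnosis": "brokenArm"},
]
clusters = {"Flu": ["Flu", "Cough", "Asthma"], "brokenArm": ["brokenArm", "brokenLeg"]}
fragments = fragment(ill, "Diagnosis", clusters)

# Completeness and non-redundancy: every row ends up in exactly one fragment.
total_rows = sum(len(rows) for rows in fragments.values())
```

Because the clusters are disjoint and cover the active domain, `total_rows` equals the size of the original table in this sketch.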

Example 5.

Based on the above clustering, we obtain two fragments of the Ill table.

https://static-content.springer.com/image/art%3A10.1186%2Fs13677-014-0018-0/MediaObjects/13677_2014_Article_18_Equi_HTML.gif

Fragmentation and replica management are usually supported by lookup tables [12] (also called root tables [13]) that store metadata - for example, information about in which fragment to look for matching tuples when a query arrives. In our case, as we enable flexible query answering in distributed database systems, we create a lookup table that contains:

the fragment ID F_i that is used to solve the data replication problem
the fragment name that will be used in queries to the fragment
the head head_i of the cluster c_i that was used to obtain the fragment F_i as a semantic representative of the values for relaxation attribute A inside fragment F_i
the size w_i of fragment F_i that is used in the data replication problem; for simplicity, in this paper we only count the number of rows - but more advanced size measures can be used, too
an array of the IP addresses or names of the database servers that fragment F_i is assigned to

Example 6.

We insert the following data into the ROOT lookup table where ID is the fragment identifier, Name is the fragment name, S is the fragment size in number of tuples, and Host is the name of the server to which the fragment is assigned.

https://static-content.springer.com/image/art%3A10.1186%2Fs13677-014-0018-0/MediaObjects/13677_2014_Article_18_Equj_HTML.gif

The last missing information - identifying the database server hosting the fragment - is computed by solving a bin packing problem with conflicts (BPPC). The basic idea is that, for f fragments and a replication factor m, each fragment F_i is copied m - 1 times: for F_i (with 1 ≤ i ≤ f) we obtain the copies F_{f+i}, F_{2f+i}, …, F_{(m-1)f+i} so that the total number of fragments is n = f·m. Furthermore, no two copies of fragment F_i (including F_i itself) may be placed on the same server; this will be ensured by a conflict graph where there exist edges between all pairs of copies of F_i.

Example 7.

In our example, when we assume a replication factor of m=2, we have to copy each fragment once. Hence we have that F₁=F₃=Respiratory, each with a size of 4, and F₂=F₄=Fracture, each with a size of 2. The conflict graph then consists of nodes V={F₁,F₂,F₃,F₄} and edges E={(F₁,F₃),(F₂,F₄)}.

As input information for the BPPC we hence need:

the capacity W of each of the database servers based on some configuration information of the distributed database system
the replication factor m based on some configuration information of the distributed database system
F_i as well as m - 1 copies F_{f+i}, F_{2f+i}, …, F_{(m-1)f+i} of each F_i (where 1 ≤ i ≤ f)
the sizes w_i for each F_i (where 1 ≤ i ≤ n) where the copies of a fragment have the same size as the fragment itself
the conflict graph G where the set of n = f·m nodes is the set of fragments and their copies - that is, V={F₁,F₂,…,F_f,F_{f+1},…,F_n} - and the set of undirected binary edges E consists of the sets E_i (where 1 ≤ i ≤ f) of pairs (X,Y) of a fragment F_i and all its copies - that is, $E = \bigcup_{i = 1}^{f} E_{i}$ where E_i = {(X,Y) | X,Y ∈ {F_i, F_{f+i}, F_{2f+i}, …, F_{(m-1)f+i}}, X ≠ Y}.

When solving the usual ILP formulation of BPPC (as shown in Equations 6 to 11) with these inputs, we obtain a solution that occupies the minimal number of servers (bins) while respecting the different sizes w_i that the fragments (for a single relaxation attribute A) may have as well as ensuring the replication factor. An example with the ILP solver lpsolve is provided in an upcoming section.
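The construction of these inputs from the fragment sizes and the replication factor can be sketched in a few lines of Python. The function name `replication_inputs` is an assumption; indices are 0-based, so index r·f + i in the code corresponds to fragment F_{r·f+i+1} in the text.

```python
from typing import List, Sequence, Tuple


def replication_inputs(sizes: Sequence[int], m: int) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Build weights and conflict edges for f fragments replicated m times.

    The r-th copy of fragment i is stored at index r * f + i.
    """
    f = len(sizes)
    # Copies have the same size as the original fragment.
    weights = [sizes[i] for _ in range(m) for i in range(f)]
    # One undirected edge between every pair of copies of the same fragment.
    conflicts = [
        (r * f + i, s * f + i)
        for i in range(f)
        for r in range(m)
        for s in range(r + 1, m)
    ]
    return weights, conflicts


# Example 7: Respiratory (size 4) and Fracture (size 2) with replication factor m = 2.
weights, conflicts = replication_inputs([4, 2], m=2)
```

For Example 7 this produces four weights and the two edges between each fragment and its single copy; the weights and edges can then be translated into the coefficients of Equations 8 and 9.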

As opposed to lookup tables for individual tuples [12], we only store a row per fragment (and only the appropriate cluster head). Due to this, the lookup tables are small and lookups can be faster. That is why we assume that there is only a master server for the lookup table; one hot backup server can be used that can take over the task of the master server in case of a failure. Alternatively, distribution of the lookup table to all replica servers can be used; however, this incurs extra overhead and consistency problems [12].

Query rewriting

Flexible query answering can now be executed on the obtained clustering-based fragmentation. Queries are rewritten and redirected to the appropriate fragment with the help of the lookup table as follows:

The user sends a query to the database system with a selection condition containing a constant a for the relaxation attribute A.

The database system checks if there is a head value head_i in the lookup table such that head_i = a. Then the appropriate fragment F_i is already identified and the next three steps can be skipped.

Otherwise the database system reads all f head values from the lookup table.

The database system computes all similarities sim(a, head_i) (for 1 ≤ i ≤ f).

The database system chooses a head head_i with maximum similarity to a and thereby identifies the appropriate fragment F_i. A threshold β can be provided by the user to limit this similarity divergence.

The database system rewrites the query by replacing the original table name with the identified fragment name and removes the selection condition containing a for the relaxation attribute.

The rewritten query is redirected to the server that hosts the identified fragment.

The server can return the entire fragment for the rewritten query with the assertion that the distance threshold β is not exceeded and hence the answers are relevant for the user.

If the query contains multiple selection conditions for the relaxation attribute, several query rewritings will be executed and these queries can be redirected to different servers.
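The rewriting steps above can be sketched as follows. This is a minimal illustration, not the prototype's code: the query representation (a table name plus a dictionary of selection conditions), the function names, the server names S1 and S2, and the reading of β as a minimum similarity are all assumptions. The two similarity values are the path similarities of bronchitis to the heads asthma and tibial fracture that are reported in the implementation section.

```python
def choose_fragment(a, lookup, sim, beta=None):
    """Return (fragment, server) for constant a, or None if no head is similar enough.

    lookup maps each head value to (fragment name, server).
    """
    if a in lookup:                      # a is itself a head: no similarity needed
        return lookup[a]
    best = max(lookup, key=lambda head: sim(a, head))
    if beta is not None and sim(a, best) < beta:
        return None                      # assumed meaning of beta: minimum similarity
    return lookup[best]


def rewrite(table, conditions, relax_attr, lookup, sim, beta=None):
    """Replace the table by the chosen fragment and drop the relaxed condition."""
    target = choose_fragment(conditions[relax_attr], lookup, sim, beta)
    if target is None:
        return None
    fragment, server = target
    remaining = {k: v for k, v in conditions.items() if k != relax_attr}
    return fragment, remaining, server


# Hypothetical lookup table keyed by head value.
lookup = {"asthma": ("Respiratory", "S1"), "tibial fracture": ("Fracture", "S2")}
PATH_SIM = {("bronchitis", "asthma"): 0.3333,
            ("bronchitis", "tibial fracture"): 0.1429}


def sim(a, head):
    return PATH_SIM[(a, head)]


print(rewrite("Ill", {"disease": "bronchitis"}, "disease", lookup, sim))
```

The original table name is not needed after rewriting because the fragment name replaces it; the remaining conditions (none in this example) are passed on unchanged to the server hosting the fragment.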

Example 8.

In the example query Q(x₁,x₂,x₃) = Ill(x₁,Flu) ∧ Ill(x₁,Cough) ∧ Info(x₁,x₂,x₃) the constant Cough is anti-instantiated. The fragment matching the Cough constant is the one containing respiratory diseases because we assume that sim(Flu,Cough) > sim(brokenArm,Cough) holds. The second constant for the relaxation attribute in this query is Flu; however, Flu is a head element of the corresponding fragment and hence no similarities have to be computed. The anti-instantiated query is

\begin{array}{lcr} Q^{AI} (x_{1}, x_{2}, x_{3}, y, y^{'}) = Respiratory (x_{1}, y) \\ \land Respiratory (x_{1}, y^{'}) \land Info (x_{1}, x_{2}, x_{3}) \land y \neq y^{'} \end{array}

The inequality condition on the new variables is necessary to only obtain answers where the two disease values found in the Respiratory fragment differ. A distributed join on x₁ has to be executed to combine the data from the Info table with the data from the Respiratory fragment; we will later on discuss how this overhead can be avoided by using derived fragmentation. Because the query is redirected to the fragment with highest similarity, in this case only the first informative answer (see Example 3) with the disease asthma Ill(2748,Flu) ∧ Ill(2748,Asthma) ∧ Info(2748,Mary,'New Str 3, Newtown') is returned. In contrast, the answer for the disease brokenLeg is suppressed because it resides in the Fracture fragment.

The computation of distributed joins cannot be avoided if subqueries must be redirected to different servers. We argue, however, that with any other conventional data replication scheme (like [12],[14]) distributed joins have to be processed, too, whereas our scheme adds support for flexible query answering.

Example 9.

Consider the example query

\begin{array}{lcr} Q (x_{1}, x_{2}, x_{3}) = Ill (x_{1}, brokenLeg) \\ \land Ill (x_{1}, Cough) \land Info (x_{1}, x_{2}, x_{3}) \end{array}

The query has to be rewritten into the query

\begin{array}{lcr} Q^{AI} (x_{1}, x_{2}, x_{3}, y, y^{'}) = Fracture (x_{1}, y) \\ \land Respiratory (x_{1}, y^{'}) \land Info (x_{1}, x_{2}, x_{3}) \end{array}

which has to be answered by both the Fracture and the Respiratory fragment. It may happen that the Respiratory, Fracture and Info tables all reside on different servers and so we would have to compute a three-way distributed join on x₁.

Improving data locality with derived fragmentations

Apart from failure tolerance and load balancing, another important issue for cloud storage is data locality: Data that are often accessed together should be stored on the same server in order to avoid excessive network traffic and delays. That is why we propose to compute a derived fragmentation for each table that shares join attributes with the primary table (for which the clustering-based fragmentation was computed). Each derived fragment should then be assigned to the same database server on which the primary fragment with the matching join attribute values resides.

Hence for a given fragmentation {F₁,…,F_f} of a primary table F we compute the corresponding fragmentation {G₁,…,G_f} of any table G sharing join attributes with F as a semijoin of G with each fragment, which is equivalent to the projection of the natural join of G and F_i onto the attributes of G: $G_{i} = G \ltimes F_{i} = \pi_{Attr(G)}(G \bowtie F_{i})$.

Example 10.

In our example we can join both the Treat as well as the Info table with the Ill table. Because we have two fragments of Ill, we obtain two derived fragments of Treat and Info as well: the first set of derived fragments is called Treat_resp and Info_resp based on a join on patient IDs occurring in the primary Respiratory fragment.

https://static-content.springer.com/image/art%3A10.1186%2Fs13677-014-0018-0/MediaObjects/13677_2014_Article_18_Equk_HTML.gif

The second set of derived fragments is Treat_frac and Info_frac based on a join on patient IDs occurring in the primary Fracture fragment.

https://static-content.springer.com/image/art%3A10.1186%2Fs13677-014-0018-0/MediaObjects/13677_2014_Article_18_Equl_HTML.gif

Note that non-redundancy of derived fragments is difficult to achieve (this is also discussed in [1]). We opt for having some redundancy in the derived fragments for the sake of better data locality and hence better performance of query answering. That is why the information for patient Mary occurs in both derived fragments; the same applies to the treatment fragments.
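The derived fragmentation can be tried out on a toy database. The sketch below uses Python's built-in sqlite3 module instead of the PostgreSQL setup of the prototype, and the column names as well as all rows other than Mary's are hypothetical. The helper `derive_fragment` (also a hypothetical name) computes the semijoin of a table with a primary fragment via an `IN` subquery on the join attribute.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Respiratory(patientid INTEGER, disease TEXT);
CREATE TABLE Fracture(patientid INTEGER, disease TEXT);
CREATE TABLE Info(patientid INTEGER, name TEXT, address TEXT);
INSERT INTO Respiratory VALUES (2748, 'Flu'), (2748, 'Asthma'), (8457, 'Cough');
INSERT INTO Fracture VALUES (2748, 'brokenLeg'), (2784, 'brokenArm');
INSERT INTO Info VALUES (2748, 'Mary', 'New Str 3, Newtown'),
                        (8457, 'Pete', 'Main Str 5, Oldtown'),
                        (2784, 'Lisa', 'Main Str 20, Oldtown');
""")


def derive_fragment(con, table, primary_fragment, join_attr="patientid"):
    """Materialize the semijoin of `table` with `primary_fragment`."""
    name = f"{table}_{primary_fragment[:4].lower()}"   # e.g. Info_resp, Info_frac
    con.execute(
        f"CREATE TABLE {name} AS SELECT * FROM {table} "
        f"WHERE {join_attr} IN (SELECT {join_attr} FROM {primary_fragment})"
    )
    return name


info_resp = derive_fragment(con, "Info", "Respiratory")
info_frac = derive_fragment(con, "Info", "Fracture")
for frag in (info_resp, info_frac):
    print(frag, con.execute(f"SELECT name FROM {frag} ORDER BY name").fetchall())
```

Because patient 2748 occurs in both primary fragments, Mary's row is copied into both derived fragments, which is the redundancy discussed above.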

Data replication for derived fragments

We maintain separate lookup tables for each (primary and derived) fragmentation of each table. Hence, the sizes of the derived fragments are also computed and stored in the corresponding lookup table. These sizes of the derived fragments must be taken into account for the data replication procedure and are encoded in the BPPC as follows. The capacity W, the replication factor m, the primary fragments and their m - 1 copies as well as the conflict graph stay the same as before; the only input that changes is the sizes w_i assigned to the fragments:

the sizes w_i are now computed as the sum of the size of the primary fragment F_i plus the size of any derived fragment G_i.
solving the BPPC results in a placement where the primary fragment fits on the server together with all its derived fragments.
the primary fragment and its derived fragments are hence assigned to the same server and the server information in the lookup tables is inserted accordingly.
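As a small worked example of the changed input, the following sketch adds the sizes of the derived fragments to the size of their primary fragment. The function name and all row counts are hypothetical.

```python
def combined_sizes(primary_sizes, derived_sizes):
    """w_i = size of primary fragment F_i plus the sizes of all its derived fragments."""
    return [
        p + sum(sizes[i] for sizes in derived_sizes.values())
        for i, p in enumerate(primary_sizes)
    ]


# Hypothetical row counts for two primary fragments and their derived fragments.
w = combined_sizes([3, 2], {"Info": [2, 2], "Treat": [4, 3]})
print(w)
```

The resulting values replace the original w_i in the BPPC, so a server is only chosen for a primary fragment if the derived fragments fit there as well.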

Implementation and example

Our prototype implementation is based on PostgreSQL and the UMLS::Similarity implementation. In the following subsections we describe the steps that the prototype executes.

UMLS and its similarity measures

The Unified Medical Language System incorporates several taxonomies from the medical domain like the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT), or Medical Subject Headings (MeSH). It unifies these taxonomies assigning Concept Unique Identifiers (CUI) to terms so that shared terms in the different taxonomies have the same identifier.

The Perl program UMLS::Similarity [15] offers an implementation of several standard similarity measures. They can be differentiated into measures based solely on path lengths in a taxonomy and measures taking the so-called information content [16] into account. The information content (IC) is computed from a pre-assigned estimated probability p(c) of each leaf term in the taxonomy (assuming a parent-child or is-a relationship in the taxonomy); for inner nodes that subsume other terms, this probability must be larger than for any child node (for example, by summing over all child nodes) because this concept covers all its child concepts. The information content is then defined as the negative log likelihood: $IC(c) = - \log p(c)$. In this way, the higher a term is located in the taxonomy, the more abstract the term is and the lower its information content; the unique root node of the taxonomy has probability 1 and hence IC 0 - that is, no information content.
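The definition of information content can be illustrated on a toy taxonomy. The taxonomy, the leaf probabilities and the function names below are hypothetical; the sketch uses the summation over child nodes that the text mentions as one possible choice for inner nodes.

```python
import math

# Hypothetical is-a taxonomy (parent -> children) and leaf probabilities.
children = {
    "root": ["respiratory", "fracture"],
    "respiratory": ["asthma", "cough"],
    "fracture": ["tibial fracture"],
}
leaf_p = {"asthma": 0.25, "cough": 0.25, "tibial fracture": 0.5}


def probability(term):
    """p(c): given for leaves, sum over the children for inner nodes."""
    if term in leaf_p:
        return leaf_p[term]
    return sum(probability(child) for child in children[term])


def information_content(term):
    """IC(c) = -log p(c)."""
    return -math.log(probability(term))


for term in ("root", "respiratory", "asthma"):
    print(term, round(information_content(term), 4))
```

The root covers all leaves, so its probability is 1 and its IC is 0, while the more specific term asthma has a higher IC than its parent.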

UMLS::Similarity offers implementations of the following measures based on path lengths:

Path length (path) counts the nodes occurring on a path between two terms a and b and takes the inverse: $sim (a, b) = \frac{1}{length (path (a, b))}$ .
Leacock and Chodorow (lch) [17] use the length of the shortest path between two terms but also consider the overall maximum depth d_max of the taxonomy: $sim (a, b) = - \log \frac{length (path (a, b))}{2 \cdot d_{\max}}$
Wu and Palmer (wup) [18] consider the depth of terms - that is, the length of the path from the root node to the term. It first calculates the depths of the two terms and the depth of their least common subsumer (LCS) and then calculates similarity as twice the LCS depth divided by the sum of the depths of the two terms: $sim (a, b) = \frac{2 \cdot depth (lcs (a, b))}{depth (a) + depth (b)}$
Conceptual distance (cdist) refers to the path length between two terms; while in the original case paths between terms were defined with respect to whether a meaning was narrower or broader ([19]), later on the paths in a parent-child (is-a) relationship were considered [20] - that is why in the latter case cdist coincides with path.
Al-Mubaid and Nguyen (nam) [21] combine path length and depth into one measure; they consider the overall maximum depth d_max of the taxonomy, the depth of the least common subsumer of the two comparison terms and the shortest path length between the two terms. UMLS::Similarity returns the inverse of this distance measure, that is: $\frac{\log 2}{(length (path (a, b)) \cdot 1) \cdot (d_{\max} - depth (lcs (a, b))) + 1}$
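Three of the path-based formulas above are simple enough to evaluate by hand. The sketch below implements path, lch and wup as plain functions of path lengths and depths; the function names are hypothetical. The value d_max = 20 is not stated in the text but is an assumption inferred from the similarity values in Table 1, and the depths passed to `sim_wup` are hypothetical.

```python
import math


def sim_path(path_len):
    """path: inverse of the number of nodes on the path between a and b."""
    return 1.0 / path_len


def sim_lch(path_len, d_max):
    """lch: -log(length / (2 * d_max)), using the natural logarithm."""
    return -math.log(path_len / (2.0 * d_max))


def sim_wup(depth_a, depth_b, depth_lcs):
    """wup: 2 * depth(lcs) / (depth(a) + depth(b))."""
    return 2.0 * depth_lcs / (depth_a + depth_b)


D_MAX = 20  # assumption inferred from Table 1, not given in the text
for n in (3, 5, 7):
    print(n, round(sim_path(n), 4), round(sim_lch(n, D_MAX), 4))
print(round(sim_wup(3, 3, 2), 4))  # hypothetical depths
```

Under this assumption a path of five nodes yields the path and lch values listed in Table 1 for asthma and cough (0.2 and 2.0794), and a path of three nodes yields those for tibial fracture and ulna fracture.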

UMLS::Similarity offers implementations of the following measures incorporating information content (IC):

Resnik (res) [16] proposed to use the information content of the least common subsumer (LCS): $sim (a, b) = IC (lcs (a, b))$
Jiang and Conrath (jcn) [22] use the inverse of a distance that is based on the IC of the two terms and the IC of the least common subsumer: $sim (a, b) = \frac{1}{IC (a) + IC (b) - 2 \cdot IC (lcs (a, b))}$
Lin (lin) [23] takes twice the IC of the LCS and divides it by the sum of the ICs of the two terms: $sim (a, b) = \frac{2 \cdot IC (lcs (a, b))}{IC (a) + IC (b)}$
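The three IC-based formulas are linked: res and lin together determine IC(a) + IC(b), from which jcn follows. The sketch below uses this to cross-check the res, lin and jcn values that Table 1 reports for asthma and cough. The function names are hypothetical, and because only the sum of the two ICs can be recovered, splitting it evenly between the two terms is an arbitrary choice.

```python
def sim_res(ic_lcs):
    return ic_lcs


def sim_jcn(ic_a, ic_b, ic_lcs):
    return 1.0 / (ic_a + ic_b - 2.0 * ic_lcs)


def sim_lin(ic_a, ic_b, ic_lcs):
    return 2.0 * ic_lcs / (ic_a + ic_b)


# Values for asthma and cough from Table 1.
res, lin = 2.5963, 0.6175
ic_sum = 2.0 * res / lin            # IC(a) + IC(b), solved from the lin formula
ic_a = ic_b = ic_sum / 2.0          # arbitrary split; only the sum matters here
jcn = sim_jcn(ic_a, ic_b, res)
print(round(jcn, 4))
```

The recomputed jcn value agrees with the 0.3109 listed in Table 1 up to rounding, so the three reported values are mutually consistent with the formulas.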

We used the UMLS::Similarity web interface with the MeSH taxonomy to obtain the pair-wise similarity of the set of terms asthma, cough, influenza, tibial fracture and ulna fracture. Figure 1 shows how the terms are related by an is-a relationship in the MeSH taxonomy. Table 1 shows the similarity values obtained. Due to symmetry of the terms in the taxonomy (the path lengths and LCSs are mostly identical), the similarity values do not differ much in the two subsets asthma, cough and influenza, as opposed to tibial fracture and ulna fracture; the only difference is obtained with the two measures where the IC of the respective terms a, b are taken into account - namely jcn and lin.

Table 1

Sample similarity obtained with UMLS::Similarity

Term pair	jcn	cdist	lin	wup	path	res	lch	nam
Asthma / Cough	0.3109	0.2	0.6175	0.6667	0.2	2.5963	2.0794	0.1621
Asthma / Influenza	0.3844	0.2	0.6662	0.6667	0.2	2.5963	2.0794	0.1621
Asthma / Tibial fracture	0.1405	0.1429	0.2116	0.5556	0.1429	0.9555	1.743	0.1483
Asthma / Ulna fracture	0.1282	0.1429	0.1968	0.5556	0.1429	0.9555	1.743	0.1483
Cough / Influenza	0.2958	0.2	0.6057	0.6667	0.2	2.5963	2.0794	0.1621
Cough / Tibial fracture	0.1266	0.1429	0.1948	0.5556	0.1429	0.9555	1.743	0.1483
Cough / Ulna fracture	0.1166	0.1429	0.1822	0.5556	0.1429	0.9555	1.743	0.1483
Influenza / Tibial fracture	0.1373	0.1429	0.2079	0.5556	0.1429	0.9555	1.743	0.1483
Influenza / Ulna fracture	0.1256	0.1429	0.1936	0.5556	0.1429	0.9555	1.743	0.1483
Tibial fracture / Ulna fracture	0.243	0.3333	0.6295	0.7778	0.3333	3.4961	2.5903	0.1867

The similarity of each term to itself is maximal (max) for every measure.

Clustering and fragmentation

The clustering heuristic has been implemented as a Java module that calls the UMLS::Similarity web interface. When using the clustering heuristic with the given similarities, regardless of which heads we choose, after two steps we obtain the two clusters {asthma, cough, influenza} and {tibial fracture, ulna fracture}. Let us assume we choose asthma as head₁; then we compute all similarities to asthma. The term with the lowest similarity is ulna fracture, which is taken to be head₂. Because tibial fracture has higher similarity to ulna fracture than to asthma, it is assigned to c₂. For an appropriate threshold α (depending on the similarity measure chosen) the process could stop here. If we instead continued the clustering, we would eventually obtain a total clustering consisting of only singleton sets: tibial fracture would become head₃ (because it has the lowest similarity to its head, head₂); later on, cough would become head₄ and influenza would be head₅.
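The clustering steps described above can be replayed with the path similarities of Table 1 and the threshold α = 0.15 used in this example. The sketch below is not the Java module of the prototype. Its head selection rule (take the element that is least similar to its current head, and stop once every element has similarity of at least α to its head) is an assumption modelled on the description above, and the function names are hypothetical.

```python
# Path similarities from Table 1 (symmetric, so keyed by unordered pairs).
PATH_SIM = {
    frozenset({"asthma", "cough"}): 0.2,
    frozenset({"asthma", "influenza"}): 0.2,
    frozenset({"cough", "influenza"}): 0.2,
    frozenset({"asthma", "tibial fracture"}): 0.1429,
    frozenset({"asthma", "ulna fracture"}): 0.1429,
    frozenset({"cough", "tibial fracture"}): 0.1429,
    frozenset({"cough", "ulna fracture"}): 0.1429,
    frozenset({"influenza", "tibial fracture"}): 0.1429,
    frozenset({"influenza", "ulna fracture"}): 0.1429,
    frozenset({"tibial fracture", "ulna fracture"}): 0.3333,
}


def sim(a, b):
    return PATH_SIM[frozenset({a, b})]


def cluster(terms, sim, alpha, first_head):
    """Head-based clustering: open a new cluster while some term is below alpha."""
    heads = [first_head]
    head_of = {t: first_head for t in terms}
    while True:
        others = [t for t in terms if t not in heads]
        if not others:
            break
        # Candidate for the next head: the term least similar to its current head.
        candidate = min(others, key=lambda t: sim(t, head_of[t]))
        if sim(candidate, head_of[candidate]) >= alpha:
            break
        heads.append(candidate)
        head_of[candidate] = candidate
        # Move terms that are more similar to the new head than to their old one.
        for t in others:
            if t != candidate and sim(t, candidate) > sim(t, head_of[t]):
                head_of[t] = candidate
    clusters = {h: set() for h in heads}
    for t, h in head_of.items():
        clusters[h].add(t)
    return clusters


terms = ["asthma", "cough", "influenza", "tibial fracture", "ulna fracture"]
clusters = cluster(terms, sim, alpha=0.15, first_head="asthma")
print(clusters)
```

With the path measure, tibial fracture and ulna fracture are equally dissimilar to asthma, so which of the two becomes the second head depends on tie-breaking; the resulting two clusters are the same in both cases.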

Choosing the path similarity and a threshold α=0.15 results in the two mentioned clusters. A fragmentation of the base table Ill can hence be obtained by computing the following materialized views in the Postgres database.

https://static-content.springer.com/image/art%3A10.1186%2Fs13677-014-0018-0/MediaObjects/13677_2014_Article_18_Equm_HTML.gif

Sophisticated size estimations for these fragments might be possible as stated previously. We obtain the sizes of the fragments by counting the number of rows:

https://static-content.springer.com/image/art%3A10.1186%2Fs13677-014-0018-0/MediaObjects/13677_2014_Article_18_Equn_HTML.gif

and

https://static-content.springer.com/image/art%3A10.1186%2Fs13677-014-0018-0/MediaObjects/13677_2014_Article_18_Equo_HTML.gif

Next, we fill the lookup table with this information as follows:

https://static-content.springer.com/image/art%3A10.1186%2Fs13677-014-0018-0/MediaObjects/13677_2014_Article_18_Equp_HTML.gif

To obtain the placement of the fragments to servers we model the corresponding BPPC and use the solver lp_solve [24]. lp_solve has a simple human-readable syntax and can be accessed by a Java program via the Java Native Interface (JNI). An example input for K=5 (maximum number of servers), W=5 (capacity per server), m=2 (replication factor) looks as shown in Listing 3.

https://static-content.springer.com/image/art%3A10.1186%2Fs13677-014-0018-0/MediaObjects/13677_2014_Article_18_Equq_HTML.gif

The solution uses four servers (out of the five available ones): one for each of the two fragments and one for each of their copies. If the capacity is increased to W=6, only two servers are used: the two fragments now fit on one server and the two copies on another server.
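The effect of the capacity on the number of servers can be reproduced on a small scale. The following sketch does not call lp_solve; it replaces the ILP solver by an exhaustive search over all placements, which is only practical for tiny instances. The fragment sizes (3 rows each) are hypothetical values chosen so that two fragments fit together on a server with W=6 but not with W=5, and the function name is hypothetical as well.

```python
from itertools import product


def min_servers(fragment_sizes, m, W, K):
    """Exhaustively search for a placement using as few of the K servers as possible.

    Items are numbered 0..n-1; item j is a copy of fragment j % f, so two items
    conflict (must not share a server) exactly when they agree modulo f.
    Returns (number of servers used, placement) or None if nothing is feasible.
    """
    f = len(fragment_sizes)
    n = f * m
    sizes = [fragment_sizes[j % f] for j in range(n)]
    best = None
    for placement in product(range(K), repeat=n):
        if any(placement[a] == placement[b]
               for a in range(n) for b in range(a + 1, n) if a % f == b % f):
            continue                      # a fragment shares a server with its copy
        load = [0] * K
        for item, server in enumerate(placement):
            load[server] += sizes[item]
        if max(load) > W:
            continue                      # capacity exceeded on some server
        used = len(set(placement))
        if best is None or used < best[0]:
            best = (used, placement)
    return best


for W in (5, 6):
    used, placement = min_servers([3, 3], m=2, W=W, K=5)
    print(f"W={W}: {used} servers, placement {placement}")
```

For these assumed sizes the search needs four servers with W=5 and two servers with W=6, which mirrors the behaviour described above.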

For improved efficiency, the root table is currently stored as a hash map in the Java frontend (instead of being stored in a separate database table). The hash map is keyed by the head element, because the heads are necessary for the query rewriting module.

Query rewriting

The query rewriting procedure has to parse the SQL string inserted by the user. If a selection condition is given for the relaxation attribute, the root table is consulted to check for a matching head element. If none is found, again the UMLS::Similarity interface is consulted to obtain the similarities between head elements and the selection condition.

As a simple example, consider the SQL query

https://static-content.springer.com/image/art%3A10.1186%2Fs13677-014-0018-0/MediaObjects/13677_2014_Article_18_Equr_HTML.gif

When comparing bronchitis to the first head (that is, asthma), UMLS::Similarity gives the following similarity results: 'The similarity of bronchitis (C0006277) and asthma (C0004096) using (jcn) is 1.0305, (cdist) is 0.3333, (lin) is 0.881, (wup) is 0.8, (path) is 0.3333, (res) is 3.5921, (lch) is 2.5903, (nam) is 0.1867'

Whereas comparing bronchitis to the second head (that is, tibial fracture), UMLS::Similarity gives the following similarity results: 'The similarity of bronchitis (C0006277) and tibial fracture (C0040185) using (jcn) is 0.1308, (cdist) is 0.1429, (lin) is 0.2, (wup) is 0.5556, (path) is 0.1429, (res) is 0.9555, (lch) is 1.743, (nam) is 0.1483'

Hence, asthma is more similar to bronchitis in every measure. The SQL query is rewritten by retrieving the appropriate fragment name and redirected to the appropriate server identified from the root table:

https://static-content.springer.com/image/art%3A10.1186%2Fs13677-014-0018-0/MediaObjects/13677_2014_Article_18_Equs_HTML.gif

That is, the entire fragment is returned as it is the one with the most relevant answers for the user.
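The claim that asthma is the more similar head under every measure can be checked directly from the two UMLS::Similarity outputs quoted above. The sketch below parses the quoted strings with a regular expression; the function name `parse_measures` is hypothetical and the parsing is tailored to exactly this output wording.

```python
import re

asthma_out = (
    "The similarity of bronchitis (C0006277) and asthma (C0004096) using "
    "(jcn) is 1.0305, (cdist) is 0.3333, (lin) is 0.881, (wup) is 0.8, "
    "(path) is 0.3333, (res) is 3.5921, (lch) is 2.5903, (nam) is 0.1867"
)
fracture_out = (
    "The similarity of bronchitis (C0006277) and tibial fracture (C0040185) using "
    "(jcn) is 0.1308, (cdist) is 0.1429, (lin) is 0.2, (wup) is 0.5556, "
    "(path) is 0.1429, (res) is 0.9555, (lch) is 1.743, (nam) is 0.1483"
)


def parse_measures(text):
    """Extract {measure name: value} from a '(name) is value' list."""
    pairs = re.findall(r"\((\w+)\) is (\d+(?:\.\d+)?)", text)
    return {name: float(value) for name, value in pairs}


by_head = {
    "asthma": parse_measures(asthma_out),
    "tibial fracture": parse_measures(fracture_out),
}
best_per_measure = {
    measure: max(by_head, key=lambda head: by_head[head][measure])
    for measure in by_head["asthma"]
}
print(best_per_measure)
```

The concept identifiers in parentheses are not followed by the word "is" and are therefore skipped by the pattern, so only the eight measures are extracted.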

Experimental analysis

In general, any flexible query answering approach incurs a certain performance overhead compared to exact query answering. In our case, the clustering and fragmentation have to be computed, and query answering also incurs some extra overhead because the appropriate fragment has to be identified and multiple answers are returned, whereas exact query answering would simply have returned an empty answer set. With our approach, however, we aim to reduce this overhead by locating all related answers in the same fragment; any other fragmentation approach would need to recombine related answers from different fragments.

For a performance evaluation of our prototype we used a test dataset consisting of values taken from the list of Medical Subject Headings (MeSH) [25]. The similarity computation during the clustering constituted an extreme overhead. That is why we computed pairwise similarities for 300 sample headings and stored these similarity values in a separate table. With these 300 values we randomly filled the disease column of a test table. We varied the row count between 10 and 1000 rows. Another parameter to vary is the threshold α for the intra-cluster similarity: the minimum similarity that values in a cluster must have to their respective head. For a higher threshold, more clusters are computed (and hence more similarity computations are executed) than for a lower threshold. We tested similarity thresholds 0.1, 0.125, 0.3 and 0.5. We ran the clustering and fragmentation algorithm on a 1.73 GHz PC with 4 GB RAM. The observed runtimes and number of obtained fragments are reported in Table 2. For the lower threshold values 0.1 and 0.125, runtimes range from about one second up to around 17 minutes. For a row count of 1000 rows the higher threshold values lead to a high number of fragments and runtimes are hence prohibitively high.

Due to the large number of pairwise comparisons, obtaining the similarity values is still the bottleneck of the clustering procedure. In future work we will follow two ways of improving scalability of the clustering: optimizing the access to the similarity values and investigating implications of a parallel implementation of the clustering procedure.

Table 2

Runtime and fragment count obtained for MeSH dataset

α	10 rows: runtime (ms)	10 rows: fragment count	100 rows: runtime (ms)	100 rows: fragment count	1000 rows: runtime (ms)	1000 rows: fragment count
0.1	971	2	18200	4	709627	12
0.125	1211	3	32900	7	1038085	17
0.3	2658	8	201959	45	4132254	94
0.5	2415	10	244161	69	7428473	233
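To put the runtimes of Table 2 in perspective, the following sketch converts the values for 1000 rows from milliseconds to minutes. It uses only the numbers reported in the table; the variable names are hypothetical.

```python
# Runtimes in ms from Table 2, keyed by threshold alpha; columns: 10, 100, 1000 rows.
runtime_ms = {
    0.1: (971, 18200, 709627),
    0.125: (1211, 32900, 1038085),
    0.3: (2658, 201959, 4132254),
    0.5: (2415, 244161, 7428473),
}

# Convert the 1000-row runtimes to minutes (60000 ms per minute).
minutes_1000_rows = {
    alpha: round(ms[2] / 60000, 1) for alpha, ms in runtime_ms.items()
}
for alpha, minutes in minutes_1000_rows.items():
    print(f"alpha={alpha}: {minutes} min")
```

For α = 0.125 this gives roughly 17.3 minutes, while the runtime for α = 0.5 exceeds two hours.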

Related work

We divide the related work survey into approaches for flexible query answering and approaches for data fragmentation and replication.

Flexible query answering

The area of flexible query answering (sometimes also called cooperative query answering) has been studied extensively for single server systems. Some approaches have used taxonomies or ontologies for flexible query answering but did not consider their application for distributed storage of data: CoBase [26] used a type abstraction hierarchy to generalize values; Shin et al. [27] use some specific notion of metric distance in a knowledge abstraction hierarchy to identify semantically related answers; Halder and Cortesi [28] define a partial order between cooperative answers based on their abstract interpretation framework; Muslea [29] discusses the relaxation of queries in disjunctive normal form. Ontology-based query relaxation has also been studied for non-relational data (like XML data in [30]).

All these approaches address query relaxation at runtime while answering the query. This is usually prohibitively expensive. In contrast, our approach precomputes the clustering and fragmentation so that query answering does not incur a performance penalty.

Data fragmentation and replication

There are some approaches for fine-grained fragmentation and replication on object/tuple level; however none of these approaches support the flexible query answering application aimed at in this paper. In contrast they are mostly workload-driven and try to optimize the locality of data that are covered in the same query. However, they only support exact query answering. In contrast to this, we do not consider workloads but a generic clustering approach that can work with arbitrary workloads providing the feature of flexible query answering by finding semantically related answers. While some approaches are adaptive to updates, no quality guarantee after an update is reported. We intend to extend our approach in the future by bringing robust optimization to the data replication area. Loukopoulos and Ahmad [6] describe data replication as an optimization problem; they focus on fine-grained geo-replication for individual objects. They include an assumed number of reads and writes for each site as well as communication costs between sites. They reduce their problem to the Knapsack problem. In particular, they devise an adaptive genetic algorithm that can reallocate data to different sites. We aim to follow a different path to support this adaptive behavior: the notion of robust optimization is briefly discussed in Section Discussion and conclusion.

Curino et al. [14] represent database tuples as nodes in a graph. They assume a given transaction workload and add hyperedges to the graph between those nodes that are accessed by the same transactions. By using a standard graph partitioning algorithm, they find a database fragmentation that minimizes the number of cut hyperedges. In a second phase, they use a machine learning classifier to derive a range-based fragmentation. Then they make an experimental comparison between the graph-based fragmentation, the range-based fragmentation, a hash-based fragmentation on tuple keys and full replication. Lastly, they also compare three different kinds of lookup tables to map tuple identifiers to the corresponding fragments: indexes, bit arrays and Bloom filters. Similar to them, we apply lookup tables to locate the replicated data; however, we apply this to larger fragments and not to individual tuples.

Quamar et al. [31] also model the fragmentation problem as minimizing cuts of hyperedges in a graph; for efficiency reasons, their algorithm works on a compressed representation of the hypergraph which results in groups of tuples. In particular, the authors criticize the fine-grained (tuple-wise) approach in [14] as impractical for a large number of tuples; in working with groups of tuples rather than individual tuples, their approach is similar to ours. The authors propose mechanisms to handle changes in the workload and compare their approach to random and tuple-level partitioning.

Tatarowicz et al. [12] assume three existing fragmentations: hash-based, range-based and lookup tables for individual keys and compare those in terms of communication cost and throughput. For an efficient management of lookup tables, they experimented with different compression techniques. In particular they argue that for hash-based partitioning, the query decomposition step is a bottleneck. While we apply the notion of lookup tables, too, the authors do not discuss how the fragments are obtained, whereas we propose a semantically guided fragmentation approach here.

Discussion and conclusion

In this paper we proposed an intelligent fragmentation and replication approach for a distributed database system; with this approach, cloud storage can be enhanced with a semantically-guided flexible query answering mechanism that will provide related but still very relevant answers for the user. The approach combines fragmentation based on a clustering with data replication. For the user, this approach is totally invisible: the user can send queries to the database system unchanged. The distributed database system autonomously computes the fragmentation (where the only additional information needed is the clustering backed by a taxonomy specific to the domain of the anti-instantiation column) and can use an automatic data replication mechanism that relies on the size information of each fragment and generates a bin packing input for an Integer Linear Programming (ILP) solver. Like most of the related approaches, we assume a static dataset with mostly read-only accesses. When receiving a user query, the database system can autonomously rewrite the query and redirect subqueries to the appropriate servers based on the maintenance of a root table. The proposed method hence offers novel self-management and self-configuration techniques for a user-friendly query handling. While the user provides the original table and the desired similarity threshold as input, the database system can autonomously distribute the data while minimizing the number of database servers. Hence we see our approach as a first step towards an intelligent cloud database system. For full applicability in a cloud database, automatic reconfiguration after updates, failure-tolerance as well as parallelization of our clustering approach (for example with map-reduce) will be necessary; these topics will be handled in future work.

The work presented in this paper can be extended in various research directions. We give a brief discussion of possible extensions.

So far, the fragmentation process is only centered around a single relaxation attribute. The current approach can of course be executed in parallel for several relaxation attributes (with separate fragmentations and root tables for each relaxation attribute); however, this will lead to a massive (possibly unnecessary) replication of the data. We are currently investigating a more fine-grained support for multiple relaxation attributes with a more sophisticated data replication approach that can also be stated as a bin packing problem.
In order to have a full-blown distributed flexible query answering system, the interaction of the proposed fragmentation with other generalization operators (like dropping condition and goal replacement) must be elaborated.
When multiple fragments are assigned to one server, data locality can be improved by assigning fragments that are semantically close to each other to the same server.
Our main focus for future work is to study the effect of updates on data (deletions and insertions) in the fragments: it must be studied in detail how fragments can be reconfigured and probably migrated to other servers without incurring too much transfer cost.

Regarding the update problem, we plan to apply a special optimization approach to database replication: the notion of recovery robust optimization [32] describes optimization methods that compute a solution that can later on adapt to changing conditions. So far, these methods have been used mostly for timetabling applications [33], job sequencing [8] or telecommunication networks [34]. In this respect it is important to ensure a worst case guarantee as in [8]. This approach differs from the one presented here, and its implications will hence be the topic of future work.

Acknowledgements

I would like to thank Anita Schöbel and Marie Schmidt for helpful discussions and Florian Henke for implementation and evaluation of the prototype. I acknowledge support by the German Research Foundation and the Open Access Publication Funds of Göttingen University.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Competing interests

The author declares that she has no competing interests.


References

1. Özsu MT, Valduriez P: Principles of distributed database systems, Third Edition. Springer, Berlin/Heidelberg; 2011.

2. Wiese L: Taxonomy-based fragmentation for anti-instantiation in distributed databases. In 3rd International Workshop on Intelligent Techniques and Architectures for Autonomic Clouds (ITAAC’13) collocated with IEEE/ACM 6th international conference on utility and cloud computing. IEEE, Washington, DC; 2013:363–368.

3. Gaasterland T, Godfrey P, Minker J: Relaxation as a platform for cooperative answering. JIIS 1992, 1(3/4):293–321.

4. Inoue K, Wiese L: Generalizing conjunctive queries for informative answers. In Flexible query answering systems. Springer, Berlin/Heidelberg; 2011:1–12. 10.1007/978-3-642-24764-4_1

5. Michalski RS: A theory and methodology of inductive learning. Artif Intell 1983, 20(2):111–161. 10.1016/0004-3702(83)90016-4

6. Loukopoulos T, Ahmad I: Static and adaptive distributed data replication using genetic algorithms. J Parallel Distributed Comput 2004, 64(11):1270–1285. 10.1016/j.jpdc.2004.04.005

7. Gendreau M, Laporte G, Semet F: Heuristics and lower bounds for the bin packing problem with conflicts. Comput OR 2004, 31(3):347–358. 10.1016/S0305-0548(02)00195-8

8. Epstein L, Levin A, Marchetti-Spaccamela A, Megow N, Mestre J, Skutella M, Stougie L: Universal sequencing on a single machine. In Integer programming and combinatorial optimization. Springer, Berlin/Heidelberg; 2010:230–243. 10.1007/978-3-642-13036-6_18

9. Sadykov R, Vanderbeck F: Bin packing with conflicts: a generic branch-and-price algorithm. INFORMS J Comput 2013, 25(2):244–255. 10.1287/ijoc.1120.0499

10. Gonzalez TF: Clustering to minimize the maximum intercluster distance. Theor Comput Sci 1985, 38:293–306. doi:10.1016/0304-3975(85)90224-5

11. Rieck K, Trinius P, Willems C, Holz T: Automatic analysis of malware behavior using machine learning. J Comput Secur 2011, 19(4):639–668.

12. Tatarowicz A, Curino C, Jones EPC, Madden S: Lookup tables: fine-grained partitioning for distributed databases. In IEEE 28th International Conference on Data Engineering (ICDE 2012). Edited by: Kementsietsidis A, Salles MAV. IEEE Computer Society, Washington, DC; 2012:102–113. doi:10.1109/ICDE.2012.26

13. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE: Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst 2008, 26(2):4:1–4:26. doi:10.1145/1365815.1365816

14. Curino C, Zhang Y, Jones EPC, Madden S: Schism: a workload-driven approach to database replication and partitioning. Proc VLDB Endowment 2010, 3(1):48–57. doi:10.14778/1920841.1920853

15. McInnes BT, Pedersen T, Pakhomov SVS, Liu Y, Melton-Meaux G: UMLS::Similarity: measuring the relatedness and similarity of biomedical concepts. In Human language technologies: conference of the North American chapter of the association of computational linguistics. Edited by: Vanderwende L, Daumé H III, Kirchhoff K. The Association for Computational Linguistics, Stroudsburg; 2013:28–31.

16. Resnik P: Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language. J Artif Intell Res (JAIR) 1999, 11:95–130.

17. Leacock C, Chodorow M: Combining local context and WordNet similarity for word sense identification. WordNet: Electron Lexical Database 1998, 49(2):265–283.

18. Wu Z, Palmer MS: Verb semantics and lexical selection. In 32nd annual meeting of the association for computational linguistics. Edited by: Pustejovsky J. Morgan Kaufmann Publishers/ACL, Stroudsburg; 1994:133–138. doi:10.3115/981732.981751

19. Rada R, Mili H, Bicknell E, Blettner M: Development and application of a metric on semantic nets. IEEE Trans Syst Man Cybernet 1989, 19(1):17–30. doi:10.1109/21.24528

20. Caviedes JE, Cimino JJ: Towards the development of a conceptual distance metric for the UMLS. J Biomed Inform 2004, 37(2):77–85. doi:10.1016/j.jbi.2004.02.001

21. Al-Mubaid H, Nguyen HA: New ontology-based semantic similarity measure for the biomedical domain. IEEE, Washington, DC; 2006:623–628.

22. Jiang JJ, Conrath DW: Semantic similarity based on corpus statistics and lexical taxonomy. CoRR 1997, cmp-lg/9709008.

23. Lin D: An information-theoretic definition of similarity. In Proceedings of the fifteenth international conference on machine learning. Edited by: Shavlik JW. Morgan Kaufmann, San Francisco; 1998:296–304.

24. lp_solve. [http://lpsolve.sourceforge.net/]

25. U.S. National Library of Medicine: Medical Subject Headings. [http://www.nlm.nih.gov/mesh/]

26. Chu WW, Yang H, Chiang K, Minock M, Chow G, Larson C: CoBase: a scalable and extensible cooperative information system. JIIS 1996, 6(2/3):223–259.

27. Shin MK, Huh S-Y, Lee W: Providing ranked cooperative query answers using the metricized knowledge abstraction hierarchy. Expert Syst Appl 2007, 32(2):469–484. doi:10.1016/j.eswa.2005.12.016

28. Halder R, Cortesi A: Cooperative query answering by abstract interpretation. In SOFSEM 2011. LNCS, vol. 6543. Springer, Berlin/Heidelberg; 2011:284–296.

29. Muslea I: Machine learning for online query relaxation. In Knowledge Discovery and Data Mining (KDD). ACM, New York; 2004:246–255.

30. Hill J, Torson J, Guo B, Chen Z: Toward ontology-guided knowledge-driven XML query relaxation. In Computational Intelligence, Modelling and Simulation (CIMSiM). IEEE, Washington, DC; 2010:448–453.

31. Quamar A, Kumar KA, Deshpande A: Sword: scalable workload-aware data placement for transactional workloads. In Joint 2013 EDBT/ICDT conferences. Edited by: Guerrini G, Paton NW. ACM, New York; 2013:430–441.

32. Barber F, Salido MA: Robustness, stability, recoverability, and reliability in constraint satisfaction problems. Knowl Inf Syst 2014, 41(2):1–16.

33. Liebchen C, Lübbecke M, Möhring R, Stiller S: The concept of recoverable robustness, linear programming recovery, and railway applications. In Robust and online large-scale optimization. Springer, Berlin/Heidelberg; 2009:1–27. doi:10.1007/978-3-642-05465-5_1

34. Büsing C, Koster AM, Kutschka M: Recoverable robust knapsacks: the discrete scenario case. Optimization Lett 2011, 5(3):379–392. doi:10.1007/s11590-011-0307-1

Title: Clustering-based fragmentation and data replication for flexible query answering in distributed databases
Author: Lena Wiese
Publication date: 1 December 2014
Publisher: Springer Berlin Heidelberg
Published in: Journal of Cloud Computing, Issue 1/2014
Electronic ISSN: 2192-113X
DOI: https://doi.org/10.1186/s13677-014-0018-0
