Journal of the Brazilian Computer Society

Open Access 01.12.2015 | Research

ApproxMap - a method for mapping blank nodes in RDF datasets

Authors: Juliano de Almeida Monte-Mor, Adilson Marques da Cunha

Published in: Journal of the Brazilian Computer Society | Issue 1/2015

Abstract

Background

Versioning has proven to be essential in areas like software development or data and knowledge management. For systems or applications making use of documents formatted according to the Resource Description Framework (RDF) standard, it is difficult to calculate the difference between two versions, owing to the presence of blank nodes, also known as bnodes in RDF graphs. These are anonymous nodes that can assume different identifiers between versions. In this case, the challenge lies in finding a mapping between the sets of blank nodes in the two versions while minimizing the operations needed to convert one version into another.

Methods

Within this context, we propose an algorithm, named ApproxMap, for mapping bnodes based on extended concepts of rough set theory, which provides a way to measure the proximity of bnodes and map them with closer approximations. Our heuristic method considers various strategies for reducing both the number of comparisons between blank nodes and the delta between the compared versions. The proposed algorithm has a worst-case time complexity of $O(n^2)$.

Results

ApproxMap showed satisfactory performance in our groups of experiments, as the algorithm that obtained solutions closest to the optimal values. This algorithm succeeded in finding the optimal delta size in 59% of the tests involving optimal values. ApproxMap achieved a delta size smaller than or equal to those of existing algorithms in at least 95% of the tested cases.

Conclusions

The results show that the proposed algorithm can be successfully applied to versioning RDF documents, such as those produced by software processes with iterative and incremental development. We recommend applying ApproxMap in various situations, particularly those involving similar versions and directly connected bnodes.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JAM developed algorithms, performed experiments, and drafted the manuscript. AMC participated in design and coordination of the study. Both authors read and approved the final manuscript.

Background

In areas such as software engineering, databases, and Web publishing, methods for versioning have already been developed and successfully applied. These methods must be able to calculate the differences (i.e., deltas) between versions to provide efficient storage of subsequent versions.

Particularly in software engineering, versioning algorithms are usually based on a comparison of text lines. However, these methods are not suitable to control versions of structured or semi-structured documents. In this article, we focus specifically on the version control of documents following a Semantic Web standard, the Resource Description Framework (RDF) [1]. We have applied Semantic Web technologies in the software configuration management (SCM) domain [2].

RDF defines a basic data model for writing simple statements about Web objects or resources. It allows the definition of sentences through ‘subject-predicate-object’ triples; that is, a resource, a property, and a value (which can be a literal or a resource). An RDF triple, like a graph’s edge, provides a binary relationship (predicate) that relates a subject to an object. Thus, an RDF document or dataset can be represented by a directed graph [3].

The conventional line-oriented mechanisms in software engineering are insufficient in the Semantic Web context because their deltas are based on unique serializations, which do not occur naturally in RDF datasets [4]. These bases usually consist of unordered collections of affirmations about resources; however, even when a standard serialization order is imposed (e.g., by sorting), existing comparison tools fail to consider knowledge inferred from schemas associated with RDF datasets [5].

Thus, to obtain the delta between two versions of an RDF dataset, we need to map the nodes in the graphs representing these versions. However, the main problem encountered during calculation of the delta concerns the existence of anonymous nodes (i.e., blank nodes or bnodes) in the RDF graphs. Bnodes represent resources that are not identified by a uniform resource identifier (URI) or literals. In this case, the mapping between bnodes contained in different graph versions directly influences the size of the deltas.

As the scope of identifying bnodes is only local, it is a challenge to find a mapping between bnodes in two versions resulting in the smallest possible delta. Tzitzikas et al. [6] showed that the problem of finding the optimal mapping is NP-hard in the general case and polynomial in the case where bnodes are not directly connected. To illustrate this problem, consider two versions of a dataset as shown in the example proposed by Tzitzikas et al. in Figure 1.

First, we can easily map bnodes ‘_:3’, ‘_:4’, and ‘_:5’ to bnodes ‘_:8’, ‘_:10’, and ‘_:9’, respectively. Then, by mapping bnode pairs ‘_:1’ and ‘_:6’ and ‘_:2’ and ‘_:7’, which seems to be a natural choice, we obtain a delta consisting of four triples. In other words, transforming the first graph into the second requires removing triples ‘ <_:1,friend, _:4 >’ and ‘ <_:2,friend, _:5 >’ and adding triples ‘ <_:1,friend, _:5 >’ and ‘ <_:2,friend, _:4 >’. However, if we were to map bnode ‘_:1’ to ‘_:7’ and ‘_:2’ to ‘_:6’, we would have a delta consisting of two triples; that is, triple ‘ <_:1,brother, _:3 >’ must be removed and triple ‘ <_:2,brother, _:3 >’ added. The latter mapping is better, owing to the smaller delta size. In the case of directly connected bnodes, we believe that a mapping based on a bottom-up strategy, where nodes in the lower levels are mapped before those in the upper levels, can help reduce the delta size.

During bnode mapping, we need to address inaccuracies between the modified bnodes. To facilitate the handling of this imprecision, we chose to extend some concepts of rough set theory (RST) [7]. RST has already been successfully applied in several areas like artificial intelligence and cognitive sciences. Nicoletti et al. [8] presented the following application examples: creation of machine learning methods, knowledge representation, inductive reasoning, data mining, processing of imperfect or incomplete information, pattern recognition, and discovery of knowledge in databases.

In this context, our approach proposes a heuristic method for mapping blank nodes based on RST. This theory serves as the conceptual basis for the definition of metrics to assist in the choice of bnode pairs, providing the necessary support to map a bnode to the candidate with the closest approximation. Our main objective is to create an algorithm that can be successfully applied in software project versioning.

The remainder of this article is organized as follows: ‘Related work’ subsection gives an overview of existing work on calculating deltas and mapping bnodes. In ‘Problem description’ subsection, we formally describe the problem addressed in this work, while ‘Rough set theory’ discusses some basic concepts of RST. In ‘Blank nodes as rough sets’, we define a bnode representation model using rough sets, which is necessary for specifying the proposed mapping algorithm in ‘The ApproxMap method’ section. ‘Results and discussion’ discusses some experimental results, while ‘Conclusions’ presents our conclusions and recommendations.

Related work

Particularly in the software engineering domain, relatively little effort has been made to develop methods for obtaining a better blank node mapping between two versions, by reducing their delta size. Next, we briefly describe some studies on RDF dataset versioning, explaining how they handle blank nodes.

Berners-Lee and Connolly [4] discussed comparing RDF graphs and updating a graph from a calculated set of differences. They emphasized that the order and identification of bnodes can differ arbitrarily with serializations of the same graph. Hence, calculating deltas based on line-oriented approaches is a problem. Computing the differences between two graphs is simple and straightforward if all nodes are named. However, when not all bnodes are named, finding the largest common subgraph becomes an instance of the graph isomorphism problem. The authors further suggested that available solutions for the general isomorphism problem do not appear to be good matches for practical cases. Thus, they proposed an algorithm that produces an RDF difference only for graphs named directly with URIs or indirectly with functional or inverse functional properties. We extend their approach by performing the mapping considering unnamed nodes as well.

Carroll [9] showed that standard algorithms for graph isomorphism can be used to compare RDF graphs. He developed an algorithm considering an iterative vertex classification, used in his RDF toolkit Jena, where each anonymous resource is identified based on the statements in which it appears. Thus, bnodes receive identifiers considering their local contexts, which can change between different versions. In our approach, although we do not produce identifiers for bnodes, we also consider the triples in which they appear to classify approximations between bnode pairs.

Noy et al. [10-12] presented an algorithm, called PromptDiff, which combines different heuristic matchers to map RDF graphs by comparing structural properties of the ontology versions. New matchers, which may be needed to compare anonymous classes, can easily be added. The authors considered two observations when comparing versions from the same ontology: a large proportion of the frames remain unchanged between versions; and if two frames have the same type and name (or a very similar name), they are almost certainly copies of one another. We follow the first observation, by first mapping equivalent bnodes. We also include some heuristic strategies in the design of our method.

Auer and Herre [13] suggested a framework to support versioning and the evolution of RDF knowledge bases. Their framework is based on atomic changes, including the addition or removal of RDF graphs statements. Atomic changes encompass all statements containing bnodes in a delta, where the graph is atomic if it cannot be split into two nonempty graphs with disjoint blank nodes. In contrast to our approach, because Auer and Herre did not aim to find a mapping between bnodes, there was no commitment to obtain the smallest delta.

Voelkel and Groza [14] showed a versioning approach, called SemVersion, which provides structural and semantic versioning for models in RDF/S and OWL. In their approach, bnodes were given unique identifiers in all versions. To identify equal blank nodes across models, they proposed a method for blank node enrichment, where URIs are attached as inverse functional properties to blank nodes. However, this means that blank nodes with different identifiers cannot be mapped, even if they represent the same element in different versions. Moreover, in our approach, we do not add any information to the datasets and do not consider unique identifiers for bnodes in different versions.

Cassidy and Ballantine [15] and Im et al. [16] presented versioning models for RDF repositories. They provided a collaborative annotation facility to develop and share annotations over the Web. Im et al. proposed a version framework for an RDF data model based on relational databases. None of these authors, however, considered blank nodes in their research or defined any method for mapping bnodes, as we do in our approach. These researchers addressed only procedures enabling versioning in RDF repositories.

By considering deltas as sets of change operations, Zeginis et al. [5,17] described various comparison functions, together with the semantics of primitive change operations, and formally analyzed their possible combinations in terms of correctness, minimality, semantic identity, and redundancy properties. Assuming $\text{Add}(t)$ and $\text{Del}(t)$ are, respectively, the straightforward addition and deletion of triple $t$ from set $\text{Triples}(K)$, then, in our approach, we adopt the differential function $\Delta_e$ (where $e$ stands for explicit) for two dataset versions $K$ and $K'$, defined by Zeginis et al. as:

$$ {}\Delta_{e}(K,K') = \left\{ \text{Add}(t) | t \in K' - K \right\} \cup \left\{ \text{Del}(t) | t \in K - K' \right\}. $$

(1)
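As a minimal sketch of Equation 1, the explicit delta reduces to two set differences. The function name `delta_e`, the tuple encoding of triples, and the `("Add", t)` / `("Del", t)` tags are illustrative assumptions, not part of the cited work.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]
Change = Tuple[str, Triple]


def delta_e(k_old: Set[Triple], k_new: Set[Triple]) -> Set[Change]:
    """Explicit delta: Add(t) for t in K' - K and Del(t) for t in K - K'."""
    additions = {("Add", t) for t in k_new - k_old}
    deletions = {("Del", t) for t in k_old - k_new}
    return additions | deletions


# Hypothetical two-triple versions that differ in one literal value.
k_old = {("ex:alice", "ex:knows", "ex:bob"), ("ex:alice", "ex:age", "30")}
k_new = {("ex:alice", "ex:knows", "ex:bob"), ("ex:alice", "ex:age", "31")}
delta = delta_e(k_old, k_new)
```

Here `delta` holds one addition and one deletion. Note that this plain set difference treats blank node labels as ordinary names, which is exactly why a bnode mapping is needed before the delta is computed.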

Tzitzikas et al. [6] proposed two polynomial time algorithms for mapping bnodes between two knowledge bases. Seeking to reduce the size of the resulting delta, the authors modeled the problem of bnode mapping as an assignment problem and used the Hungarian method [18], $Alg_{Hung}$, to solve it. This method seeks the optimal solution with time complexity $O(n^3)$.

$Alg_{Hung}$ obtains the optimal delta if the considered knowledge bases do not have interconnected bnodes. In the case where the datasets have directly connected bnodes, the authors assume that all neighboring bnodes are equal during mapping. This method cannot be applied to larger knowledge bases owing to its quadratic space requirement in terms of RAM [6].

These authors also proposed a faster signature-based method, called $Alg_{Sign}$, for comparing large knowledge bases with time complexity $O(n \log n)$. For each bnode, $Alg_{Sign}$ produces a string based on its direct neighborhood as the bnode’s signature. Thereafter, the mapping phase compares the generated strings, sorted lexicographically to allow a binary search. The cost of reducing the mapping time is a probable increase in the delta size [6].

Through experiments, Tzitzikas et al. verified that their algorithms obtain deltas with large sizes if the number of directly connected bnodes is high. In this case, once the direct neighborhoods lose their discrimination ability, the delta reduction potential becomes more unstable [6].

Because the number of directly connected bnodes affects the results of both $Alg_{Hung}$ and $Alg_{Sign}$, we propose a greedy method with a different strategy: neighboring bnodes are treated as different nodes until they have been mapped in a previous iteration. Our proposal aims to develop a method with lower memory overhead than the $Alg_{Hung}$ algorithm, while reducing the probable increase in delta size when compared with $Alg_{Sign}$.

Research performed before that of Tzitzikas et al. [6] did not seek a mapping that reduces the delta between versions. Tzitzikas et al. were the first to address the bnode mapping problem as an optimization problem, as described in the next section. Accordingly, their work served as the basis for implementing our approach, enabling a comparison between our method and their proposed algorithms.

Problem description

In this section, we describe the problem addressed in this article as defined by Tzitzikas et al. [6]. An RDF knowledge base, i.e., an RDF graph, consists of a finite set of RDF triples. Each RDF triple is a tuple $(s, p, o) \in (W \cup B) \times W \times (W \cup B \cup L)$, where $W$ is an infinite set of URIs, $B$ is an infinite set of blank nodes, and $L$ is an infinite set of literals. Assuming $W_k$, $B_k$, and $L_k$ are the sets of URIs, blank nodes, and literals of an RDF graph $G_k$, respectively, the equivalence between two RDF graphs can be defined as follows:

Definition1.

(from [1]) Two RDF graphs $G_1$ and $G_2$ are equivalent if there is a bijection $M$ between the sets of nodes of the two graphs ($N_1$ and $N_2$) such that:

$M(uri) = uri$, for each $uri \in W_1 \cap N_1$;
$M(lit) = lit$, for each $lit \in L_1$;
$M$ maps bnodes to bnodes (i.e., for each $b \in B_1$ it holds that $M(b) \in B_2$); and
triple $(s, p, o)$ is in $G_1$ if, and only if, triple $(M(s), p, M(o))$ is in $G_2$.

Tzitzikas et al. denoted this equivalence between two graphs $G_1$ and $G_2$ as $G_1 \equiv_M G_2$. Moreover, they also defined the edit distance between two nodes as given in Definition 2. From these two definitions, the equivalence between graphs $G_1$ and $G_2$ can be defined as in Theorem 1.

Definition2.

(from [6]) Let $o_1$ and $o_2$ be nodes in $G_1$ and $G_2$, respectively. Suppose a bijection exists between the nodes of these graphs, i.e., function $M: N_1 \rightarrow N_2$ (obviously $|N_1| = |N_2|$). Then, the edit distance between $o_1$ and $o_2$ over $M$, denoted by $\text{dist}_M(o_1, o_2)$, is the number of additions or deletions of triples required to make the ‘direct neighborhoods’ of $o_1$ and $o_2$ the same (that is, where $M$-mapped nodes are the same). Formally:

$$ \begin{aligned} {}\text{dist}_{M}(o_{1}, o_{2}) \,=\,&\; \left|\left\{(o_{1}, p, a) \in G_{1} | (o_{2}, p, M(a)) \notin G_{2}\right\}\right|\\ &+ \left|\left\{(a, p, o_{1}) \in G_{1} | (M(a), p, o_{2}) \notin G_{2}\right\}\right|\\ &+ \left|\left\{(o_{2}, p, a) \in G_{2} | (o_{1}, p, M^{-1}(a)) \notin G_{1}\right\}\right|\\ &+ \left|\left\{(a, p, o_{2}) \in G_{2} | (M^{-1}(a), p, o_{1}) \notin G_{1}\right\}\right|. \end{aligned} $$

(2)
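The four terms of Equation 2 can be sketched directly as four counts over the triples of the two graphs. This is only an illustration: the name `dist_m`, the string encoding of nodes, and the convention that nodes absent from the dictionary `m` (URIs and literals) map to themselves are assumptions made here.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]


def dist_m(o1: str, o2: str, g1: Set[Triple], g2: Set[Triple],
           m: Dict[str, str]) -> int:
    """Edit distance of Equation 2 between o1 (in G1) and o2 (in G2) over M."""
    inverse = {v: k for k, v in m.items()}

    def fwd(node: str) -> str:
        # Assumption: nodes missing from m (URIs, literals) map to themselves.
        return m.get(node, node)

    def bwd(node: str) -> str:
        return inverse.get(node, node)

    # Outgoing and incoming triples of o1 with no counterpart at o2.
    out1 = sum(1 for (s, p, a) in g1 if s == o1 and (o2, p, fwd(a)) not in g2)
    in1 = sum(1 for (a, p, o) in g1 if o == o1 and (fwd(a), p, o2) not in g2)
    # Outgoing and incoming triples of o2 with no counterpart at o1.
    out2 = sum(1 for (s, p, a) in g2 if s == o2 and (o1, p, bwd(a)) not in g1)
    in2 = sum(1 for (a, p, o) in g2 if o == o2 and (bwd(a), p, o1) not in g1)
    return out1 + in1 + out2 + in2


# Hypothetical graphs: the two bnodes agree on "name" but differ on "knows".
g1 = {("_:a", "name", "Ann"), ("_:a", "knows", "ex:bob")}
g2 = {("_:x", "name", "Ann"), ("_:x", "knows", "ex:carol")}
distance = dist_m("_:a", "_:x", g1, g2, {"_:a": "_:x"})
```

In this example `distance` is 2: one triple must be deleted from the neighborhood of `_:a` and one added to make it match that of `_:x`.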

Theorem1.

(from [6])

$$ G_{1} \equiv_{M} G_{2} \Leftrightarrow {dist}_{M}(o, M(o)) = 0 \text{~for each}\, o \in N_{1} $$

(3)

In the case of versioning, current interest lies in non-equivalent knowledge bases. In this case, it is necessary to find a mapping between the bnodes of the two knowledge bases, $B_1$ and $B_2$, that reduces the delta resulting from their comparison.

In this regard, Tzitzikas et al. formulated finding this mapping as an optimization problem: given $n_1 = |B_1|$, $n_2 = |B_2|$, and $n = \min(n_1, n_2)$, the goal is to find the unknown part of bijection $M$. First, $M$ contains the mapping of all URIs and literals of the knowledge bases (according to Definition 1). Assuming that $n = n_1 < n_2$, $\Im$ denotes the set of all possible bijections between $B_1$ and the subsets of $B_2$ comprising $n$ elements. Consequently, the number of candidate solutions (i.e., $|\Im|$) is exponential. Given the objective of finding a bijection $M \in \Im$ that reduces the size of the delta, they defined the cost of bijection $M$ by Equation 4. From Definition 3, Tzitzikas et al. described the equivalence between two graphs $G_1$ and $G_2$ according to the mapping cost presented in Theorem 2.

$$ \text{Cost}(M) = \sum_{b_{1} \in B_{1}} \text{dist}_{M}(b_{1}, M(b_{1})) $$

(4)

Definition3.

(from [6]) The best solution (or solutions) is the bijection with the minimal cost. Considering that $\arg_M$ returns the bijection $M \in \Im$ with the minimum cost, we have:

$$ M_{\text{sol}} = {arg}_{M} \min_{M \in \Im} (\text{Cost}(M)). $$

(5)
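To make Equations 4 and 5 concrete, the following sketch enumerates every candidate mapping and keeps the cheapest one. It is an exhaustive search for tiny inputs only, not one of the algorithms discussed in this article; `cost`, `best_mapping`, and the compact `dist_m` helper are hypothetical names, and nodes absent from the mapping are assumed to map to themselves.

```python
from itertools import permutations
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]


def dist_m(o1: str, o2: str, g1: Set[Triple], g2: Set[Triple],
           m: Dict[str, str]) -> int:
    """Compact form of Equation 2; unmapped nodes are their own image."""
    inv = {v: k for k, v in m.items()}
    d = sum(1 for (s, p, a) in g1 if s == o1 and (o2, p, m.get(a, a)) not in g2)
    d += sum(1 for (a, p, o) in g1 if o == o1 and (m.get(a, a), p, o2) not in g2)
    d += sum(1 for (s, p, a) in g2 if s == o2 and (o1, p, inv.get(a, a)) not in g1)
    d += sum(1 for (a, p, o) in g2 if o == o2 and (inv.get(a, a), p, o1) not in g1)
    return d


def cost(b1: Set[str], g1: Set[Triple], g2: Set[Triple],
         m: Dict[str, str]) -> int:
    """Equation 4: sum of the edit distances of all mapped bnode pairs."""
    return sum(dist_m(b, m[b], g1, g2, m) for b in b1)


def best_mapping(b1: Set[str], b2: Set[str], g1: Set[Triple],
                 g2: Set[Triple]) -> Tuple[Dict[str, str], int]:
    """Equation 5 by brute force; assumes |B1| <= |B2| and a nonempty B1."""
    sources = sorted(b1)
    best, best_cost = {}, -1
    for image in permutations(sorted(b2), len(sources)):
        m = dict(zip(sources, image))
        c = cost(b1, g1, g2, m)
        if best_cost < 0 or c < best_cost:
            best, best_cost = m, c
    return best, best_cost


# Hypothetical data: two named persons, one of whom gains an "age" triple.
g1 = {("_:1", "name", "Ann"), ("_:2", "name", "Bob")}
g2 = {("_:6", "name", "Bob"), ("_:7", "name", "Ann"), ("_:7", "age", "30")}
mapping, mapping_cost = best_mapping({"_:1", "_:2"}, {"_:6", "_:7"}, g1, g2)
```

The search picks `_:1 → _:7` and `_:2 → _:6` with cost 1, against cost 5 for the other candidate. Because the number of candidates grows factorially with the number of bnodes, this enumeration only serves to illustrate the objective that heuristic methods approximate.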

Theorem2.

(from [6])

$$ \text{If}\; G_{1} \equiv_{M_{\text{sol}}} G_{2}, \;\text{then}\; \text{Cost}(M_{\text{sol}}) = 0. $$

(6)

Therefore, considering the context of this problem described by Tzitzikas et al., we propose a greedy method that seeks to reduce the delta size between two RDF graphs, obtaining an approximate solution to the bijection between the bnodes of these RDF graphs. For this purpose, we define some metrics extending various concepts of RST. In the next section, we present some basic concepts of this theory, which are considered in the design of our algorithm.

Rough set theory

RST is an extension of set theory, consisting of a mathematical model for uncertainty and imprecision handling, knowledge representation, and rough classification. The main advantage of using RST is that it does not require any preliminary or additional information about the data, such as a probability distribution or membership degree.

In our approach, we adopt RST as the formalism for dealing with imprecision resulting from the comparison of bnode pairs. RST also forms the conceptual basis of defining metrics for measuring the closeness between bnode pairs. Our method aims to map the closest bnode pairs in an attempt to reduce the delta size. Next, we present the main concepts of this theory, extracted from [7,19].

Basic concepts

Let U be a finite, nonempty, universe set of objects. In set U, we can define subsets using the equivalence relation R, called the indiscernibility relation. Relation R induces a partition (and consequently, classification) of the objects in U. Thus, an approximation space consists of an ordered pair A=(U,R), where given x,y∈U, if xRy then x and y are indiscernible in A. The equivalence class defined by x is the same as that defined by y, i.e., [x]_R=[y]_R.

Elementary sets correspond to equivalence classes induced by $R$ in $U$. A partition of $U$ by $R$, denoted by $U/R$, can be viewed as the set $\tilde{R} = U/R = \{E_{1}, E_{2}, \ldots, E_{n}\}$, where each $E_i$, with $1 \leq i \leq n$, is an elementary set of $A$. It is assumed that the empty set $\emptyset$ is an elementary set of all approximation spaces $A$. Given an approximation space $A = (U, R)$, let $X \subseteq U$ be any subset of $U$; then, using the following concepts, we can check how well $X$ is represented by the elementary sets of $A$:

Lower approximation of X in A - formed by the union of all elementary sets of A fully contained in X, i.e., the largest definable set in A contained in X:
$$ A_{\text{inf}}(X) = \left\{x \in U | [x]_{R} \subseteq X\right\}. $$

(7)
Upper approximation of X in A - formed by the union of all elementary sets of A having a nonempty intersection with X, i.e., the smallest definable set in A containing X:
$$ A_{\text{sup}}(X) = \left\{x \in U | [x]_{R} \cap X \neq \emptyset\right\}. $$

(8)

Thus, the lower approximation of $X$ in $A$ contains those elements in $U$ that can definitely be affirmed as belonging to $X$. Furthermore, the upper approximation of $X$ in $A$ covers both those elements that definitely belong to $X$ and those that cannot definitely be excluded from $X$. In many cases, set $X$ may be a finite union of elementary sets, which characterizes $X$ as a definable set in $A$. This implies that $A_{\text{sup}}(X) = A_{\text{inf}}(X) = X$. Besides, based on a rough classification of set $X \subseteq U$, we can identify the following regions in approximation space $A = (U, R)$:

Positive region of X in A - formed by the union of all elementary sets of U fully contained in X:
$$ \text{pos}\,(X) = A_{\text{inf}}\left(X\right). $$

(9)
Negative region of X in A - formed by the elementary sets of U that have no elements in X:
$$ \text{neg}\;(X) = U - A_{\text{sup}}(X). $$

(10)
Doubtful region of X in A - also called the boundary of X, formed by the elementary sets of U that belong to the upper approximation, but do not belong to the lower approximation. The membership of an element of this region to set X is uncertain, based only on the equivalence classes of A:
$$ \text{duv}\,(X) = A_{\text{sup}}(X) - A_{\text{inf}}\,(X). $$

(11)
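The approximations of Equations 7 and 8 and the three regions of Equations 9 to 11 can be sketched over an explicit partition. The helper names and the toy universe below are illustrative assumptions; the partition is given directly as a list of elementary sets rather than derived from a relation.

```python
from typing import Dict, FrozenSet, List, Set


def lower_approximation(partition: List[FrozenSet[int]], x: Set[int]) -> Set[int]:
    """Union of the elementary sets fully contained in X (Equation 7)."""
    return {e for block in partition if block <= x for e in block}


def upper_approximation(partition: List[FrozenSet[int]], x: Set[int]) -> Set[int]:
    """Union of the elementary sets that intersect X (Equation 8)."""
    return {e for block in partition if block & x for e in block}


def regions(universe: Set[int], partition: List[FrozenSet[int]],
            x: Set[int]) -> Dict[str, Set[int]]:
    """Positive, negative, and doubtful regions of X (Equations 9 to 11)."""
    low = lower_approximation(partition, x)
    up = upper_approximation(partition, x)
    return {"pos": low, "neg": universe - up, "duv": up - low}


# Toy approximation space: six objects in three elementary sets.
universe = {1, 2, 3, 4, 5, 6}
partition = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]
x = {1, 2, 3}
result = regions(universe, partition, x)
```

For this input, the class {1, 2} is positive, {5, 6} is negative, and {3, 4} is doubtful because only element 3 belongs to X.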

The positive region has all elements of U that definitely belong to X. The negative region comprises all elements that definitely do not belong to X. Finally, the doubtful region includes those elements of U whose membership of X cannot definitely be determined. Figure 2 illustrates the main concepts of RST.

Some RST measures

RST provides several measures (e.g., accuracy and a discriminant index) for checking how well a set $X \subseteq U$ can be represented in approximation space $A = (U, R)$ [7,8,19,20]. In the design of the proposed mapping method, we consider the following RST metrics:

Internal measure of X in A
$$ \varpi_{\text{Ainf}}\,(X) = \left| A_{\text{inf}}\,(X)\right| $$

(12)
External measure of X in A
$$ \varpi_{\text{Asup}}(X) = \left| A_{\text{sup}}(X)\right| $$

(13)
Quality of the lower approximation of X in A
$$ \gamma_{\text{Ainf}}(X) = \frac{\varpi_{\text{Ainf}}(X)}{\left|U\right|} = \frac{\left| A_{\text{inf}}(X)\right|}{\left|U\right|} $$

(14)
Quality of the upper approximation of X in A

$$ \gamma_{\text{Asup}}(X) = \frac{\varpi_{\text{Asup}}(X)}{\left|U\right|} = \frac{\left| A_{\text{sup}}(X)\right|}{\left|U\right|} $$

(15)
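As a small worked instance of Equations 12 to 15, the sketch below computes the four measures for a toy approximation space. The function names are hypothetical, and exact fractions are used instead of floating-point numbers only to keep the ratios easy to check.

```python
from fractions import Fraction
from typing import Dict, FrozenSet, List, Set


def lower_approximation(partition: List[FrozenSet[int]], x: Set[int]) -> Set[int]:
    """Union of the elementary sets fully contained in X."""
    return {e for block in partition if block <= x for e in block}


def upper_approximation(partition: List[FrozenSet[int]], x: Set[int]) -> Set[int]:
    """Union of the elementary sets that intersect X."""
    return {e for block in partition if block & x for e in block}


def rst_measures(universe: Set[int], partition: List[FrozenSet[int]],
                 x: Set[int]) -> Dict[str, Fraction]:
    """Internal and external measures and both approximation qualities."""
    internal = len(lower_approximation(partition, x))
    external = len(upper_approximation(partition, x))
    return {
        "internal": Fraction(internal),
        "external": Fraction(external),
        "quality_lower": Fraction(internal, len(universe)),
        "quality_upper": Fraction(external, len(universe)),
    }


# Toy approximation space: six objects in three elementary sets.
universe = {1, 2, 3, 4, 5, 6}
partition = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]
measures = rst_measures(universe, partition, {1, 2, 3})
```

With X = {1, 2, 3}, two elements certainly belong to X and four possibly do, so the qualities of the lower and upper approximations are 1/3 and 2/3.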

The internal measure is the number of elements in $A$ that definitely belong to $X$, while the external measure indicates the number of elements that could belong to $X$. The metrics for quality of the lower and upper approximations present these measures as fractions of the total number of elements in $A$. In particular, we extended $\gamma_{\text{Ainf}}(X)$ and $\gamma_{\text{Asup}}(X)$ in the design of our mapping algorithm. As future work, we intend to evaluate the adoption of other RST metrics. In the next section, we describe how bnodes can be modeled as approximate sets in an approximation space.

Methods

We adopted RST in our approach as the basis on which to build a heuristic method to reduce the size of the delta found in the mapping between RDF graphs. To achieve this goal, we must first model the bnodes as sets in an approximation space. The steps required for this transformation are explained below.

Blank nodes as rough sets

Considering set $B$ containing the bnodes of an RDF graph $G$, Equation 16 defines a subgraph $G_i \subseteq G$ that contains only triples involving a given bnode $b_i \in B$. The negative sign ($-$) is used to indicate a reverse link in graph $G_i$, i.e., a triple $(s, p, o)$ in which $b_i$ is the object $o$. Thus, $-W$ is the set consisting of all elements of $W$, preceded by a negative sign.

$$ G_{i} = \left\{ (s, p, o) | (s, p, o) \in G \wedge \left(s = b_{i} \vee o = b_{i}\right) \right\} $$

(16)

In addition, outgoing links of $b_i$ refer to the links represented by triples in the format $(b_i, p, o) \in G_i$, where $o \neq b_i$. Similarly, we adopt the expression inbound links of $b_i$ to refer to triples in the format $(s, p, b_i) \in G_i$, where $s \neq b_i$. Last, we use the symbol ‘$\sigma$’ to denote connections with bnode $b_i$ itself, called $b_i$ recursive links. Thus, to build a set $X_i$ representing bnode $b_i$, we need to transform the triples of $G_i$ using function $S_{b_{i}}: G_{i} \rightarrow (W \cup -W) \times (W \cup B \cup L \cup \{\text{`}\sigma\text{'}\})$:

$$ S_{b_{i}}(s, p, o) = \left\{ \begin{array}{cl} (p, o), & \text{if}\; s = b_{i} \neq o\\ (-p, s), & \text{if}\; s \neq b_{i} = o\\ (p, \mathrm{`}\sigma\text{'}), & \text{if}\; s = b_{i} = o. \end{array} \right. $$

(17)

Function $S_{b_{i}}(s, p, o)$ returns an ordered pair $(l, n)$, where $n$ represents the node neighboring $b_i$ ($s$ or $o$) or ‘$\sigma$’, and $l$ represents the connection or predicate between $b_i$ and $n$. When $n$ = ‘$\sigma$’, i.e., $S_{b_{i}}(b_{i}, p, b_{i}) = (p, \text{`}\sigma\text{'})$, the literal ‘$\sigma$’ represents a bnode that is automatically mapped by the mapping of $b_i$ itself.

In the case of directly connected bnodes, unlike Tzitzikas et al. who considered all bnodes to be the same, our approach considers all unmapped neighboring bnodes to be the same for inbound links and different for outgoing links. Furthermore, we treat ‘already mapped’ neighbors in the same way as identified nodes (URIs and literals). We can now construct set $X_i$, representing bnode $b_i$, from subgraph $G_i$, corresponding to the image set obtained by applying $S_{b_{i}}$ to all triples of $G_i$:

$$ X_{i} = S_{b_{i}}(G_{i}). $$

(18)
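Equations 16 to 18 can be sketched as a filter followed by a three-case transformation. This is a simplified illustration: reverse links are encoded by prefixing the predicate with `"-"`, the names `s_b` and `bnode_set` are hypothetical, and the special treatment of unmapped neighboring bnodes described above is deliberately left out.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]
Link = Tuple[str, str]

SIGMA = "σ"  # placeholder for a recursive link of the bnode to itself


def s_b(b: str, triple: Triple) -> Link:
    """Equation 17: turn a triple of G_i into a (label, neighbor) pair."""
    s, p, o = triple
    if s == b and o == b:
        return (p, SIGMA)        # recursive link
    if s == b:
        return (p, o)            # outgoing link
    if o == b:
        return ("-" + p, s)      # inbound link, marked as reversed
    raise ValueError("triple does not involve the given bnode")


def bnode_set(b: str, graph: Set[Triple]) -> Set[Link]:
    """Equations 16 and 18: select the triples of b and apply S_b to them."""
    g_i = {t for t in graph if t[0] == b or t[2] == b}
    return {s_b(b, t) for t in g_i}


# Hypothetical graph with outgoing, inbound, recursive, and unrelated triples.
graph = {
    ("_:1", "friend", "_:4"),
    ("_:1", "brother", "_:3"),
    ("ex:ann", "knows", "_:1"),
    ("_:1", "same", "_:1"),
    ("ex:x", "p", "ex:y"),
}
x_1 = bnode_set("_:1", graph)
```

The resulting set `x_1` has four pairs, one per triple that mentions `_:1`; the unrelated triple is dropped by the filter of Equation 16.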

Assuming that $B$ corresponds to the bnode set of RDF graph $G$, our method proposes the construction of an approximation space $A = (U, R)$ considering blank nodes $b_i \in B$. Thus, $U$ refers to the set universe obtained from the union of sets $X_i$, representing all considered bnodes $b_i$:

$$ U = \bigcup X_{i}. $$

(19)

Besides the set universe, for the construction of approximation space $A = (U, R)$, we also need to define an equivalence relation $R$, to partition the universe into equivalence classes. Given both the set universe $U = \bigcup X_{i}$ and set intersection $I = \bigcap X_{i}$, and also two elements $a = (l_a, n_a) \in U$ and $b = (l_b, n_b) \in U$, we define equivalence relation $R$ as:

$$ aRb \Leftrightarrow l_{a} = l_{b} \wedge \left((a, b \in I) \vee (a, b \notin I) \right). $$

(20)
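For a pair of bnode sets, the relation of Equation 20 amounts to grouping elements by their label and by whether they lie in the intersection. The sketch below assumes exactly two sets and uses the hypothetical name `partition_pair`; the example data are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple

Link = Tuple[str, str]


def partition_pair(x_i: Set[Link], x_j: Set[Link]) -> List[FrozenSet[Link]]:
    """Equivalence classes of U = X_i | X_j under the relation of Equation 20."""
    universe = x_i | x_j
    common = x_i & x_j
    classes: Dict[Tuple[str, bool], Set[Link]] = defaultdict(set)
    for element in universe:
        label = element[0]
        # Same class iff same label and same membership in the intersection.
        classes[(label, element in common)].add(element)
    return [frozenset(c) for c in classes.values()]


# Two hypothetical bnodes sharing "name", differing on "friend",
# and each owning one predicate the other lacks.
x_i = {("name", "Ann"), ("friend", "_:4"), ("age", "30")}
x_j = {("name", "Ann"), ("friend", "_:5"), ("city", "Rio")}
classes = partition_pair(x_i, x_j)
```

The five elements of the universe fall into four classes; the two `friend` links end up in the same class because they share a label and neither lies in the intersection.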

Elements of the same class are indiscernible according to relation R. Having defined the approximation space and sets representing bnode pairs in this space, in the next section, we discuss how to extend the RST concepts to provide a measure of the closeness of bnodes.

Extending the RST concepts

Given any two approximation sets $X_i$ and $X_j$ in the approximation space $A_{ij} = (U_{ij}, R)$, we observe the following properties for the intersection of their approximations [7]: $A_{\text{inf}}(X_i) \cap A_{\text{inf}}(X_j) = A_{\text{inf}}(X_i \cap X_j)$ and $A_{\text{sup}}(X_i) \cap A_{\text{sup}}(X_j) \supseteq A_{\text{sup}}(X_i \cap X_j)$. For a more accurate analysis of the approximation of $X_i$ and $X_j$ in $A_{ij}$, we can extend the concepts of positive, doubtful, and negative regions, considering the intersections between their approximations:

Definition4.

Change regions for $X_i$ and $X_j$ in $A_{ij}$

Positive change region - formed by the union of all elementary sets of $U_{ij}$ contained entirely in both $X_i$ and $X_j$:
$$ \text{pos}\left(X_{i}, X_{j}\right) = A_{\text{inf}}\left(X_{i}\right) \cap A_{\text{inf}}\left(X_{j}\right). $$

(21)
Negative change region - formed by elementary sets of $U_{ij}$ that have no elements in $X_i$ or $X_j$:
$$ \text{neg}\left(X_{i}, X_{j}\right) = U_{ij} - \left(A_{\text{sup}}\left(X_{i}\right) \cap A_{\text{sup}}\left(X_{j}\right) \right). $$

(22)
Doubtful change region - formed by elementary sets of $U_{ij}$ partially contained in $X_i$ or $X_j$. In this case, $X_i$ or $X_j$, but not both, may integrally contain elementary sets of $U_{ij}$:
$$ \begin{aligned} {}\! \text{duv}\left(X_{i}, X_{j}\right) \,=\, (A_{\text{sup}}\!\left(X_{i}\right) \!\cap\! A_{\text{sup}}\!\left(X_{j}\right)) \,-\, \left(A_{\text{inf}}\left(X_{i}\right) \!\cap\! A_{\text{inf}}\left(X_{j}\right)\! \right)\!. \end{aligned} $$

(23)
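The change regions of Definition 4 can be sketched by combining the pairwise partition with the usual approximations. All names below are illustrative assumptions, and the example sets are invented; the code follows Equations 21 to 23 as written.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple

Link = Tuple[str, str]


def partition_pair(x_i: Set[Link], x_j: Set[Link]) -> List[FrozenSet[Link]]:
    """Classes of U_ij: same label and same membership in X_i & X_j."""
    common = x_i & x_j
    classes: Dict[Tuple[str, bool], Set[Link]] = defaultdict(set)
    for element in x_i | x_j:
        classes[(element[0], element in common)].add(element)
    return [frozenset(c) for c in classes.values()]


def lower(partition: List[FrozenSet[Link]], x: Set[Link]) -> Set[Link]:
    return {e for block in partition if block <= x for e in block}


def upper(partition: List[FrozenSet[Link]], x: Set[Link]) -> Set[Link]:
    return {e for block in partition if block & x for e in block}


def change_regions(x_i: Set[Link], x_j: Set[Link]) -> Dict[str, Set[Link]]:
    """Positive, negative, and doubtful change regions (Equations 21 to 23)."""
    part = partition_pair(x_i, x_j)
    low = lower(part, x_i) & lower(part, x_j)
    up = upper(part, x_i) & upper(part, x_j)
    return {"pos": low, "neg": (x_i | x_j) - up, "duv": up - low}


# Hypothetical bnodes: same name, different friend, unrelated extra predicates.
x_i = {("name", "Ann"), ("friend", "_:4"), ("age", "30")}
x_j = {("name", "Ann"), ("friend", "_:5"), ("city", "Rio")}
found = change_regions(x_i, x_j)
```

In this example the shared `name` link is positive, the two `friend` links are doubtful (same predicate, different neighbors), and the `age` and `city` links are negative because each occurs in only one bnode.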

The positive change region $\text{pos}(X_i, X_j)$ comprises classes that relate to existing links in both bnodes, with the same neighboring nodes, i.e., these classes contain elements representing equivalent links, considering the mapping between bnodes. Classes contained in the doubtful change region $\text{duv}(X_i, X_j)$ contain elements representing predicates common to the bnodes, but connected to different neighbors, being considered as similar links. They represent change operations on common predicates of bnodes: rename, extend, or reduce. Finally, the negative change region $\text{neg}(X_i, X_j)$ consists of classes that are not found in both bnodes. These classes refer to the addition or removal of bnode predicates, being considered as independent links.

The change regions may provide a way of measuring the approximation between the two sets representing the bnodes. However, before addressing this issue, we analyze some extreme situations involving these regions to clarify their meaning. Initially, considering the case where all elements are in the positive change region, we can rank the bnodes as equivalent in $A_{ij}$, because there are no differences between the bnode predicates, i.e., $(b_{i} \equiv_{A_{ij}} b_{j}) \Leftrightarrow (A_{\text{inf}}(X_{i}) \cap A_{\text{inf}}(X_{j}) = U_{ij})$, where this relationship is denoted by the symbol $\equiv_{A_{ij}}$. Otherwise, if this region is empty, the bnodes have no common connections with the same neighboring nodes (equivalent links), i.e., $A_{\text{inf}}(X_i) \cap A_{\text{inf}}(X_j) = \emptyset$. In this case, analysis of other change regions is necessary.

Regarding the doubtful change region, if all elements fall in this region, the bnodes have similar links with different neighboring nodes, i.e., $(A_{\text{inf}}(X_i) \cap A_{\text{inf}}(X_j) = \emptyset) \wedge (A_{\text{sup}}(X_i) \cap A_{\text{sup}}(X_j) = U_{ij})$. If this region is empty, there are no changes in the predicates common to both bnodes, i.e., $(A_{\text{sup}}(X_i) \cap A_{\text{sup}}(X_j)) - (A_{\text{inf}}(X_i) \cap A_{\text{inf}}(X_j)) = \emptyset$. If the positive and/or doubtful regions are not empty and smaller than the universe, we categorize bnodes as approximated in $A_{ij}$, represented by the symbol $\approx_{A_{ij}}$, because they have predicates in common, i.e., $(b_{i} \approx_{A_{ij}} b_{j}) \Leftrightarrow (\emptyset \neq (A_{\text{sup}}(X_{i}) \cap A_{\text{sup}}(X_{j})) \neq U_{ij})$.

Finally, if all the elements are in the negative change region, we classify the bnodes as distinct in $A_{ij}$, represented by $\neq_{A_{ij}}$, because they have independent links, i.e., $(b_{i} \neq_{A_{ij}} b_{j}) \Leftrightarrow (A_{\text{sup}}(X_{i}) \cap A_{\text{sup}}(X_{j}) = \emptyset)$. On the other hand, if this region is empty, all the connections are common to both bnodes, i.e., $A_{\text{sup}}(X_i) \cap A_{\text{sup}}(X_j) = U_{ij}$.

Therefore, we can evaluate the approximation between bnodes from these change regions. For this purpose, we need to extend the RST measures presented in the ‘Some RST measures’ subsection to measure the approximation between sets $X_{i}$ and $X_{j}$ in $A_{ij}$, by considering the intersection of the approximations of these sets:

Definition 5.

Change measures of X _i and X _j in A _ij

Internal change measure
$$ \varpi_{\text{Ainf}}\left(X_{i}, X_{j}\right) = \left| A_{\text{inf}}\left(X_{i}\right) \cap A_{\text{inf}}\left(X_{j}\right) \right| \tag{24} $$

External change measure
$$ \varpi_{\text{Asup}}\left(X_{i}, X_{j}\right) = \left| A_{\text{sup}}\left(X_{i}\right) \cap A_{\text{sup}}\left(X_{j}\right) \right| \tag{25} $$

Quality of the lower change approximation
$$ \begin{aligned} \gamma_{\text{Ainf}}\left(X_{i}, X_{j}\right) &= \frac{\varpi_{\text{Ainf}}\left(X_{i}, X_{j}\right)}{\left|U\right|}\\ &= \frac{\left| A_{\text{inf}}(X_{i}) \cap A_{\text{inf}}(X_{j}) \right|}{\left|U\right|} \end{aligned} \tag{26} $$

Quality of the upper change approximation
$$ \begin{aligned} \gamma_{\text{Asup}}\left(X_{i}, X_{j}\right) &= \frac{\varpi_{\text{Asup}}\left(X_{i}, X_{j}\right)}{\left|U\right|}\\ &= \frac{\left| A_{\text{sup}}(X_{i}) \cap A_{\text{sup}}(X_{j}) \right|}{\left|U\right|} \end{aligned} \tag{27} $$

Based on the measures given in Definition 5, we redefine the approximation between two bnodes b _i and b _j in Definition 6. γ _Ainf(X _i,X _j) provides a way of measuring the percentage of identical predicates considering the mapping between X _i and X _j, while γ _Asup(X _i,X _j) provides a way of measuring the approximation between the predicates of X _i and X _j.

Definition 6.

Approximation between b _i and b _j in A _ij

$\left (b_{i} \equiv _{A_{\textit {ij}}} b_{j}\right) \Leftrightarrow \left (\gamma _{\text {Ainf}}(X_{i}, X_{j}) = 1\right)$;
$\left (b_{i} \approx _{A_{\textit {ij}}} b_{j}\right) \Leftrightarrow \left (0 < \gamma _{\text {Asup}}\left (X_{i}, X_{j}\right) < 1\right)$;
$\left (b_{i} \neq _{A_{\textit {ij}}} b_{j}\right) \Leftrightarrow \left (\gamma _{\text {Asup}}\left (X_{i}, X_{j}\right) = 0\right)$.

Exemplifying the modeling

To illustrate the construction of sets in an approximation space representing bnodes of RDF graphs, suppose we need to map a blank node modified in two subsequent versions G ₁ and G ₂ of a dataset. Figure 3 presents graphs representing the first pair of candidates. Figure 4 shows the positive, negative, and doubtful regions of sets X ₁ and X ₂ in the approximation space A ₁₂, while Figure 5 presents the positive, doubtful, and negative change regions in A ₁₂.

Now consider another bnode candidate b ₃∈G ₂ (labeled as ‘_:ProductB’), represented by X ₃, as shown in Figure 6a; then, Figure 6b presents the change regions for X ₁ and X ₃ in A ₁₃. For this example, we obtain the following values for sets X ₁, X ₂, and X ₃:

ϖ _Ainf(X ₁,X ₂)=5;
ϖ _Ainf(X ₁,X ₃)=3;
ϖ _Asup(X ₁,X ₂)=8;
ϖ _Asup(X ₁,X ₃)=10;
γ _Ainf(X ₁,X ₂)=5/10=0.5;
γ _Ainf(X ₁,X ₃)=3/12=0.25;
γ _Asup(X ₁,X ₂)=8/10=0.8;
γ _Asup(X ₁,X ₃)=10/12≈0.83.

Thus, we have $b_{1} \approx _{A_{12}} b_{2}$ and $b_{1} \approx _{A_{13}} b_{3}$, but as γ_Ainf(X₁,X₂)>γ_Ainf(X₁,X₃), we prefer the mapping between b₁ and b₂. We applied metric γ_Ainf(X_i,X_j) in the mapping between bnode pairs b_i and b_j, with the aim of reducing the delta between the versions. The greater the value of the lower approximation quality, the higher the equivalence between the bnode connections. In cases with equal values for γ_Ainf(X_i,X_j), we prioritize the pairs providing the greatest value for γ_Asup(X_i,X_j), because these are the bnodes with the closest approximations in terms of connections representing the same predicates.
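The values in this example can be reproduced from Definitions 5 and 6 with a few lines of code. The sketch below assumes that the change measures and universe sizes are given as integers; the helper names `quality` and `classify` are hypothetical. Definition 6 as stated does not name the case where γ_Asup equals 1 while γ_Ainf is below 1, so the sketch reports that case separately rather than guessing a class.

```python
from fractions import Fraction


def quality(varpi, universe_size):
    """Quality of a change approximation: change measure divided by |U|."""
    return Fraction(varpi, universe_size)


def classify(gamma_inf, gamma_sup):
    """Classify a bnode pair following the three cases of Definition 6."""
    if gamma_inf == 1:
        return "equivalent"
    if gamma_sup == 0:
        return "distinct"
    if 0 < gamma_sup < 1:
        return "approximated"
    return "not named by Definition 6"  # gamma_sup == 1 with gamma_inf < 1


# (internal measure, external measure, |U|) taken from the worked example.
examples = {("b1", "b2"): (5, 8, 10), ("b1", "b3"): (3, 10, 12)}
for pair, (w_inf, w_sup, size) in examples.items():
    g_inf, g_sup = quality(w_inf, size), quality(w_sup, size)
    print(pair, float(g_inf), round(float(g_sup), 2), classify(g_inf, g_sup))
```

Exact fractions are used so that the boundary tests against 0 and 1 are not affected by floating-point rounding.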

We assume that mapping bnode pairs with higher equivalence or greater approximation between their predicates can reduce the delta size. In the next section, we use the approximation metrics γ _Ainf(X _i,X _j) and γ _Asup(X _i,X _j) to design the proposed mapping algorithm.

The ApproxMap method

In this section, we describe the strategies, data structures, and procedures designed to map bnodes in two RDF graphs. We call our mapping algorithm ApproxMap, because its design involves an analysis of the approximation between the sets representing the bnodes.

Heuristic strategies

Our heuristic method considers various strategies for reducing both the number of comparisons between blank nodes and the delta between the compared versions. We adopted the following strategies in the design of our method:

Two approximation metrics - we use metric γ _Asup(X _i,X _j) if the candidate pairs have the same γ _Ainf(X _i,X _j). A pair with a greater γ _Asup(X _i,X _j) has a higher similarity owing to the greater number of common predicates. We consider that mapping pairs with more similar predicates can help in reducing the delta size.
Two levels for bnode partitioning - the first level considers the existing hierarchy between directly connected bnodes, classifying the bnodes into four disjoint sets: roots, leaves, intermediates, and no interconnections. Then, in the second partitioning level, we organize the bnodes according to the number of connections with other nodes, allowing quick access to sets of bnodes with a particular number of links.
Unmapped neighboring bnodes are the same for incoming links but differ for outbound links - while neighboring bnodes are unmapped, URIs and literals play an important role in distinguishing blank nodes. The strategy adopted by Tzitzikas et al. [6], whereby all neighbors are considered the same, can increase the delta size if the mapped neighbors differ in the final mapping. Therefore, we aim to mitigate this effect by adopting the strategy described above, which considers the possible impact of different neighbors when computing the delta. With prior mapping of neighboring bnodes, we can find a greater approximation between candidate pairs.
Bottom-up approach to map directly connected bnodes - bnodes in the higher levels are mapped based on prior mappings in the lower levels. We compare each bnode mainly with those in the same hierarchical level, thereby reducing the number of comparisons. Relaxation of the same neighborhood for incoming links is due to this approach.
Top-down approximation during bnode mapping - bnodes are mapped iteratively considering a decreasing approximation in the interval (0.0,1.0]. We start the mapping of bnodes with the maximum approximation and, in each iteration, we reduce the lower limit for the desired approximation. Using this approach, we are able to reduce the number of comparisons between bnodes if the datasets contain vastly differing numbers of bnode links. This is because we do not need to compare bnode pairs that differ greatly in their numbers of links, thereby preventing an approximation greater than or equal to the desired value.
Initial equivalent bnode mapping - we can reduce the number of comparisons between the remaining bnodes that have not yet been mapped. Moreover, during the mapping of equivalent bnodes, we can also reduce the comparisons by applying filters to select only those bnodes in the same hierarchical level and with the same number of links to other nodes.

Our heuristic combines all these strategies in an attempt to produce a solution with a reduced delta size during the mapping of blank nodes of two RDF graphs. For this purpose, we use specific data structures, as described in the next section.

Data structures

In the first adopted partitioning level, we store the unmapped bnodes of each graph G_k in the data structure TabG_k, which is partitioned into four disjoint sets: roots, leaves, intermediates, and no interconnections. We use the operator ‘[ ]’ to index the partitions of TabG_k, where TabG_k[i] denotes partition i of TabG_k.

Bnodes without links to other bnodes are placed in TabG_k[1]. Directly connected bnodes are divided according to the hierarchical model shown in Figure 7. TabG_k[4] contains bnodes that are roots in this model; TabG_k[2] stores leaf bnodes; and TabG_k[3] contains bnodes that belong to intermediate layers of this hierarchy, connected to other bnodes by both incoming and outbound links. For simplicity, in Figure 7, we omit the labels of the elements. Moreover, despite the presence of URIs and literals in the figure, the partitioning covers only blank nodes.

Each partition of TabG_k is further partitioned in a second level and indexed by the number of bnode links. This allows us to find bnodes with the same number of predicates quickly, where TabG_k[i][j] returns a reference to the set of bnodes from partition i of TabG_k with j connections to neighboring nodes.
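The two partitioning levels can be sketched as a dictionary of dictionaries. This is only an illustration under stated assumptions: blank nodes are strings with the `_:` prefix, a root has outbound but no incoming links to other bnodes, a leaf has incoming but no outbound links to other bnodes, and the number of links of a bnode is the number of triples in which it appears. The function name `build_tab` is hypothetical.

```python
from collections import defaultdict

# Partition indexes follow the text: 1 = no interconnections, 2 = leaves,
# 3 = intermediates, 4 = roots.
NO_LINKS, LEAF, INTERMEDIATE, ROOT = 1, 2, 3, 4


def is_bnode(term):
    """Assumption of this sketch: blank nodes are labeled with the '_:' prefix."""
    return term.startswith("_:")


def build_tab(triples):
    """Return tab[partition][number_of_links] -> set of bnodes for one graph."""
    link_count = defaultdict(int)
    has_in, has_out = set(), set()
    for s, _p, o in triples:
        for term in {s, o}:              # count each triple once per bnode
            if is_bnode(term):
                link_count[term] += 1
        if is_bnode(s) and is_bnode(o):  # a direct bnode-to-bnode link
            has_out.add(s)
            has_in.add(o)
    tab = {part: defaultdict(set) for part in (NO_LINKS, LEAF, INTERMEDIATE, ROOT)}
    for bnode, count in link_count.items():
        if bnode in has_in and bnode in has_out:
            part = INTERMEDIATE
        elif bnode in has_out:
            part = ROOT
        elif bnode in has_in:
            part = LEAF
        else:
            part = NO_LINKS
        tab[part][count].add(bnode)
    return tab
```

With this layout, `tab[2][3]` would give direct access to the leaf bnodes that have exactly three links, which is the kind of lookup the second partitioning level is meant to make cheap.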

The ApproxMap algorithm also makes use of four arrays, each of size |B_k|, for each graph G_k: alias_k, approxInf_k, approxSup_k, and M_k. Considering that b_i∈B_k, alias_k[i] stores the bnode currently mapped to b_i; approxInf_k[i] and approxSup_k[i] refer, respectively, to the values of the lower and upper approximations calculated for b_i and alias_k[i]. Similarly, M_k[i] stores the bnode definitively mapped to b_i.

Before describing the ApproxMap method, we need to explain the process of finding bnode pairs with the greatest approximation during the mapping. In the next section, we discuss this process, which uses the data structures mentioned above.

Mapping bnode pairs

We implemented the mapping of bnodes in two RDF graphs in two phases. In the first phase, as shown in the pseudocode in Figure 8, we look for pairs of unmapped bnodes with the closest approximation. The FindApproximations algorithm takes as parameters the indexes m and n, which refer, respectively, to the desired partitions of TabG₁ and TabG₂, with 1≤m,n≤4, and the parameter approx, where 0.0<approx<1.0, which denotes the lower boundary of the current desired approximation.

The algorithm looks for pairs with a value for the quality of lower approximation γ _Ainf(X _i,X _j) greater than or equal to the desired value indicated by approx; values below this limit are discarded. Variable b _m stores the current bnode with the closest approximation to b _i, while api_m and aps_m store, respectively, their lower and upper approximations, calculated by metrics γ _Ainf(X _i,X _j) and γ _Asup(X _i,X _j).

Considering subgraph G _i⊆G _k, as defined in Equation 16, |G _i| is the cardinality of G _i, i.e., the number of triples or connections of b _i. In addition, Φ _i is the set of possible p values for triples in the form (s,p,b _i)∈G _i, and Θ _i is the set of p values for triples (b _i,p,o)∈G _i.

In lines 5 and 6 of the algorithm, we use the values of variables l _inf and l _sup to reduce the comparison space using the top-down approximation approach discussed in the ‘Heuristic strategies’ subsection. We can only find an approximation greater than or equal to approx in the interval [l _inf,l _sup], considering our second partitioning level. In line 10, a further filtering takes place, whereby only bnodes with at least one predicate in common are compared.

After obtaining the lower approximation between b _i and the candidate b _j, in line 16, we check whether this new approximation is greater than that previously found. If so, the respective bnodes are marked as candidates for mapping, and any previous pairs are discarded. However, if the new value for γ _Ainf(X _i,X _j) is equal to that previously found, we compare the new value of γ _Asup(X _i,X _j), as shown in line 22. If this value is greater than the current value, the respective bnodes are also marked as candidates for mapping.
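The candidate selection rule described above can be sketched as a single loop. This is a simplified illustration, not the pseudocode of Figure 8: it omits the link-count interval filter, treats incoming and outbound predicates as one set per bnode, and receives the two quality values through a caller-supplied `measure` function. The names `find_best_candidate`, `predicates`, and `measure` are hypothetical; `api` and `aps` mirror the variables named in the text.

```python
def find_best_candidate(b_i, candidates, approx, predicates, measure):
    """Return (best bnode, api, aps) for b_i, or (None, 0.0, 0.0) if none qualifies.

    predicates maps each bnode to its set of predicates; measure(b_i, b_j)
    returns the pair (gamma_inf, gamma_sup). Both are assumptions of this sketch.
    """
    best, api, aps = None, 0.0, 0.0
    for b_j in candidates:
        # Only bnodes with at least one predicate in common are compared.
        if not (predicates[b_i] & predicates[b_j]):
            continue
        g_inf, g_sup = measure(b_i, b_j)
        # Values below the desired lower boundary are discarded.
        if g_inf < approx:
            continue
        # A strictly greater lower quality wins; ties are broken by the upper quality.
        if best is None or g_inf > api or (g_inf == api and g_sup > aps):
            best, api, aps = b_j, g_inf, g_sup
    return best, api, aps
```

Skipping pairs without common predicates before calling `measure` keeps the more expensive comparison for pairs that can actually reach a non-zero approximation.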

After the first phase, we have pairs of candidates with the greatest approximation for mapping, which is finalized in the second phase. Procedure MapApproximations(m, approx), with 1≤m≤4, is used to carry out the mapping. Bnodes in TabG₁[m] with an approximation greater than or equal to parameter approx are permanently mapped.

Procedures FindApproximations and MapApproximations are executed to map similar bnodes. However, when looking for equivalent pairs, we can refine these procedures to filter unmapped bnodes and reduce the search space. Thus, we designed procedure MapEquivalents(m) to map equivalent bnodes in TabG₁[m] and TabG₂[m], where 1≤m≤4. This procedure compares only bnodes with exactly the same incoming and outbound predicates. Thus, we permanently map only those bnode pairs with approximations equal to 1.0.

We also developed a procedure to map the remaining bnodes, after termination of the iterations for the adopted top-down approximation strategy. Procedure MapByOrder() compares bnodes in the same way as FindApproximations. However, the mapping is carried out directly between pairs with the greatest approximation according to the order defined by the partitioning of TabG₁, thereby ignoring the possibility of a closer relationship with another bnode pair.

Proposed method

Finally, we present the ApproxMap method, illustrated in Figure 9, which aims to map the bnodes of two graphs G₁ and G₂, considering a decreasing approximation in the interval (0.0,1.0). The mapping occurs between pairs in the same hierarchical level TabG₁[m] and TabG₂[m] considering a fixed step defined by parameter η₁. To map bnodes of TabG₁[m] and TabG₂[n], with m≠n, we adopted the step defined by η₂, where 0<η₁<η₂<1. Variable min stores the current desired value (or the lower boundary) for the approximation between bnodes.

ApproxMap starts by mapping equivalent bnodes in TabG_k[1] and TabG_k[2], as shown in lines 1 and 2. During the mapping of TabG_k[2], we consider relaxing the neighboring bnodes for inbound links. This mapping is performed only once, because these bnodes are leaves in the hierarchy and do not depend on previous mappings of other bnodes.

The rest of the algorithm consists of a loop, defined between lines 4 and 45, that maps the bnodes using the bottom-up approach discussed in the ‘Heuristic strategies’ subsection, where the mapping of bnodes contained in tables TabG_k[3] and TabG_k[4] depends on previous mappings of bnodes in lower levels of the hierarchy. The algorithm aims to map TabG₁[1] (lines 6 to 14), TabG₁[2] (lines 15 to 23), TabG₁[3] (lines 24 to 33), and TabG₁[4] (lines 34 to 43) in order.

Thus, for each iteration of the outer loop, the value of min is decremented according to step η₂, as expressed in line 5. This value defines the minimum approximation required to map the bnodes in each partition of TabG₁ to the other different partitions in TabG₂. In the case of the same partitions, the mapping occurs in the inner loops, taking into account step η₁, so that the current approximation is decremented in each iteration (lines 11, 20, 30, and 40), until it reaches the limit set in min. Just prior to termination of the algorithm, in line 46, the remaining bnodes are mapped after 1/η₁ iterations.

We compare the bnodes of TabG₁[m] 1/η₁ times with those in TabG₂[m], and a smaller number of times, 1/η₂, with those in TabG₂[n], where m≠n. Therefore, during the search for bnode pairs with greater approximations, the outer loop provides the mapping of bnodes that change partitions between versions, while the inner loop provides the mapping of bnodes that remain in the same hierarchical partition across versions.

Method analysis

The proposed method models bnodes as approximate sets, based on their classification as equivalent, similar, or distinct predicates in terms of their connections with other nodes. This organization by approximation classes allows the definition of metrics to measure the approximation between bnodes.

Considering the introductory example in Figure 1, algorithms Alg_Hung and Alg_Sign obtain a mapping resulting in a delta with size 4. Tzitzikas et al. focused on the mapping between pairs (_:1, _:6) and (_:2, _:7) because they considered connected bnodes to be the same, where dist_h(1,6)=0 and dist_h(1,7)=1. We emphasize the adoption of both bottom-up and different neighbor strategies in ApproxMap while mapping directly connected bnodes. The first iteration of ApproxMap results in the mapping of bnode pairs (_:3, _:8), (_:4, _:10), and (_:5, _:9), which have an approximation equal to 1.0. From this initial mapping, our method can map pairs (_:1, _:7) and (_:2, _:6), because γ_Ainf(X₁,X₆)=0.50, γ_Ainf(X₁,X₇)=0.67, γ_Ainf(X₂,X₆)=0.67, and γ_Ainf(X₂,X₇)=0.34. The mapping obtained by our method results in a smaller delta size of two triples.

On the other hand, during the ApproxMap method design, we assume that reducing the delta for individual bnode pairs also results in a reduction in the global delta size. However, this assumption does not produce the optimal delta in some situations, as illustrated in Figure 10. In this example, we assume that sets X ₁, X ₂, X ₃, and X ₄ represent, respectively, bnodes ‘_:Product1’, ‘_:Product2’, ‘_:Product3’, and ‘_:Product4’. Thus, we obtain the following approximation measurement: γ _Ainf(X ₁,X ₃)=0.50; γ _Ainf(X ₁,X ₄)=0.33; γ _Ainf(X ₂,X ₃)=0.40; and γ _Ainf(X ₂,X ₄)=0.14.

First, ApproxMap maps bnodes ‘_:Product1’ and ‘_:Product3’, corresponding to the pair with the closest approximation. Both ‘_:Product1’ and ‘_:Product2’ are closest to bnode ‘_:Product3’, but mapping ‘_:Product1’ represents the lowest cost of transforming a bnode of the first version into ‘_:Product3’: we can change ‘_:Product1’ to ‘_:Product3’ by including only a single triple, whereas we would need to include an additional three triples to transform ‘_:Product2’ into ‘_:Product3’.

ApproxMap then maps the remaining bnodes ‘_:Product2’ and ‘_:Product4’, resulting in a global delta containing seven triples. However, if we had initially mapped ‘_:Product1’ to ‘_:Product4’, the resulting delta would have size 5, as is the case using the Hungarian algorithm. This occurs because our hypothesis considers only a reduction in delta between individual pairs and not an assessment of the impact of this reduction in terms of the global delta size. Because the remaining bnodes are mapped considering only unmapped bnode pairs, ApproxMap does not test all mapping possibilities, which can result in a local optimum.
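The local-optimum effect can be illustrated with the four γ_Ainf values of this example. The sketch below contrasts a greedy pairing, which always takes the highest remaining value, with an exhaustive search over all pairings. It is only an illustration: the exhaustive search maximizes the sum of γ_Ainf, which is an assumption chosen for simplicity and is not the delta size that the Hungarian algorithm minimizes. The function names are hypothetical.

```python
from itertools import permutations

# Lower approximation qualities reported for the example in Figure 10.
gamma_inf = {
    ("_:Product1", "_:Product3"): 0.50,
    ("_:Product1", "_:Product4"): 0.33,
    ("_:Product2", "_:Product3"): 0.40,
    ("_:Product2", "_:Product4"): 0.14,
}


def greedy_mapping(scores):
    """Repeatedly map the unmapped pair with the highest score."""
    mapping, used = {}, set()
    for (a, b), _score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if a not in mapping and b not in used:
            mapping[a] = b
            used.add(b)
    return mapping


def exhaustive_mapping(scores):
    """Try every pairing and keep the one with the highest total score."""
    left = sorted({a for a, _ in scores})
    right = sorted({b for _, b in scores})
    best = max(
        permutations(right, len(left)),
        key=lambda perm: sum(scores[(a, b)] for a, b in zip(left, perm)),
    )
    return dict(zip(left, best))


print(greedy_mapping(gamma_inf))
print(exhaustive_mapping(gamma_inf))
```

The greedy pairing selects (‘_:Product1’, ‘_:Product3’) first and is then left with (‘_:Product2’, ‘_:Product4’), while the exhaustive search pairs ‘_:Product1’ with ‘_:Product4’, which is the alternative mapping discussed in the text. Exhaustive search grows factorially with the number of bnodes, so it is shown here only for this two-by-two case.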

Moreover, in ApproxMap, the mapping occurs in the order defined by the adopted partitioning. We also used some ordered structures during algorithm implementation, optimizing the comparisons between bnodes. The additional cost of insertion is already known for these structures, although this is beyond the scope of this article. This adopted order can affect the delta size, mainly in procedure MapByOrder. As before, this may occur because our method does not test all mapping possibilities.

We emphasize that great diversity among the bnode predicates in the same version is beneficial for method ApproxMap. For a better analysis, let us consider the approximation space A ₁₋₄, illustrated in Figure 11, and which we constructed from the union of all sets representing bnodes in the two versions in Figure 10. For simplicity, we omit the negative regions of the bnodes in Figure 11.

According to Figure 11, there are few differences between sets of the same version. In particular, X ₁ is a subset of all other sets, hindering the choice of its best mapping option. The inclusion of different values can contribute to a better choice of bnode pairs, as illustrated in Figure 12.

Figure 13 illustrates the approximation space $A^{'}_{1-4}$, considering all bnodes in the datasets in Figure 12. Now, the bnodes within the same version have greater diversity, allowing a better approximation measure between bnodes if we consider the different versions. In this case, the new values of predicate ‘type’ lead to a better choice of the candidates, where we have γ_Ainf(X₁,X₃)=0.25; γ_Ainf(X₁,X₄)=0.5; γ_Ainf(X₂,X₃)=0.5; and γ_Ainf(X₂,X₄)=0.11. Thus, the new delta obtained from ApproxMap contains five triples.

In particular, it may be difficult for ApproxMap to choose the best mapping option if there are multiple candidates representing approximately equal sets in an approximation space, i.e., sets with the same lower and upper approximations [7], as exemplified in Figure 14a. The adopted metrics may be insufficient to distinguish these sets.

Furthermore, in cases involving completely different datasets, ApproxMap compares all bnodes in the two datasets during the mapping, resulting in the maximum delta equal to the sum of the triples in the two datasets. We included some optimizations in ApproxMap, reducing the cost of comparing distinct pairs, by first checking for the presence of common predicates. The worst-case execution corresponds to a particular case of distinct datasets, where all bnodes have the same predicates. In this case, we obtain dispersed approximate sets representing the bnodes, i.e., sets with an empty lower approximation (γ_Ainf(X_i,X_j)=0) and an upper approximation equal to the set universe (ϖ_Asup(X_i,X_j)=|U|, i.e., γ_Asup(X_i,X_j)=1) [7], as shown in Figure 14b.

We use step η₁ to control the number of comparisons between bnodes, where the total number is given by 1/η₁×O(n²). Thus, when 1/η₁ is considerably smaller than n, where n gives the smallest number of bnodes in the datasets, the worst-case time complexity of the algorithm is O(n²). Conversely, the best-case execution of ApproxMap occurs with equivalent datasets containing bnodes with varying numbers of connections and without any directly connected bnodes. In this case, we need to compare each bnode with exactly one bnode in the other version. Thus, the complexity of the best case is Ω(n).

Finally, we intend to apply ApproxMap to configuration management of software engineering projects, specifically to version control of RDF datasets. These projects are characterized by the manipulation of data, information, and knowledge in various types and formats, manually constructed based on the modularity principle, where complex elements are divided into smaller parts. Therefore, we expect great diversity between bnodes in the same version, justifying the application of ApproxMap in this context.

Because the datasets involved are usually constructed using an incremental development approach, we expect satisfactory performance of ApproxMap on similar versions, containing several approximately equivalent bnode pairs, as generally occurs in successive versions of software engineering artifacts. A recommended configuration management practice is to perform version control considering the low percentage of changes between versions. If this does not occur, larger deltas prevent the recovery of intermediate states between successive versions.

As future work, we propose a meticulous analysis of the impact of the adopted metrics and strategies on the mapping. We also intend to verify the applicability of other RST metrics that could provide better approximation measures between bnodes. In addition, we propose improving the performance of the algorithm by executing some operations in parallel, such as the comparison of approximate sets.

Results and discussion

In analyzing the performance of the ApproxMap algorithm, we considered both the delta size calculated from mapping pairs of RDF datasets and the time spent on this task. This allowed comparison of the results and values obtained for the Alg_Hung and Alg_Sign algorithms, presented by Tzitzikas et al. [6]. All experiments discussed in this section were executed on an Intel Core i7-3537U, 2.0 GHz processor, with 8 GB RAM and running Ubuntu 13.10. To correct any formatting or encoding issues, preprocessing was carried out on certain pairs of datasets.

Three metrics defined by Tzitzikas et al. [6] were used in the analysis of the experiments: b_density, b_len, and D_a. Let N and B denote, respectively, the sets of nodes and blank nodes of graph G, where B⊆N. Further, let conn(b) denote the set of nodes in G directly attached to b∈N. Then, we have $b_{\text{density}} = \text{avg}_{b \in B}\left(|conn(b) \cap B| / |conn(b)|\right)$; b_len refers to the average maximum path length, with vertices consisting only of bnodes; and D_a corresponds to the average number of bnode triples.
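The b_density formula can be sketched directly from its definition. This is a minimal illustration, assuming the graph is a list of (subject, predicate, object) string triples and that blank nodes carry the `_:` prefix; the function name `b_density` and the example URI are hypothetical, and the sketch returns 0.0 for a graph without bnodes as a convention of its own.

```python
from collections import defaultdict


def is_bnode(term):
    """Assumption of this sketch: blank nodes are labeled with the '_:' prefix."""
    return term.startswith("_:")


def b_density(triples):
    """Average, over all bnodes b, of |conn(b) ∩ B| / |conn(b)|."""
    conn = defaultdict(set)  # bnode -> set of directly attached nodes
    for s, _p, o in triples:
        if is_bnode(s):
            conn[s].add(o)
        if is_bnode(o):
            conn[o].add(s)
    if not conn:
        return 0.0
    ratios = [
        sum(1 for node in neighbors if is_bnode(node)) / len(neighbors)
        for neighbors in conn.values()
    ]
    return sum(ratios) / len(ratios)


example = [("_:a", "p", "_:b"), ("_:a", "q", "http://example.org/x")]
print(b_density(example))
```

In the example, `_:a` is attached to one bnode out of two neighbors and `_:b` is attached only to a bnode, so the two ratios being averaged are 1/2 and 1.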

Except for the last experiment, we tested the ApproxMap algorithm with three different sets of parameters: η₁=0.01 and η₂=0.1; η₁=0.05 and η₂=0.125; and η₁=0.05 and η₂=0.25, where these tests are denoted, respectively, as ApproxMap_1/10%, ApproxMap_5/12%, and ApproxMap_5/25%. We chose these steps empirically, considering the desired number of iterations. As future work, we propose further analysis of the choice of step values and calibration of ApproxMap.

We used the ApproxMap_5/12% tests as the baseline for comparison when evaluating the impact of changes in η₁ and η₂ on the results. The ApproxMap_5/12% test includes 20 iterations (1/η₁) of the inner loop of the method, comparing each bnode with those in the same hierarchical partition in the second version. In addition, there are eight iterations (1/η₂) of the outer loop, comparing the bnodes with those in the remaining partitions. The ApproxMap_5/25% test was used to verify the impact of an increase in η₂, reducing the comparisons between distinct partitions to four iterations. Finally, we used the ApproxMap_1/10% tests to analyze the impact of a reduction in η₁, increasing the comparisons between the same partitions to 100 iterations. In these tests, we also adjusted η₂ to better fit η₁, resulting in ten iterations of the outer loop.
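The iteration counts quoted above follow directly from the step values as 1/η₁ and 1/η₂. The short check below recomputes them with exact fractions; the dictionary keys and the helper name `iteration_counts` are illustrative labels only.

```python
from fractions import Fraction

# (eta_1, eta_2) for the three tested configurations, written as decimal strings
# so that Fraction parses them exactly.
configs = {
    "ApproxMap 1/10%": ("0.01", "0.1"),
    "ApproxMap 5/12%": ("0.05", "0.125"),
    "ApproxMap 5/25%": ("0.05", "0.25"),
}


def iteration_counts(eta1, eta2):
    """Return (inner-loop iterations, outer-loop iterations) as 1/eta_1 and 1/eta_2."""
    return int(1 / Fraction(eta1)), int(1 / Fraction(eta2))


for name, (eta1, eta2) in configs.items():
    print(name, iteration_counts(eta1, eta2))
```

Exact fractions avoid the rounding issues that can appear when dividing by floating-point values such as 0.1.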

We organized the experiments in three groups based on the type of dataset used in each: real, extracted from the Web (i.e., crawled), or synthetic datasets, as discussed in the following sections. The standard units for delta size and mapping time are, respectively, triples and milliseconds. We used a logarithmic scale for charts showing mapping times of the algorithms, thereby providing better visualization and comparison of the results.

Real datasets

In the first group of experiments, we used the same real datasets tested by Tzitzikas et al. [6]. Table 1 describes the main features of these datasets, where columns |B| and |G| denote, respectively, the average numbers of bnodes and triples in the version pairs. Measurements for the dataset Italian are the same for both files. In the Swedish dataset, the coefficient of variation (cv) is equal to 2.42%, 2.98%, and 0.19%, for |G|, |B|, and D _a, respectively.

Table 1

Information about real datasets

Dataset	|B|	|G|	D_a	b_density	b_len
Swedish	522	3,670	5.47	0.00	0.00
Italian	6,390	49,897	3.42	0.00	0.00

Table 2 gives the results obtained by the algorithms in the first experiment, considering the time to map blank nodes and the delta size calculated from this mapping. In terms of the delta, we obtained the same values for both datasets in all algorithm tests, with the exception of Alg_Sign. With respect to the execution time, considering the ratio between the times of the algorithms Alg_Hung and ApproxMap_5/25%, we obtained 141 and 2,982, respectively, for the Swedish and Italian datasets. Thus, Alg_Hung required considerable additional time, particularly for dataset Italian.

Table 2

Results of the algorithms applied to real datasets

Metric	Algorithm	Swedish	Italian
Delta (triples)	ApproxMap_1/10%	297	6
Delta (triples)	ApproxMap_5/12%	297	6
Delta (triples)	ApproxMap_5/25%	297	6
Delta (triples)	Alg_Hung	297	6
Delta (triples)	Alg_Sign	423	6
Time (ms)	ApproxMap_1/10%	113	170
Time (ms)	ApproxMap_5/12%	36	158
Time (ms)	ApproxMap_5/25%	34	153
Time (ms)	Alg_Hung	4,789	456,173
Time (ms)	Alg_Sign	37	59
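The time ratios quoted for the real datasets can be recomputed from the mapping times in Table 2. The snippet below is a simple arithmetic check; the dictionary layout and the helper name `time_ratio` are illustrative.

```python
# Mapping times in milliseconds, copied from Table 2.
times_ms = {
    "Swedish": {"Alg_Hung": 4789, "ApproxMap_5/25%": 34},
    "Italian": {"Alg_Hung": 456173, "ApproxMap_5/25%": 153},
}


def time_ratio(dataset):
    """Ratio between the Alg_Hung and ApproxMap_5/25% times, rounded to an integer."""
    row = times_ms[dataset]
    return round(row["Alg_Hung"] / row["ApproxMap_5/25%"])


for dataset in times_ms:
    print(dataset, time_ratio(dataset))
```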

Crawled datasets

Owing to the difficulty in finding appropriate real versioned datasets for the experiments, in the second group of experiments, we used an RDF crawler, LDSpider [21], to construct pairs of RDF dataset versions. We extracted some versions from randomly chosen links to common datasets in the linked open data (LOD) cloud [22], such as DBpedia and DBLP, as well as FOAF Profiles.

We used LDSpider because of its dual crawling strategies [21]: breadth-first and load-balancing. Thus, we executed two experiments based on these strategies, where the maximum number of URIs was limited to generate reasonably sized files for the tests, considering the computational costs of the algorithms. In the first experiment using LDSpider, we adopted the load-balancing strategy, with the aim of obtaining pairs of files with approximately the same size. Table 3 gives information about the crawled datasets used. The first column denotes the instance number. All values in Table 3 were identical for both produced versions, with the exception of metric |G| in the second instance, where cv=0.73%.

Table 3

Crawled datasets using the load-balancing strategy

Instance number	|B|	|G|	D_a	b_density	b_len
1	19	1,048	9.00	0.01	0.11
2	83	11,555	7.31	0.00	0.00
3	361	28,208	5.93	0.00	0.00
4	362	28,219	5.96	0.00	0.00
5	893	15,337	4.40	0.00	0.02

Figures 15 and 16 illustrate the results for the datasets described in Table 3. To further analyze the delta reduction potential of the algorithms, the deltas are presented as a percentage of the average number of triples, i.e., Δ(G₁,G₂)/|G|. As seen in Figure 15, all algorithms achieved the same delta reduction on all datasets, except for Alg_Sign, which showed an increase of 0.08 in the delta reduction potential, in the last instance, compared with the potential of the other algorithms.

For bnode mapping, algorithm Alg_Hung was the slowest. Considering the differences between the mapping times of the algorithms presented in Figure 16, compared with ApproxMap_5/25%, Alg_Hung showed an increase in mapping time between 0.50 and 3.16, on the adopted logarithmic scale. ApproxMap_5/25% was faster than Alg_Sign in two instances, with the maximum time increase for Alg_Sign equal to 0.78. Alg_Sign was faster in the remaining instances, with an increase in time for ApproxMap_5/25% less than 1.03. Finally, considering the differences between steps η₁ and η₂, compared with ApproxMap_5/12%, the mapping time for ApproxMap_1/10% increased by between 0.38 and 0.60, while ApproxMap_5/25% showed a maximum reduction in mapping time of 0.06.

In the second experiment using LDSpider, we extracted the first version using the breadth-first strategy and the second using the load-balancing strategy. Once again, we aimed to create files with approximately the same size but with major differences due to the change in strategy. Table 4 shows features of the instances considered in this experiment. Detailed information is given in this table, because there are considerable differences between some versions.

Table 4

Crawled datasets with breadth-first/load-balancing strategy

Instance number	|B| (File 1)	|B| (File 2)	|G| (File 1)	|G| (File 2)	D_a (File 1)	D_a (File 2)	b_density (File 1)	b_density (File 2)	b_len (File 1)	b_len (File 2)
1	169	19	4,355	1,048	5.73	9.00	0.21	0.01	16.26	0.11
2	190	83	11,892	11,470	5.82	7.31	0.07	0.00	1.67	0.00
3	1,246	893	24,364	15,337	5.13	4.40	0.10	0.00	10.88	0.02
4	1,963	361	27,650	28,208	6.75	5.93	0.00	0.00	0.00	0.00
5	1,967	362	28,031	28,219	6.74	5.96	0.00	0.00	0.01	0.00

Figure 17 shows the delta reduction potential of the algorithms in these tests. Alg_Hung showed an increase in its delta reduction percentage of 1.26, while the increase in potential of Alg_Sign was less than 1.42, when compared with all tests using ApproxMap.

Figure 18 compares the mapping times on a logarithmic scale. Alg_Hung once again performed the worst. Compared with ApproxMap_5/25%, the mapping times of Alg_Hung increased by between 0.11 and 1.92. Compared with Alg_Sign, the increase in mapping times of ApproxMap_5/25% varied between 0.46 and 1.59. We also verified a time increase for the ApproxMap tests; the mapping time of ApproxMap_1/10% increased by between 0.42 and 0.60, while that of ApproxMap_5/25% decreased by 0.12, compared with ApproxMap_5/12%.

Synthetic datasets

In this final group of experiments, to evaluate the algorithms in the mapping of datasets with some specific features, e.g., directly connected bnodes or equivalent datasets, we generated pairs of synthetic datasets for use in the tests.

Datasets from adapted Univ-Bench artificial generator

Initially, we considered the datasets used by Tzitzikas et al. [6], whose generator was based on the Univ-Bench artificial (UBA) data generator [23]. Table 5 lists the features of the dataset pairs tested in this experiment, where all datasets have 240 bnodes. In this table, column Δ_opt/|G| displays the ratio between the optimal delta size, Δ_opt(G₁,G₂), and |G|. For the average values shown in this table, we have cv=16.17% and cv=1.3% for b_len in instances 4 and 9, respectively, and cv<0.22% in the other cases.

Table 5

Synthetic datasets generated by Tzitzikas et al. [6]

Instance number	|G|	D_a	b_density	b_len	Δ_opt/|G| (%)
1	5,846	13.4	0	0	1
2	5,025	10.5	0.1	1	0.5
3	2,381	7	0.15	1	1.5
4	1,628	5	0.2	1	1.5
5	1,636	5	0.2	1.15	1
6	1,399	4	0.25	1.15	1.7
7	919	3	0.32	1.15	3.2
8	909	3.25	0.4	1.35	2.7
9	1,001	3.94	0.5	21.5	2.5
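The averages in these tables are qualified by a coefficient of variation (cv), the standard deviation expressed as a percentage of the mean. A minimal sketch of that calculation follows. The text does not say whether the population or sample standard deviation was used, so the population form here is an assumption, and the sample values are hypothetical rather than measurements from the experiments.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return the coefficient of variation (std / mean) as a percentage.

    Assumption: population standard deviation (pstdev); swap in
    statistics.stdev for the sample form.
    """
    m = mean(values)
    if m == 0:
        raise ValueError("cv is undefined for a zero mean")
    return 100.0 * pstdev(values) / m


# Hypothetical b_len measurements over repeated generations of one instance.
sample = [1.0, 1.2, 0.9, 1.1, 0.8]
print(f"cv = {coefficient_of_variation(sample):.2f}%")
```

A small cv, such as the values below 0.22% reported for most metrics, indicates that the generated datasets for one instance barely differ from their average.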

Figure 19 shows the delta sizes obtained for the nine version pairs. ApproxMap and Alg_Hung found the optimal delta in five instances. Alg_Hung found a smaller delta than ApproxMap 5/25% in instances 4 and 7, with a maximum decrease in delta of 0.65. Moreover, in the latter instance, Alg_Hung was only surpassed by ApproxMap 5/25%. We confirmed the smallest delta reduction potential in instance 8, where ApproxMap, Alg_Hung, and Alg_Sign showed increases of 5.34, 7.98, and 8.86, respectively, when compared with the optimal reduction potential.

Figure 20 shows the results for the same nine pairs but considering the second version in reverse order. Contrary to the results of the algorithms proposed by Tzitzikas et al. [6], the bnode order did not affect the results of ApproxMap. For the final four instances, compared with the optimal reduction potential, Alg_Hung showed an increase varying from 6.72 to 19.64. Similarly, for these instances, the increase in the reduction potential of Alg_Sign varied from 8.86 to 20.75, compared with the optimal values.

Figure 21 shows the results for the mapping times of the algorithms on a logarithmic scale. Alg_Hung achieved the worst performance, with an increase in mapping time varying between 1.53 and 2.45, compared with ApproxMap 5/25%. In the final three instances, Alg_Sign achieved better performance than ApproxMap 5/25%, with a decrease in mapping time of between 0.12 and 0.65. ApproxMap 5/25% performed better in the first instance, executing faster than Alg_Sign with a decrease in time of 0.98. Finally, compared with ApproxMap 5/12%, the mapping time of ApproxMap 1/10% showed a maximum increase of 0.52, while that of ApproxMap 5/25% decreased by 0.16.

Datasets with directly connected bnodes

In the final three instances of the nine pairs of synthetic datasets used above, all algorithms obtained higher deltas than the optimal values, when applied to datasets with a marked increase in the number of directly connected bnodes. To assess the influence of b_density on delta size, we conducted a new experiment with identical versions of the datasets. All algorithms found an empty delta for the original order of the datasets. However, considering a reverse order in the second version, Figure 22 shows the results in terms of the deviation from the optimal delta, i.e., $dx = \frac{\Delta(G_{1}, G_{2}) - \Delta_{\text{opt}}(G_{1}, G_{2})}{\Delta_{\text{opt}}(G_{1}, G_{2}) + 1}$, where Δ_opt denotes the optimal value. With an increase in b_density, the reverse ordering once again influenced the results of Alg_Hung and Alg_Sign. These algorithms achieved the same increase in dx in the final three instances, whereas ApproxMap produced an empty delta.

To analyze the performance of the algorithms, considering datasets with a higher number of directly connected bnodes, we developed an RDF dataset generator, based on that included in the Berlin SPARQL Benchmark (BSBM) [24]. We used this generator to produce pairs of file versions with an average b_density of 0.34%, and cv=7.25%. We discuss the experiments using this generator in the next section.
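The deviation dx defined above can be computed directly from two delta sizes. The sketch below is illustrative only: the function name and the sample sizes are hypothetical. The +1 in the denominator presumably keeps dx defined when the optimal delta is empty, as it is for identical versions.

```python
def deviation_from_optimal(delta, delta_opt):
    """dx = (delta - delta_opt) / (delta_opt + 1), as defined in the text.

    `delta` and `delta_opt` are delta sizes, i.e. counts of triples.
    """
    if delta < 0 or delta_opt < 0:
        raise ValueError("delta sizes must be non-negative")
    return (delta - delta_opt) / (delta_opt + 1)


# Identical versions have an empty optimal delta (delta_opt = 0).
print(deviation_from_optimal(0, 0))   # an algorithm that also finds an empty delta
print(deviation_from_optimal(12, 0))  # hypothetical algorithm reporting 12 spurious triples
```

With an empty optimal delta, dx reduces to the number of spurious triples an algorithm reports.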

Datasets from adapted BSBM generator

Our adapted generator is capable of producing two versions of an e-commerce portal, which is used by vendors to offer various products and by consumers to submit reviews about these products. The versions contain descriptions for five different types of resources, as well as three different types of blank nodes. We determined these quantities empirically to obtain the desired value for b_density. Thus, we defined the elements corresponding to products, their types, and characteristics as blank nodes, although the portal also included a hierarchy of product types.

In all experiments, 74.73% of the triples contained bnodes, with a coefficient of variation of cv=1.28%. The high percentage of bnodes was acceptable because triples without bnodes could be mapped directly and this was not our concern. Moreover, except for the last experiment in which we tested large datasets, we limited the maximum number of bnodes to 2,000, so that the datasets could be tested with all algorithms. We constructed the version pairs in such a way that changes occurred in isolation. The intersections of sets of equivalent, added, or removed triples between versions were empty, considering all possible bnode mappings. Based on this, we obtained by construction the optimal delta size for the tested pairs.

The adapted generator accepts as input the number of products sold on the e-commerce portal and then determines the number of other bnodes (product types and characteristics) in terms of this input number. As a result, the values of some metrics were affected by the numbers of bnodes, such as the average maximum path length (b_len), due to variations in the product type hierarchy. However, this metric does not affect the computational cost of ApproxMap. We dealt with bnode hierarchies using the adopted bottom-up strategy. Similarly, the absence of bnode interconnections (b_density=0) did not affect ApproxMap because, in this case, it merely grouped the bnodes in the same hierarchical partition. A meticulous analysis of the impact of these metrics on delta size is suggested as future work.

In the next sections, we describe the five experiments performed using datasets produced by our adapted generator. These experiments consider increases in the version and delta sizes, identical or different versions, as well as large datasets.

Changing the size of the datasets

The first experiment using our adapted generator aimed to analyze the impact of an increase in the number of bnodes. For the generation of datasets, we set a fixed ratio of 50% of equivalent elements among pairs, to assess the impact of an increase in version size assuming a moderate delta size.

Table 6 gives the details of this experiment, which involved five version pairs. Considering all the averages shown in this table, we obtained a cv smaller than 0.98%, except for b_len, which yields a maximum cv of 4.67%. In this table, column Δ_opt/|G| displays the ratio between the optimal delta size, Δ_opt(G₁,G₂), and |G|.

Table 6

Datasets with varying version sizes

Instance number	|B|	|G|	D_a	b_density	b_len	Δ_opt/|G| (%)
1	400	3,390	7.29	0.29	60.42	53.39
2	800	7,000	7.78	0.33	127.34	53.02
3	1,200	10,806	8.30	0.37	212.71	52.74
4	1,600	14,061	7.88	0.34	200.95	52.49
5	2,000	17,541	7.86	0.34	270.09	52.71

Figures 23 and 24 show the results obtained for these datasets, in terms of delta sizes and mapping times of the algorithms, respectively. Alg_Sign achieved the worst performance in terms of delta size, with an increase in delta size varying between 33.6 and 36.11. Alg_Hung showed an increase in the reduction percentage ranging from 2.6 to 8.88. Finally, the increase in the reduction potential of ApproxMap 1/10%, ApproxMap 5/12%, and ApproxMap 5/25% was, respectively, less than 4.28, 4.53, and 4.7 compared with the optimal reduction potential.

Regarding bnode mapping times, Alg_Hung was slower than ApproxMap 5/25%, with an increase in time varying between 0.97 and 1.44 on the logarithmic scale. Compared with Alg_Sign, the increase in time for ApproxMap 5/25% varied between 0.65 and 1.70. Finally, the maximum increase in the execution time of ApproxMap 1/10% compared with that of ApproxMap 5/12% was 0.45. Similarly, ApproxMap 5/12% showed an increase smaller than 0.12, compared with ApproxMap 5/25%.

Changing delta size

While the first experiment using our generator considered varying numbers of bnodes, the second experiment investigated variations in the differences between version pairs. In this experiment, we adopted datasets with a fixed number of 2,000 bnodes, varying the percentage of different elements between 15% and 90% in fixed steps of 15%. With these percentages, the six pairs generated in this experiment fall into three groups of two, representing low, medium, and high levels of change between versions. Table 7 gives information about these pairs. Considering all the averages shown in this table, we obtained a cv of less than 2.22%.

Table 7

Datasets with varying delta sizes

Instance number	|G|	D_a	b_density	b_len	Δ_opt/|G| (%)
1	17,895	8.47	0.38	231.60	15.86
2	17,639	7.99	0.35	304.20	31.73
3	17,868	8.19	0.37	326.40	47.34
4	17,841	8.13	0.36	265.72	62.91
5	18,014	8.28	0.37	340.61	78.35
6	17,695	7.95	0.34	328.19	94.06

To facilitate analysis of the impact of increasing delta sizes, Figure 25 shows the results of the algorithms as the percentage difference between the deltas found and the optimal values, relative to these optimal values, i.e., (Δ(G₁,G₂) − Δ_opt(G₁,G₂)) / Δ_opt(G₁,G₂).

As before, Alg_Sign performed the worst in terms of delta size, with the distance to the optimal delta varying between 51.61 and 88.3. Alg_Hung showed a distance to the optimal ranging from 11.45 to 21.28, while ApproxMap 1/10% showed one varying between 1.36 and 7.87. For ApproxMap 5/12% and ApproxMap 5/25%, the distances to the optimal varied from 0.85 to 8.84 and from 5 to 9.39, respectively.

Moreover, Figure 26 shows the results for the mapping times of the algorithms. The time increase for Alg_Hung varied between 1.32 and 1.47, compared with ApproxMap 5/25%. However, when compared with Alg_Sign, ApproxMap 5/25% required an increased time ranging from 1.54 to 1.92, on the logarithmic scale. Finally, taking ApproxMap 5/12% as the reference, we observed a time increase of less than 0.5 for ApproxMap 1/10%, and a time decrease of less than 0.13 for ApproxMap 5/25%.
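As a rough illustration of this measure, the sketch below recovers an approximate optimal delta size from the |G| and Δ_opt/|G| columns of Table 7 and expresses a delta as a percentage distance to that optimum. The function names and the delta of 3,200 triples are hypothetical. Multiplying by 100 assumes that the distances in this section are reported as percentages.

```python
def optimal_delta_size(num_triples, ratio_percent):
    """Approximate delta_opt from |G| and the ratio delta_opt/|G| given in percent."""
    return num_triples * ratio_percent / 100.0


def distance_to_optimal_pct(delta, delta_opt):
    """Percentage distance: 100 * (delta - delta_opt) / delta_opt."""
    if delta_opt <= 0:
        raise ValueError("delta_opt must be positive for a relative distance")
    return 100.0 * (delta - delta_opt) / delta_opt


# Table 7, instance 1: |G| = 17,895 and delta_opt/|G| = 15.86%.
delta_opt = optimal_delta_size(17_895, 15.86)
# Hypothetical delta of 3,200 triples found by some algorithm.
print(f"delta_opt ~ {delta_opt:.0f} triples")
print(f"distance to optimal = {distance_to_optimal_pct(3_200, delta_opt):.2f}%")
```

Unlike dx, this measure has no +1 in the denominator, so it applies only when the optimal delta is non-empty, as it is for every pair in this experiment.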

Identical datasets

For a better analysis of the algorithms' behavior, the next two experiments considered extreme cases, with the datasets either identical or completely different. In the first case, we compared the second version of the datasets from the first experiment using our adapted generator with a version created by applying the delta to the first version, i.e., $G'_{2} = G_{1} + \Delta$. With this, we validated the deltas previously found by ApproxMap. No differences were found by any of the algorithms for the identical datasets, even when considering the second version in reverse order.

Figure 27 shows the time spent comparing the dataset pairs. Compared with ApproxMap 5/25%, Alg_Hung required an increased time ranging from 1.24 to 1.51, while Alg_Sign required a decreased time varying between 0.97 and 1.76. Compared with ApproxMap 5/12%, the increase in execution time of ApproxMap 1/10% was less than 0.1, while that for ApproxMap 5/25% was less than 0.04, on the adopted logarithmic scale.
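The validation step above, rebuilding the second version from the first version plus the delta, can be sketched with plain sets of triples. This is a simplified illustration rather than the authors' implementation. It assumes the bnode mapping has already been applied, so both versions share bnode labels, and the function names and sample triples are hypothetical.

```python
def compute_delta(g1, g2):
    """Return (added, removed): the triple sets that turn g1 into g2."""
    added = g2 - g1
    removed = g1 - g2
    return added, removed


def apply_delta(g1, delta):
    """Build G2' = G1 + delta by dropping removed triples and adding new ones."""
    added, removed = delta
    return (g1 - removed) | added


def delta_size(delta):
    """Delta size is the number of added plus removed triples."""
    added, removed = delta
    return len(added) + len(removed)


# Hypothetical triples; "_:b1" stands for an already-mapped blank node.
g1 = {("_:b1", "ex:type", "ex:Product"), ("_:b1", "ex:label", "Lamp")}
g2 = {("_:b1", "ex:type", "ex:Product"), ("_:b1", "ex:label", "Desk lamp")}

delta = compute_delta(g1, g2)
g2_rebuilt = apply_delta(g1, delta)
# A correct delta reproduces g2, so comparing g2 with g2_rebuilt gives an empty delta.
print(delta_size(delta), delta_size(compute_delta(g2, g2_rebuilt)))
```

For two graphs with no triples in common, `delta_size` returns `len(g1) + len(g2)`, the other extreme case.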

Different datasets

In this experiment, we also based the generation of different datasets on the second versions of the datasets produced in the first experiment using our generator. In this case, we changed all the elements included in the triples and obtained deltas equal to |G₁|+|G₂|. Figure 28 shows the results for these pairs, with mapping times expressed on a logarithmic scale.

Compared with ApproxMap 5/25%, Alg_Hung required an increased time varying between 1.4 and 2.21, while the reduced time requirement of Alg_Sign varied from 0.94 to 1.92. Compared with the time requirement of ApproxMap 5/12%, ApproxMap 1/10% showed an increase ranging from 0.4 to 0.44, while the time reduction for ApproxMap 5/25% varied between 0.11 and 0.15, on the logarithmic scale.

Large datasets

Finally, the last experiment considered the behavior of ApproxMap and Alg_Sign when mapping large datasets. We could not test Alg_Hung in this experiment, owing to its high computational cost. With the aim of reducing the number of comparisons between bnodes, we adopted steps η₁=0.2 and η₂=0.5 for ApproxMap, which is referred to as ApproxMap 20/50%. Thus, with this choice of steps, there are five iterations comparing bnodes in the same hierarchical partitions and only two iterations comparing bnodes in different partitions.

In the construction of the dataset pairs, the number of bnodes varied between 20,000 and 100,000, in fixed steps of 20,000. We created these five pairs with an average number of triples (|G|) of 183,732; 356,176; 562,828; 754,524; and 958,038, with a maximum cv of 0.37%. We created these datasets with 25% different elements, with the average size of the optimal delta equal to 26.41% of the triples and cv=0.11%. The adopted values in this experiment were chosen empirically with the aim of reducing the mapping times of the algorithms. We did not test the algorithms with datasets larger than those generated in this experiment, owing to the computational cost of ApproxMap. However, the considered instances were sufficient to evaluate the behavior of the algorithm with large datasets. Moreover, the construction of a large dataset is not common practice in the application context of our method, that is, software development projects, where techniques such as modularization are encouraged.

Figures 29 and 30 show, respectively, the distance between the results found and the optimal delta size and the algorithms’ mapping times. As observed previously, Alg_Sign underperformed in terms of delta size, achieving a distance to the optimal delta ranging from 81.87 to 110.87. Furthermore, for ApproxMap 20/50%, the distance to the optimal delta varied from 16.76 to 27.33. Regarding time cost, the increased time required by ApproxMap relative to that of Alg_Sign varied from 2.86 to 3.12 on the logarithmic scale.
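The iteration counts quoted above follow from dividing the full range by the step: 1/0.2 = 5 and 1/0.5 = 2. The sketch below applies the same relation to the other step settings used in the experiments. This extrapolation is an assumption, as is reading the 5/12% label as η₂ = 12.5%, and the function name is hypothetical.

```python
import math


def iterations(step_percent):
    """Number of iterations needed to cover 100% in increments of step_percent."""
    if not 0 < step_percent <= 100:
        raise ValueError("step must be in (0, 100]")
    return math.ceil(100 / step_percent)


# (eta1, eta2) in percent for each ApproxMap configuration named in the experiments.
configs = {
    "ApproxMap 1/10%": (1, 10),
    "ApproxMap 5/12%": (5, 12.5),
    "ApproxMap 5/25%": (5, 25),
    "ApproxMap 20/50%": (20, 50),
}
for name, (eta1, eta2) in configs.items():
    print(f"{name}: {iterations(eta1)} same-partition, "
          f"{iterations(eta2)} cross-partition iterations")
```

Under this reading, moving from steps 1/10% to 20/50% cuts the same-partition iterations from 100 to 5, in line with the stated aim of reducing the number of comparisons for large datasets.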

Analysis of results

The satisfactory results of ApproxMap in the experiments confirm our hypothesis that mapping bnode pairs with the highest approximation can assist in reducing the delta size. Considering the tests where optimal delta values are known, ApproxMap obtained the optimal delta size in 59% of the tests. Alg_Hung and Alg_Sign found optimal solutions in 50% and 30% of the test cases, respectively.

Considering all experiments, except the final one with large datasets, ApproxMap found a delta size equal to that of Alg_Hung in 55% of the tests and smaller than that of Alg_Hung in 40% of the cases, except in the tests with steps 5/25%, where the delta found by ApproxMap was smaller in 41% of the test cases. Compared with Alg_Sign, ApproxMap obtained a smaller delta in 67% of the cases and the same delta in 33% of the cases. In the experiment with large datasets, ApproxMap performed better than Alg_Sign in all cases. Moreover, when compared to Alg_Sign, Alg_Hung found the same delta in 38% of the tests and a smaller delta in 60% of the tested cases.

Regarding mapping time, ApproxMap was faster than Alg_Hung in 84% of the tests and slower in the remaining 16%, except in the tests with steps 1/10%, where it was outperformed in 21% of the tested instances. ApproxMap was faster than Alg_Sign in 14%, 21%, and 24% of the tests with steps 1/10%, 5/12%, and 5/25%, respectively, and was outperformed in the other cases. In the experiment with large datasets, Alg_Sign was faster than ApproxMap in all tests. Alg_Sign was also faster than Alg_Hung in all the tests conducted with these algorithms.

Based on the experimental results, the empirically defined values for parameters η₁ and η₂ are considered to be satisfactory. Considering the tests with steps 5/12% as our reference, the decrease in η₁ from 5% to 1% (steps 1/10%) caused a reduction in the delta size in 12% of the tests, while we obtained the same delta in 81% of the cases. However, the consequent increase in mapping time was confirmed in 98% of the cases, while it remained the same in the other 2% of cases. On the other hand, with the increase in η₂ from 12.5% to 25% (steps 5/25%), a delta increase occurred in 17% of the tests, while we obtained the same delta in 78% of cases. However, a consequent reduction in mapping time occurred in 64% of the instances, while the time remained the same in 28% of the tested cases.

Furthermore, considering the impact of interconnected bnodes in the experiments, in cases without directly connected bnodes (with b_density=0), ApproxMap had the same delta size as Alg_Hung in all the tests. However, in cases where b_density>0, ApproxMap had the same delta size in 47% of the cases, and a smaller size than Alg_Hung in 47% of the tests, with the exception of tests with steps 5/25%, where the size was smaller in 49% of the tested instances.

On the other hand, analyzing the algorithms’ performance in the experiments with equivalent pairs, ApproxMap was faster than Alg_Hung in all tests. When considering the ratio between the time spent by these algorithms, the mapping time of Alg_Hung was up to 283 times greater than that of ApproxMap 5/25%. We also emphasize the results for the real dataset Italian, whose delta contained no triples with bnodes. In this case, Alg_Hung required a greater mapping time than ApproxMap 5/25% with a ratio of 2,982. However, Alg_Hung yielded a nonempty delta in 25% of the tests with equivalent datasets but with the datasets in reverse order. The bnode order did not affect ApproxMap in the experiments, because it imposes an internal ordering.

Based on these results, we can state that the ApproxMap method obtained satisfactory performance in the experiments, and its application is recommended in the versioning of RDF datasets. We intend to apply this algorithm in the design of an SCM method, as part of an integrated environment of tools for software engineering projects, based on the Semantic Web standards. Moreover, we emphasize the satisfactory performance of ApproxMap in mapping datasets with large numbers of equivalent elements. Thus, we recommend its application for version control following the recommended practices for SCM, considering a low percentage of changes between versions.

Conclusions

This paper aimed to develop a heuristic method for mapping blank nodes. The proposed method, called ApproxMap, applied extended concepts of RST, presented by Pawlak [7], in the handling of imprecision in bnode mapping. RST provided the necessary support to obtain a mapping between bnodes, seeking closer approximations between bnodes of the considered versions. The proposed modeling of blank nodes as approximate sets in an approximation space is an important contribution of this article. This modeling can be reused in other research domains involving blank node mapping.

In our method, the number of comparisons between bnodes is controlled by parameter η₁. Considering small values for the ratio 1/η₁, the proposed algorithm has a worst-case time complexity of O(n²), which arises for two completely different datasets whose bnodes have the same predicates.

ApproxMap showed satisfactory performance in our groups of experiments, as the algorithm that obtained solutions closest to the optimal values. This algorithm succeeded in finding the optimal delta size in 59% of the tests involving optimal values. Considering all tests with different values for parameters η₁ and η₂, ApproxMap achieved a delta size smaller than or equal to those of Alg_Hung and Alg_Sign in at least 95% and 100% of the tested cases, respectively. Regarding mapping time, ApproxMap was faster than Alg_Hung in at least 79% of the instances and slower than Alg_Sign in at least 76% of the tests.

Despite its mapping time being greater than that of Alg_Sign, which has a time cost of n·log n, we recommend applying ApproxMap in various situations, particularly those involving similar versions and directly connected bnodes. Great diversity between the bnodes in the same version is beneficial for ApproxMap. Thus, our algorithm can be successfully applied in RDF dataset versioning, such as that produced by software processes with iterative and incremental development.

As future work, we propose the creation of a parallel version of the ApproxMap algorithm to reduce the time required to compare bnodes of the two RDF bases. Furthermore, we propose a meticulous analysis of the appropriate choice of input steps η₁ and η₂, and of the impact of the adopted metrics and strategies on delta size. We also intend to investigate other RST metrics.

Acknowledgements

Many thanks to Christina Lantzaki and Yannis Tzitzikas for their help in the execution of tests using the Alg_Hung and Alg_Sign algorithms and also for making their synthetic datasets available. We also thank the reviewers for their help in improving the article.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JAM developed algorithms, performed experiments, and drafted the manuscript. AMC participated in design and coordination of the study. Both authors read and approved the final manuscript.


1. Klyne G, Carroll JJ, McBride B (2014) RDF 1.1 concepts and abstract syntax. World Wide Web Consortium, Recommendation. http://www.w3.org/TR/rdf11-concepts.

2. Monte-Mor JA, Cunha AM (2014) Galo: a semantic method for software configuration management. In: Information Technology: New Generations (ITNG), 2014 11th International Conference On, 33–39.

3. Antoniou G, van Harmelen F (2004) A Semantic Web primer. The MIT Press, London, England. p. 238.

4. Lee TB, Connolly D (2001) Delta: an ontology for the distribution of differences between RDF graphs. Technical report, W3C. http://www.w3.org/DesignIssues/Diff.

5. Zeginis D, Tzitzikas Y, Christophides V (2011) On computing deltas of RDF/s knowledge bases. ACM Trans Web 5(3): 14:1–14:36.

6. Tzitzikas Y, Lantzaki C, Zeginis D (2012) Blank node matching and RDF/s comparison functions. In: Proceedings of the 11th International Conference on The Semantic Web - Volume Part I. ISWC’12, 591–607. Springer, Berlin, Heidelberg.

7. Pawlak Z (1982) Rough sets. Int J Comput Inform Sci 11: 341–356.

8. do Carmo Nicoletti M, Uchôa JQ, Baptistini MTZ (2001) Rough relation properties. Int J Appl Math Comput Sci 11(3): 621–635.

9. Carroll JJ (2002) Matching RDF graphs. In: Proceedings of the First International Semantic Web Conference on The Semantic Web. ISWC ’02, 5–15. Springer, London, UK.

10. Noy NF, Kunnatur H, Klein M, Musen MA (2004) Tracking changes during ontology evolution. In: ISWC2004, Proceeding of the 3rd International Semantic Web Conference, Hiroshima, Japan, November 7-11, 2004, 259–273. Springer, Berlin, Heidelberg.

11. Noy NF, Musen MA (2002) Promptdiff: a fixed-point algorithm for comparing ontology versions. In: Eighteenth National Conference on Artificial Intelligence, 744–750. American Association for Artificial Intelligence, Menlo Park, CA, USA.

12. Noy NF, Musen MA (2004) Ontology versioning in an ontology management framework. IEEE Intell Syst 19(4): 6–13.

13. Auer S, Herre H (2006) A versioning and evolution framework for RDF knowledge bases. In: Proceedings of the 6th International Andrei Ershov Memorial Conference on Perspectives of Systems Informatics. PSI’06, 55–69. Springer, Berlin, Heidelberg.

14. Völkel M, Groza T (2006) SemVersion: An RDF-based ontology versioning system. In: Nunes MB (ed) Proceedings of IADIS International Conference on WWW/Internet (IADIS 2006), 195–202, Murcia, Spain.

15. Cassidy S, Ballantine J (2007) Version control for RDF triple stores. In: Filipe J, Shishkov B, Helfert M (eds) ICSOFT 2007, Proceedings of the Second International Conference on Software and Data Technologies, Volume ISDM/EHST/DC, Barcelona, Spain, July 22-25, 2007, 5–12. INSTICC Press, Setubal, Portugal.

16. Im D-H, Lee S-W, Kim H-J (2012) A version management framework for RDF triple stores. Int J Softw Eng Knowl Eng 22(1): 85–106.

17. Zeginis D, Tzitzikas Y, Christophides V (2007) On the foundations of computing deltas between RDF models. In: Proceedings of the 6th International The Semantic Web and 2nd Asian Conference on Asian Semantic Web Conference. ISWC’07/ASWC’07, 637–651. Springer, Berlin, Heidelberg.

18. Kuhn HW (1955) The Hungarian method for the assignment problem. Naval Res Logist Quart 2: 83–97.

19. Uchôa JQ (1998) Representação e indução de conhecimento usando teoria de conjuntos aproximados. Master’s thesis, Universidade Federal de São Carlos, São Carlos, Brasil.

20. Pawlak Z, Skowron A (2007) Rough sets: some extensions. Inform Sci 177(1): 28–40.

21. Isele R, Umbrich J, Bizer C, Harth A (2010) Ldspider: An open-source crawling framework for the web of linked data. In: Polleres A, Chen H (eds) ISWC Posters & Demos. CEUR Workshop Proceedings. CEUR-WS.org.

22. Bizer C, Heath T, Berners-Lee T (2009) Linked data - the story so far. Int J Semantic Web Inf Syst 5(3): 1–22.

23. Guo Y, Pan Z, Heflin J (2005) LUBM: a benchmark for owl knowledge base systems. Web Semant 3(2-3): 158–182.

24. Bizer C, Schultz A (2009) The Berlin SPARQL benchmark. Int J Semantic Web Inform Syst 5(2): 1–24.

Title: ApproxMap - a method for mapping blank nodes in RDF datasets
Authors: Juliano de Almeida Monte-Mor, Adilson Marques da Cunha
Publication date: 1 December 2015
Publisher: Springer London
Published in: Journal of the Brazilian Computer Society / Issue 1/2015
Print ISSN: 0104-6500
Electronic ISSN: 1678-4804
DOI: https://doi.org/10.1186/s13173-015-0022-3
