Background
Related work
Problem description
-
M(u r i)=u r i, for each u r i∈W 1∩N 1;
-
M(l i t)=l i t, for each l i t∈L 1;
-
M maps bnodes to bnodes (i.e., for each b∈B 1 it holds that M(b)∈B 2); and
-
triple (s,p,o) is in G 1 if, and only if, triple (M(s),p,M(o)) is in G 2.
Rough set theory
Basic concepts
-
Lower approximation of X in A - formed by the union of all elementary sets of A fully contained in X, i.e., the largest definable set in A contained in X:$$ A_{\text{inf}}(X) = \left\{x \in U | [x]_{R} \subseteq X\right\}. $$(7)
-
Upper approximation of X in A - formed by the union of all elementary sets of A having a nonempty intersection with X, i.e., the smallest definable set in A containing X:$$ A_{\text{sup}}(X) = \left\{x \in U | [x]_{R} \cap X \neq \emptyset\right\}. $$(8)
-
Positive region of X in A - formed by the union of all elementary sets of U fully contained in X:$$ \text{pos}\,(X) = A_{\text{inf}}\left(X\right). $$(9)
-
Negative region of X in A - formed by the elementary sets of U that have no elements in X:$$ \text{neg}\;(X) = U - A_{\text{sup}}(X). $$(10)
-
Doubtful region of X in A - also called the boundary of X, formed by the elementary sets of U that belong to the upper approximation, but do not belong to the lower approximation. The membership of an element of this region to set X is uncertain, based only on the equivalence classes of A:$$ \text{duv}\,(X) = A_{\text{sup}}(X) - A_{\text{inf}}\,(X). $$(11)
Some RST measures
-
Internal measure of X in A$$ \varpi_{\text{Ainf}}\,(X) = \left| A_{\text{inf}}\,(X)\right| $$(12)
-
External measure of X in A$$ \varpi_{\text{Asup}}(X) = \left| A_{\text{sup}}(X)\right| $$(13)
-
Quality of the lower approximation of X in A$$ \gamma_{\text{Ainf}}(X) = \frac{\varpi_{\text{Ainf}}(X)}{\left|U\right|} = \frac{\left| A_{\text{inf}}(X)\right|}{\left|U\right|} $$(14)
-
Quality of the upper approximation of X in A$$ \gamma_{\text{Asup}}(X) = \frac{\varpi_{\text{Asup}}(X)}{\left|U\right|} = \frac{\left| A_{\text{sup}}(X)\right|}{\left|U\right|} $$(15)
Methods
Blank nodes as rough sets
Extending the RST concepts
-
Positive change region - formed by the union of all elementary sets of U ij contained entirely in both X i and X j :$$ \text{pos}\left(X_{i}, X_{j}\right) = A_{\text{inf}}\left(X_{i}\right) \cap A_{\text{inf}}\left(X_{j}\right). $$(21)
-
Negative change region - formed by elementary sets of U ij that have no elements in X i or X j :$$ \text{neg}\left(X_{i}, X_{j}\right) = U_{ij} - \left(A_{\text{sup}}\left(X_{i}\right) \cap A_{\text{sup}}\left(X_{j}\right) \right). $$(22)
-
Doubtful change region - formed by elementary sets of U ij partially contained in X i or X j . In this case, X i or X j , but not both, may integrally contain elementary sets of U ij :$$ \begin{aligned} {}\! \text{duv}\left(X_{i}, X_{j}\right) \,=\, (A_{\text{sup}}\!\left(X_{i}\right) \!\cap\! A_{\text{sup}}\!\left(X_{j}\right)) \,-\, \left(A_{\text{inf}}\left(X_{i}\right) \!\cap\! A_{\text{inf}}\left(X_{j}\right)\! \right)\!. \end{aligned} $$(23)
-
Internal change measure$$ \varpi_{\text{Ainf}}\left(X_{i}, X_{j}\right) = \left| A_{\text{inf}}\left(X_{i}\right) \cap A_{\text{inf}}\left(X_{j}\right) \right| $$(24)
-
External change measure$$ \varpi_{\text{Asup}}\left(X_{i}, X_{j}\right) = \left| A_{\text{sup}}\left(X_{i}\right) \cap A_{\text{sup}}\left(X_{j}\right) \right| $$(25)
-
Quality of the lower change approximation$$ \begin{aligned} \gamma_{\text{Ainf}}\left(X_{i}, X_{j}\right) &= \frac{\varpi_{\text{Ainf}}\left(X_{i}, X_{j}\right)}{\left|U\right|}\\ &= \frac{\left| A_{inf}(X_{i}) \cap A_{inf}(X_{j}) \right|}{\left|U\right|} \end{aligned} $$(26)
-
Quality of the upper change approximation$$ \begin{aligned} \gamma_{\text{Asup}}\left(X_{i}, X_{j}\right) &= \frac{\varpi_{\text{Asup}}\left(X_{i}, X_{j}\right)}{\left|U\right|}\\ &= \frac{\left| A_{\text{sup}}(X_{i}) \cap A_{\text{sup}}(X_{j}) \right|}{\left|U\right|} \end{aligned} $$(27)
-
\(\left (b_{i} \equiv _{A_{\textit {ij}}} b_{j}\right) \Leftrightarrow \left (\gamma _{\text {Ainf}}(X_{i}, X_{j}) = 1\right)\);
-
\(\left (b_{i} \approx _{A_{\textit {ij}}} b_{j}\right) \Leftrightarrow \left (0 < \gamma _{\text {Asup}}\left (X_{i}, X_{j}\right) < 1\right)\);
-
\(\left (b_{i} \neq _{A_{\textit {ij}}} b_{j}\right) \Leftrightarrow \left (\gamma _{\text {Asup}}\left (X_{i}, X_{j}\right) = 0\right)\).
Exemplifying the modeling
-
ϖ Ainf(X 1,X 2)=5;
-
ϖ Ainf(X 1,X 3)=3;
-
ϖ Asup(X 1,X 2)=8;
-
ϖ Asup(X 1,X 3)=10;
-
γ Ainf(X 1,X 2)=5/10=0.5;
-
γ Ainf(X 1,X 3)=3/12=0.25;
-
γ Asup(X 1,X 2)=8/10=0.8;
-
γ Asup(X 1,X 3)=10/12≈0.83.
The ApproxMap method
Heuristic strategies
-
Two approximation metrics - we use metric γ Asup(X i ,X j ) if the candidate pairs have the same γ Ainf(X i ,X j ). A pair with a greater γ Asup(X i ,X j ) has a higher similarity owing to the greater number of common predicates. We consider that mapping pairs with more similar predicates can help in reducing the delta size.
-
Two levels for bnode partitioning - the first level considers the existing hierarchy between directly connected bnodes, classifying the bnodes into four disjoint sets: roots, leaves, intermediates, and no interconnections. Then, in the second partitioning level, we organize the bnodes according to the number of connections with other nodes, allowing quick access to sets of bnodes with a particular number of links.
-
Unmapped neighboring bnodes are the same for incoming links but differ for outbound links - while neighboring bnodes are unmapped, URIs and literals play an important role in distinguishing blank nodes. The strategy adopted by Tzitzikas et al. [6], whereby all neighbors as considered the same, can increase the delta size, if the mapped neighbors differ in the final mapping. Therefore, we aim to mitigate this effect by adopting the strategy described above, which considers the possible impact of different neighbors when computing the delta. With prior mapping of neighboring bnodes, we can find a greater approximation between candidate pairs.
-
Bottom-up approach to map directly connected bnodes - bnodes in the higher levels are mapped based on prior mappings in the lower levels. We compare each bnode mainly with those in the same hierarchical level, thereby reducing the number of comparisons. Relaxation of the same neighborhood for incoming links is due to this approach.
-
Top-down approximation during bnode mapping - bnodes are mapped iteratively considering a decreasing approximation in the interval (0.0,1.0]. We start the mapping of bnodes with the maximum approximation and, in each iteration, we reduce the lower limit for the desired approximation. Using this approach, we are able to reduce the number of comparisons between bnodes if the datasets contain vastly differing numbers of bnode links. This is because we do not need to compare bnode pairs that differ greatly in their numbers of links, thereby preventing an approximation greater than or equal to the desired value.
-
Initial equivalent bnode mapping - we can reduce the number of comparisons between the remaining bnodes that have not yet been mapped. Moreover, during the mapping of equivalent bnodes, we can also reduce the comparisons by applying filters to select only those bnodes in the same hierarchical level and with the same number of links as the other nodes.
Data structures
Mapping bnode pairs
Proposed method
Method analysis
Results and discussion
Real datasets
Dataset
|
|B|
|
|G|
|
D
a
|
b
density
|
b
len
|
---|---|---|---|---|---|
Swedish
| 522 | 3,670 | 5.47 | 0.00 | 0.00 |
Italian
| 6,390 | 49,897 | 3.42 | 0.00 | 0.00 |
Dataset
|
Swedish
|
Italian
| |
---|---|---|---|
Delta (triples) |
A
p
p
r
o
x
M
a
p 1/10%
| 297 | 6 |
A
p
p
r
o
x
M
a
p 5/12%
| 297 | 6 | |
A
p
p
r
o
x
M
a
p 5/25%
| 297 | 6 | |
A
l
g
Hung
| 297 | 6 | |
A
l
g
Sign
| 423 | 6 | |
Time (ms) |
A
p
p
r
o
x
M
a
p 1/10%
| 113 | 170 |
A
p
p
r
o
x
M
a
p 5/12%
| 36 | 158 | |
A
p
p
r
o
x
M
a
p 5/25%
| 34 | 153 | |
A
l
g
Hung
| 4,789 | 456,173 | |
A
l
g
Sign
| 37 | 59 |
Crawled datasets
Instance number
|
|B|
|
|G|
|
D
a
|
b
density
|
b
len
|
---|---|---|---|---|---|
1 | 19 | 1,048 | 9.00 | 0.01 | 0.11 |
2 | 83 | 11,555 | 7.31 | 0.00 | 0.00 |
3 | 361 | 28,208 | 5.93 | 0.00 | 0.00 |
4 | 362 | 28,219 | 5.96 | 0.00 | 0.00 |
5 | 893 | 15,337 | 4.40 | 0.00 | 0.02 |
Instance number
|
|B|
|
|G|
|
D
a
|
b
density
|
b
len
| |||||
---|---|---|---|---|---|---|---|---|---|---|
File 1
|
File 2
|
File 1
|
File 2
|
File 1
|
File 2
|
File 1
|
File 2
|
File 1
|
File 2
| |
1 | 169 | 19 | 4,355 | 1,048 | 5.73 | 9.00 | 0.21 | 0.01 | 16.26 | 0.11 |
2 | 190 | 83 | 11,892 | 11,470 | 5.82 | 7.31 | 0.07 | 0.00 | 1.67 | 0.00 |
3 | 1,246 | 893 | 24,364 | 15,337 | 5.13 | 4.40 | 0.10 | 0.00 | 10.88 | 0.02 |
4 | 1,963 | 361 | 27,650 | 28,208 | 6.75 | 5.93 | 0.00 | 0.00 | 0.00 | 0.00 |
5 | 1,967 | 362 | 28,031 | 28,219 | 6.74 | 5.96 | 0.00 | 0.00 | 0.01 | 0.00 |
Synthetic datasets
Datasets from adapted Univ-Bench artificial generator
Instance number
|
|G|
|
D
a
|
b
density
|
b
len
|
Δ
opt
/|G|
(%)
|
---|---|---|---|---|---|
1 | 5,846 | 13.4 | 0 | 0 | 1 |
2 | 5,025 | 10.5 | 0.1 | 1 | 0.5 |
3 | 2,381 | 7 | 0.15 | 1 | 1.5 |
4 | 1,628 | 5 | 0.2 | 1 | 1.5 |
5 | 1,636 | 5 | 0.2 | 1.15 | 1 |
6 | 1,399 | 4 | 0.25 | 1.15 | 1.7 |
7 | 919 | 3 | 0.32 | 1.15 | 3.2 |
8 | 909 | 3.25 | 0.4 | 1.35 | 2.7 |
9 | 1,001 | 3.94 | 0.5 | 21.5 | 2.5 |
Datasets with directly connected bnodes
Datasets from adapted BSBM generator
Changing the size of the datasets
Instance
|
|B|
|
|G|
|
D
a
|
b
density
|
b
len
|
Δ
opt
/|G|
(%)
|
---|---|---|---|---|---|---|
number
| ||||||
1 | 400 | 3,390 | 7.29 | 0.29 | 60.42 | 53.39 |
2 | 800 | 7,000 | 7.78 | 0.33 | 127.34 | 53.02 |
3 | 1,200 | 10,806 | 8.30 | 0.37 | 212.71 | 52.74 |
4 | 1,600 | 14,061 | 7.88 | 0.34 | 200.95 | 52.49 |
5 | 2,000 | 17,541 | 7.86 | 0.34 | 270.09 | 52.71 |
Changing delta size
Instance number
|
|G|
|
D
a
|
b
density
|
b
len
|
Δ
opt
/|G|
(%)
|
---|---|---|---|---|---|
1 | 17,895 | 8.47 | 0.38 | 231.60 | 15.86 |
2 | 17,639 | 7.99 | 0.35 | 304.20 | 31.73 |
3 | 17,868 | 8.19 | 0.37 | 326.40 | 47.34 |
4 | 17,841 | 8.13 | 0.36 | 265.72 | 62.91 |
5 | 18,014 | 8.28 | 0.37 | 340.61 | 78.35 |
6 | 17,695 | 7.95 | 0.34 | 328.19 | 94.06 |