1 Introduction
subjects, predicates, and objects are all resources identified by IRIs.1 Objects can also be literals (e.g., a number, a string), which can be annotated with optional type information, called a datatype. A datatype is a classification of data; RDF adopts its datatypes from XML Schema [25]. There are two classes of datatypes: simple and complex. Simple datatypes can be primitive (e.g., boolean, float), derived (e.g., long, int derived from decimal), or user defined, which are built from primitive and derived datatypes by constraining some of their properties (e.g., range, precision, length, format). Complex datatypes contain elements defined as either simple or complex datatypes.

"20.000" and "20.0", these objects are considered as different, because of the missing datatype. However, if they were annotated as follows: "20.000"^^xml:decimal and "20.0"^^xml:decimal, then one can conclude that both objects are identical. Works on XML Schema matching showed that the presence of datatype information, constraints, and annotations on an object improves the similarity between two documents (up to 14%) [2]. Moreover, recent studies in the context of XML/RDF document matching have analyzed datatypes to increase the compatibility/integration among data [9, 18, 23, 29]. However, a huge quantity of RDF documents is incomplete or inconsistent in terms of datatypes [15, 27]. Hence, when datatypes are missing, datatype inference emerges as a new challenge in order to obtain more accurate RDF document matching results.
-
Several theoretical studies were conducted in the context of XML Schema Definition (XSD) [6, 7, 14]. They mainly infer simple datatypes through a hierarchy among the candidate datatypes, which are obtained by pattern matching on the format of the values, i.e., the characters that make a datatype unique, called the lexical space in the W3C Recommendation [25]. These works consider a limited number of simple datatypes (e.g., date, decimal, integer, boolean, and string) and choose the most specific datatype among the candidates. However, datatypes such as gYear (e.g., 1999) cannot be determined, since the value is identified as an integer. The other theoretical studies, in the context of programming languages and OWL, have focused on inferring complex datatypes through axioms, assigned operations, and inference rules [11, 16, 26], without considering simple datatypes.
-
There are many tools available on the Web that infer datatypes mainly by pattern matching against the lexical spaces. As their inference criteria are unknown, the tools provide different datatypes for the same XML data.

string, decimal, and base64Binary are candidate datatypes for the literal value “1”. Moreover, only primitive datatypes were considered in the study, but derived datatypes are also part of simple datatypes; thus, the inference is incomplete for simple datatypes. For that reason, we extend our datatype inference framework, called RDF Datatype inFerring Framework (RDF-F), by proposing a new process based on the modification of the existing literal values through new non-ambiguous lexical spaces, as an alternative to the four-step process, to infer simple datatypes (primitive and derived). We focus on eliminating the ambiguity among lexical space representations, due to the high performance obtained in our previous study.

2 Motivating Scenario
Light Switch, with a property (predicate) isLight, whose datatype is boolean. However, they are represented with different lexical spaces: binary lexical space with a value 1 in Fig. 1a and string lexical space with a value true in Fig. 1b. In both cases, the isLight property expresses the state of the light switch (i.e., turned on or turned off). Figure 1c shows the concept Light Bulb, with a property Light, whose datatype is float, and a property weight with datatype double.

Light Bulb concept is different from the other ones. Indeed, the Light property is expressed with float values, expressing the light intensity, which has nothing to do with the light switch state (i.e., turned on or turned off).

1 is not compatible with the value true, which can be considered as a string datatype instead of boolean). Moreover, the integration of concept Light Switch from Fig. 1a with concept Light Bulb from Fig. 1c will be possible, even though it is incorrect. The Light properties of both respective documents are compatible because the lexical spaces of their values are the same (1 and 1250, respectively, can be integer). With the presence of datatype information, we can avoid this ambiguity even if the lexical spaces of the values are compatible.

3 Related Work
-
Consideration of simple datatypes, since this is the scope of the work;
-
Analysis of local information, such as object values and predicates;
-
Analysis of external information, since the Semantic Web allows the integration of external resources;
-
Suitability for the Semantic Web: the whole method should be objective, complete, and applicable to any domain.
3.1 Theoretical Approaches
3.1.1 Hierarchy-Based Approaches
date, decimal, integer, boolean, and string). They propose a hierarchy between the reduced datatypes according to the lexical spaces of the W3C Recommendation (see Fig. 2). The proposal returns the most specific datatype that subsumes the candidate datatypes obtained by the pattern matching of the values. However, a gYear value is reduced to integer, which is incorrect. In the same context, the author of [6, 7] proposes an inference method based on a hierarchy applied to a set of candidate datatypes. This set contains some derived datatypes of the numeric group, such as nonNegativeInteger, unsignedInteger, and unsignedShort. In this case, the smallest datatype is chosen. For example, for a literal value 1999, whose datatype is gYear, the smallest among the candidate datatypes is unsignedShort, according to the hierarchy shown in Fig. 3. Table 1 shows the lexical spaces of simple datatypes according to the W3C.
Datatype | Lexical space | Examples |
---|---|---|
string | Any character | “Example 123” |
duration | PnYnMnDTnHnMnS | P1Y2M3DT10H30M |
dateTime | CCYY-MM-DDThh:mm:ss-UTC | 1999-05-31T13:20:00-05:00 |
time | hh:mm:ss | 13:20:00-05:00 |
date | CCYY-MM-DD | 1999-05-31 |
gYearMonth | CCYY-MM | 1999-05 |
gYear | CCYY | 1999 |
gMonthDay | –MM-DD | –05-31 |
gDay | –DD | –31 |
gMonth | –MM– | –05 |
boolean | true, false, 1, 0 | false |
base64Binary | Base64-encoded | 0YZZ |
hexBinary | Hex-encoded | 0FB7 |
float | 32-bit floating point type | 12.78e-2, 1999 |
decimal | Arbitrary precision | 12.78e-2, 1999 |
double | 64-bit floating point type | 12.78e-2, 1999 |
3.1.2 Function-Based Approaches
date and integer are mainly inferred by a pattern-matching process of the value format using the lexical spaces. However, several simple datatypes whose lexical spaces intersect, such as gYear and integer, cannot be inferred using this pattern-matching process.

3.1.3 Knowledge-Based Approaches
dbr:Barack_Obama) and datatype property (syntactic type, e.g., xsd:string). They propose an approach to infer the semantic type of string literals using the word detection technique called Stanford CoreNLP3 to identify the principal term and the UMBC4 semantic similarity service to discover the semantic class. However, a semantic type is not always related to the same datatype, since it depends on the datatype defined in the structure. For example, the same data can be expressed as a string or integer according to two different ontologies.

3.2 Tools
weight and isLight from the following XML document extracted from Fig. 1 have different inferred datatypes according to these three tools:
-
XMLgrid infers weight as double and isLight as int;
-
FreeFormatted infers weight as float and isLight as byte;
-
XmlSchemaInference infers weight as decimal and isLight as unsignedByte.
Work | Inference method | Simple datatypes | Local information | External information | XML/XSD | RDF–OWL |
---|---|---|---|---|---|---|
 | Hierarchy/lexical space | Reduced set | ✓ | X | ✓ | X |
 | Functions (axioms, operations, and constructors) | Only complex | ✓ | X | ✓ | X |
 | Knowledge (inference rules) | Only complex | ✓ | X | X | ✓ |
[13] | Knowledge (semantic analysis) | Only string | ✓ | ✓ | X | ✓ |
 | Not provided | Not provided | ✓ | X | ✓ | X |
[10] | IRI information, Lexical space, Semantic analysis, Generalization | Only primitive | ✓ | ✓ | X | ✓ |
4 RDF Terminologies and Definitions
rdfs:domain property designates the type of subject that can be associated with a predicate and the rdfs:range property designates the type of object. The Semantic Web proposes an implicit representation of the datatype property in the literal object as a description of the value (e.g., "value"^^xml:string). Definition 1 presents the formal definition of a simple datatype according to W3C [17].

boolean from Fig. 1a has the following characteristics:
-
\(VS(\texttt{boolean}) = \{\mathrm{true}, \mathrm{false}\}\);
-
\(LS(\texttt{boolean}) = \{\text{"true"}, \text{"false"}, \text{"1"}, \text{"0"}\}\);
-
\(L2V(\texttt{boolean}) = \{\text{"true"} \Rightarrow \mathrm{true}, \text{"false"} \Rightarrow \mathrm{false}, \text{"1"} \Rightarrow \mathrm{true}, \text{"0"} \Rightarrow \mathrm{false}\}\)
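
The three components listed above can be sketched as plain Python data. This is only an illustration of the boolean example: the names `BOOLEAN_L2V`, `BOOLEAN_LS`, `BOOLEAN_VS`, and `same_value` are hypothetical and not part of RDF-F.

```python
# Lexical-to-value mapping (L2V) for boolean, taken from the example above.
BOOLEAN_L2V = {"true": True, "false": False, "1": True, "0": False}

# The lexical space (LS) is the set of accepted spellings,
# the value space (VS) is the set of values they denote.
BOOLEAN_LS = set(BOOLEAN_L2V)
BOOLEAN_VS = set(BOOLEAN_L2V.values())


def same_value(lexical_a, lexical_b, l2v=BOOLEAN_L2V):
    """Return True if both lexical forms map to the same value.

    Returns False if either form is outside the lexical space.
    """
    if lexical_a not in l2v or lexical_b not in l2v:
        return False
    return l2v[lexical_a] == l2v[lexical_b]
```

With the datatype known, the literals 1 and true of the Light Switch examples denote the same value (`same_value("1", "true")` is true), although they differ as plain strings.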
Set | Description |
---|---|
I | A set of IRIs defined as: \(I = \{ i \mid i \text{ is an IRI} \}\) |
L | A set of literal nodes defined as: \(L = \{ l \mid l \text{ is a literal node} \}\) |
BN | A set of blank nodes defined as: \(BN = \{ bn \mid bn \text{ is a blank node} \}\) |
DT | A set of datatypes defined as: \(DT = \{ dt \mid dt \text{ is a datatype} \}\) |
SDT | The set of simple datatypes proposed by the W3C defined as: SDT = \(\{\) string, duration, dateTime, time, date, gYearMonth, gDay, gMonth, boolean, base64Binary, hexBinary, float, decimal, double \(\}\) |
-
\(s \in I \cup BN\) represents the subject to be described;
-
p is a property defined as an IRI in the form \({\texttt {namespace\_prefix:property\_name}}\); \(namespace\_prefix\) is a local identifier of the IRI, where the property (\(property\_name\)) is defined;
-
\(o \in I \cup BN \cup L\) describes the object.
-
\(t_1\): \(\langle \texttt{Light Switch}, \texttt{house:isLight}, 1 \rangle\)
-
\(t_2\): \(\langle \texttt{Light Switch}, \texttt{house:isLight}, \mathrm{true} \rangle\)
-
\(t_3\): \(\langle \texttt{Light Bulb}, \texttt{light:Light}, 1250 \rangle\)
-
\(t_4\): \(\langle \texttt{Light Bulb}, \texttt{dbp:weight}, 30.00 \rangle\)
5 RDF-F: Our Inference Process Approach
5.1 RDF-F: Four-Step Process
5.1.1 Predicate Information Analysis (Step 1)
p establishes the relationship between the subject s and the object o, making the object value o a characteristic of s. Information (properties) such as rdfs:domain and rdfs:range can be associated with each predicate to determine the type of subject and object, respectively. To deduce the simple datatype of a particular literal object, we propose to inspect the property rdfs:range, when it exists. We formally describe Step 1 with the following definitions and rule.
-
\(s_i = t\cdot p \mid\) \(s_i\) is the subject of a triple \(t_i\);
-
\(p_i\) is an RDF defined property \(\in \{\)rdf:type, rdfs:label, rdfs:range\(\}\);
-
\(o_i\) is the object of \(t_i\).
dbp:weight presented in Fig. 1c.

dbp:weight
Subject | Predicate (property) | Object (value) |
---|---|---|
dbp:weight | rdf:type | owl:DatatypeProperty |
dbp:weight | rdfs:label | gewicht (g) (de) |
dbp:weight | rdfs:label | gewicht (g) (nl) |
dbp:weight | rdfs:label | peso (g) (pt) |
dbp:weight | rdfs:label | poids (g) (fr) |
dbp:weight | rdfs:label | weight (g) (en) |
dbp:weight | rdfs:label | weight (g) (en) |
dbp:weight | rdfs:range | xsd:double |
dbp:weight | prov:wasDerivedFrom |
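
A minimal sketch of the range lookup over such predicate information, assuming the triples have already been retrieved and are held as plain tuples. The function name `predicate_range` and the list `weight_info` are illustrative; only rows of the table above with a complete object are copied.

```python
RDFS_RANGE = "rdfs:range"


def predicate_range(predicate_info):
    """Return the object of the first rdfs:range triple, or None if absent.

    predicate_info is an iterable of (subject, property, object) tuples
    describing one predicate. None stands for the unknown datatype.
    """
    for _subject, prop, obj in predicate_info:
        if prop == RDFS_RANGE:
            return obj
    return None


# A few rows of the dbp:weight predicate information.
weight_info = [
    ("dbp:weight", "rdf:type", "owl:DatatypeProperty"),
    ("dbp:weight", "rdfs:label", "weight (g) (en)"),
    ("dbp:weight", "rdfs:range", "xsd:double"),
]
```

For `weight_info` the lookup yields `xsd:double`; for a predicate without an rdfs:range triple it yields `None`, and the later steps have to decide.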
rdfs:range property, defined as:

dbp:weight (see Table 4), the Predicate Range Information function returns the value xsd:double.

rdfs:range property exists from the set of triples extracted by Definition 3. Algorithm 1 is the pseudo-code of how this rule can be implemented in a high-level programming language.

dbp:weight (more specifically, http://www.dbpedia.org/ontology/weight) is the predicate, we can get the list of triples shown in Table 4. (Each row represents a triple.) If among these triples, there is the property rdfs:range, then its associated object value, which is the datatype, is returned (lines 3 to 5). Otherwise, an unknown datatype is returned (line 7, Definition 4).

xsd:double.

?subject ?predicate ?literal are the triple to be analyzed; ?predicate is the analyzed predicate; and ?datatype is the returned result.

5.1.2 Datatype Lexical Space Analysis (Step 2)
1999-05-31 matches with the format CCYY-MM-DD, which is the lexical space of datatype date). However, in other cases (such as boolean, gYear, decimal, double, float, integer, base64Binary, and hexBinary), the lexical spaces of datatypes have common characteristics, leading to a certain ambiguity (e.g., value 1999 matches with lexical spaces of gYear and float; see Table 1). Figure 5 graphically illustrates the lexical space intersections of W3C simple datatypes (primitives and integer).
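
The ambiguity described above can be reproduced with a small pattern-matching sketch. The regular expressions below are simplified assumptions chosen for illustration, not the full W3C lexical spaces, and `candidate_datatypes` is a hypothetical helper name.

```python
import re

# Simplified, illustrative patterns (assumptions, not the complete W3C definitions).
LEXICAL_SPACES = {
    "date": r"\d{4}-\d{2}-\d{2}",
    "gYear": r"\d{4}",
    "boolean": r"1|0|true|false",
    "integer": r"[+-]?\d+",
    "decimal": r"[+-]?(\d*\.)?\d+",
    "float": r"[+-]?(\d*\.)?\d+([eE][+-]?\d+)?",
    "double": r"[+-]?(\d*\.)?\d+([eE][+-]?\d+)?",
    "hexBinary": r"([0-9a-fA-F]{2})+",
}


def candidate_datatypes(value):
    """Return every datatype whose pattern matches the whole value.

    string is always the first candidate, since any value is a string.
    """
    candidates = ["string"]
    for name, pattern in LEXICAL_SPACES.items():
        if re.fullmatch(pattern, value):
            candidates.append(name)
    return candidates
```

Under these patterns, `1999-05-31` yields only string and date, whereas `1999` matches gYear, integer, decimal, float, double, and hexBinary at once, which is the ambiguous case.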
1 presented in Fig. 1a is: CDT(1)={float, decimal, double, hexBinary, base64Binary, integer, boolean, string}.

string is a candidate datatype, since it has the most general lexical space (see Fig. 5); if the number of candidate datatypes is one, then the only datatype, which is string, is returned. If the number of candidate datatypes is two, then the other datatype is returned. Otherwise, we have an ambiguous case and any datatype, different from string, can be provided. Hence, the inference process remains incomplete due to the ambiguous cases and further analysis is needed.

string, because any object value is a string (line 1). According to the lexical spaces defined by the W3C (see Table 1), the list of candidate datatypes is generated by a pattern-matching process (line 2 in Algorithm 2, Definition 6) following the order obtained from the lexical space intersections. If the number of candidate datatypes is more than 2, we are in an ambiguous case, since the lexical space of the literal value matches several lexical spaces of the datatypes (lines 3–4 of Algorithm 2). If we have only string as a candidate datatype, then this is the returned information (line 7 of Algorithm 2). If we get two candidate datatypes, one of them is the string datatype and the other one is the datatype returned for the object value (line 9 of Algorithm 2).

5.1.3 Predicate Semantic Analysis (Step 3)
boolean, gYear, decimal, double, float, integer, base64Binary, and hexBinary are ambiguous. However, the ambiguity of boolean, gYear, and integer, in some specific scenarios, can be resolved by examining the context of its predicate according to a knowledge base.

dbp:dateOfBirth has the context date, and then it is possible to assume gYear as the datatype; the predicate dbp:era has the context period and the datatype assigned can be integer; however, for predicate dbp:salary, it is possible to assign datatypes decimal, double, or float; the ambiguous case persists.
-
Similarity (sim): Given two entities n and m, Similarity is a function, denoted as \(\mathrm{sim}(n,m)\), that returns the value of the relation among both entities:$$\begin{aligned} \mathrm{sim}(n,m) = \text{a relation value} \in [0,1]\ \text{between}\ n\ \text{and}\ m\ \text{according to KB}. \end{aligned}$$
-
IsPlural (IP): Given an entity n, IsPlural is a function, denoted as IP(n), that returns True if the entity n is plural:$$\begin{aligned} \mathrm{IP}(n) = {\left\{ \begin{array}{ll} \mathrm{True} &\quad \text{if}\ n\ \text{is plural according to KB}; \\ \mathrm{False} &\quad \text{otherwise}. \end{array}\right. } \end{aligned}$$
-
IsCondition (IC): Given an entity n, IsCondition is a function, denoted as IC(n), that returns True if the entity n is a condition:$$\begin{aligned} \mathrm{IC}(n) = {\left\{ \begin{array}{ll} \mathrm{True} &\quad \text{if}\ n\ \text{is a condition according to KB}; \\ \mathrm{False} &\quad \text{otherwise}. \end{array}\right. } \end{aligned}$$In this scenario, our knowledge base is reduced to the relations among words.
weight is: \(\mathtt{CT = \{\langle weight,load,0.8\rangle , \langle weight,heaviness,0.5\rangle , \langle weight,obesity,0.4\rangle , \langle weight,size,0.3\rangle \}}\).
-
If date is in the context (e.g., \(\langle word, \mathrm{date}, 0.5\rangle\), with \(h=0.5\)) and the literal value is a number (e.g., 1999), then the datatype is gYear, because gYear (1999) is a part of datatype date (1999-05-31);
-
If period is in the context (e.g., \(\langle word, \mathrm{period}, 0.5\rangle\), with \(h=0.5\)) and the literal value is a number (e.g., 3 months), then the datatype is integer, because it is about quantity.
-
However, if the context is date, the word from which we obtain the context cannot be plural, since plural words express quantities. Thus, in this case the word is related to the datatype integer according to our scenarios.
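
The three context rules above can be sketched as follows. This is an assumption-laden illustration: contexts are given as (word, related word, similarity) tuples, the plural test is passed in as a flag instead of being looked up in a knowledge base, and the name `infer_from_context` is hypothetical.

```python
def infer_from_context(value, context, is_plural, h=0.5):
    """Apply the date/period rules to a numeric literal.

    value     -- the literal value as a string
    context   -- list of (word, related_word, similarity) tuples
    is_plural -- whether the predicate word is plural
    h         -- similarity threshold
    Returns "gYear", "integer", or None when no rule applies.
    """
    if not value.isdigit():
        return None
    # Keep only the related words whose similarity reaches the threshold.
    related = {rel for _word, rel, sim in context if sim >= h}
    if "date" in related:
        # A plural word expresses a quantity, so it cannot denote a year.
        return "integer" if is_plural else "gYear"
    if "period" in related:
        return "integer"
    return None
```

With the weight context given earlier, neither date nor period reaches the threshold, so the function returns `None` and the ambiguity is left to the next step.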
boolean, we assume that a word is defined as a condition in a knowledge base (e.g., WordNet).

5.1.4 Generalization of Numeric and Binary Groups (Step 4)
decimal, double, float, integer, base64Binary, and hexBinary, we propose two groups of datatypes: Numeric and Binary. In each group, we define a total order among the datatypes by considering lexical space intersection (see Fig. 5). Hence, for the Numeric group, we have decimal > double > float > integer and in the Binary group, base64Binary > hexBinary. According to these groups, we return the most general datatype, if all candidate datatypes belong only to one of these two groups.

string is always part of candidate datatypes. We formally define our fourth inference rule as follows.

decimal and base64Binary as candidate datatypes because of similar value representations and our inference approach cannot determine the most appropriate datatype.

decimal and base64Binary) (line 2 in Algorithm 4). If the list of candidate datatypes has only one value, the datatype is string (line 4 in Algorithm 4); however, if there are two, the datatype is the second one (line 6 in Algorithm 4), since the first one is always string. If there are more than two datatypes, the ambiguity persists and this step is not able to produce a result.

5.2 RDF-F: Non-ambiguous Lexical-Space-Matching Process
float representation is: \(\mathtt{f[+-]?([0-9]*[.])?(E|e)?[0-9]+}\).

Group | Simple datatype | W3C | Proposal |
---|---|---|---|
Primitive | boolean | (1\(\mid\)0\(\mid\)true\(\mid\)false) | b(1\(\mid\)0\(\mid\)true\(\mid\)false) |
Primitive | gYear | [1-9]{1,4} | y[1-9]{1,4} |
Primitive | decimal | [+-]?([0-9]*[.])?[0-9]+ | (de)[+-]?([0-9]*[.])?[0-9]+ |
Primitive | float | [+-]?([0-9]*[.])?(E\(\mid\)e)?[0-9]+ | f[+-]?([0-9]*[.])?(E\(\mid\)e)?[0-9]+ |
Primitive | double | [+-]?([0-9]*[.])?(E\(\mid\)e)?[0-9]+ | d[+-]?([0-9]*[.])?(E\(\mid\)e)?[0-9]+ |
Primitive | hexBinary | 0[xX][0-9a-fA-F]+ | hB0[xX][0-9a-fA-F]+ |
Derived | integer | [+-]?[0-9]+ | I[+-]?[0-9]+ |
Derived | negativeInteger | -[0-9]+ | nI(-[0-9]+) |
Derived | nonNegativeInteger | 0\(\mid\)(\+?[0-9]+) | nNI(0\(\mid\)(\+?[0-9]+)) |
Derived | positiveInteger | \+?[1-9]+[0-9]* | pI\+?[1-9]+[0-9]* |
Derived | nonPositiveInteger | 0\(\mid\)(-[0-9]+) | nPI(0\(\mid\)(-[0-9]+)) |
Derived | long | [+-]?[0-9]+ | l[+-]?[0-9]+ |
Derived | int | [+-]?[0-9]+ | i[+-]?[0-9]+ |
Derived | short | [+-]?[0-9]+ | s[+-]?[0-9]+ |
Derived | unsignedLong | [0-9]+ | uL[0-9]+ |
Derived | unsignedInt | [0-9]+ | uI[0-9]+ |
Derived | unsignedShort | [0-9]+ | uS[0-9]+ |
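
A minimal sketch of matching against the prefixed lexical spaces, using a subset of the proposal column above (boolean, gYear, decimal, float, integer) transcribed as Python regular expressions. The function name `infer_prefixed` and the choice of subset are assumptions made for illustration.

```python
import re

# Subset of the proposed prefixed lexical spaces, transcribed from the table.
PREFIXED_SPACES = {
    "boolean": r"b(1|0|true|false)",
    "gYear": r"y[1-9]{1,4}",
    "decimal": r"(de)[+-]?([0-9]*[.])?[0-9]+",
    "float": r"f[+-]?([0-9]*[.])?(E|e)?[0-9]+",
    "integer": r"I[+-]?[0-9]+",
}


def infer_prefixed(value):
    """Return the single datatype whose prefixed pattern matches the value.

    Returns None when no pattern, or more than one pattern, matches.
    """
    matches = [
        name
        for name, pattern in PREFIXED_SPACES.items()
        if re.fullmatch(pattern, value)
    ]
    return matches[0] if len(matches) == 1 else None
```

In this sketch, `y1999` and `I1999` resolve to gYear and integer respectively, while the unprefixed `1999` matches nothing.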
6 Complexity Analysis
-
In Step 1, the predicate information of each triple is extracted to search for the rdfs:range property. Since the number of properties associated with the predicate of each triple (Definition 3) is constant, its execution order is O(n).
-
In Step 2, for each triple a pattern matching is executed for all simple datatypes (finite number of executions); thus, it is of linear order (O(n)).
-
In Step 3, for each triple, its set of contexts is extracted to determine the best related word (in constant time); thus, its time complexity is also O(n).
-
Finally, Step 4 reduces the finite set of candidate datatypes (generalization) in linear order (O(n)).
7 Experimental Evaluation
integer, gYear, date, gMonthDay, float, nonNegativeInteger, double, and decimal). Consequently, we chose DBpedia as the dataset to perform our experiments.
-
Case 1: 5603 RDF documents gathered from DBpedia person data,9 in which 1059822 triples, 38292 literal objects, and 8 different datatypes are available.
-
Case 2: the whole DBpedia person data as a unique RDF document with 16842176 triples, in which only datatypes date, gMonthDay, and gYear are present.
7.1 Accuracy Evaluation
7.1.1 Four-Step Inference Process
Four-step inference process | Valid | Invalid | Ambiguous | Precision (%) | Recall (%) | F-score (%) |
---|---|---|---|---|---|---|
Case 1: Step 1 | 24,033 | 26 | 14,233 | 99.89 | 62.81 | 77.12 |
Case 1: Step 2 | 16,898 | 537 | 20,857 | 96.92 | 44.76 | 61.24 |
Case 1: Step 3 | 2480 | 119 | 35,812 | 95.20 | 6.18 | 11.62 |
Case 1: Step 4 | 16,899 | 1962 | 19,431 | 89.60 | 46.52 | 61.24 |
Case 1: Step \(1 + 2\) | 33,771 | 281 | 4240 | 99.17 | 88.85 | 93.73 |
Case 1: Step \(1 + 3\) | 26,394 | 145 | 11,753 | 99.45 | 69.19 | 81.61 |
Case 1: Step \(2 + 3\) | 19,259 | 656 | 18,377 | 96.71 | 51.17 | 66.93 |
Case 1: Step \(1 + 4\) | 33,772 | 999 | 3521 | 97.13 | 90.56 | 93.73 |
Case 1: Step \(2 + 4\) | 16,899 | 1962 | 19,431 | 89.60 | 46.52 | 61.24 |
Case 1: Step \(1 + 2 + 3\) | 36,132 | 400 | 1760 | 98.91 | 95.36 | 97.10 |
Case 1: Step \(1 + 2 + 4\) | 33,772 | 999 | 3521 | 97.13 | 90.56 | 93.73 |
Case 1: Step \(2 + 3 + 4\) | 19,260 | 1811 | 17,221 | 91.41 | 52.79 | 66.93 |
Case 1: whole process | 36,132 | 551 | 1609 | 97.71 | 96.50 | 97.10 |
Case 2: whole process | 2,250,402 | 710,234 | 0 | 76.01 | 100.00 | 86.37 |
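
The per-step rows of the table above are numerically consistent with precision = valid / (valid + invalid), recall = valid / (valid + ambiguous), and the F-score as their harmonic mean. This reading is inferred from the reported numbers and is an assumption; the helper name `accuracy` is illustrative.

```python
def accuracy(valid, invalid, ambiguous):
    """Return (precision, recall, f_score) in percent.

    Assumes precision counts invalid inferences as errors and
    recall counts ambiguous (unresolved) literals as misses.
    """
    precision = valid / (valid + invalid)
    recall = valid / (valid + ambiguous)
    f_score = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f_score


# Row "Case 1: Step 1" of the table: 24,033 valid, 26 invalid, 14,233 ambiguous.
step1 = accuracy(24033, 26, 14233)
```

For the Step 1 row this gives values that round to the reported 99.89, 62.81, and 77.12.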
date was not correctly inferred 7 times; however, according to the W3C Recommendation, its lexical space representation is unique and the datatype can be inferred by a simple lexical space matching. In the data, these 7 cases have the format YY-MM-DD instead of CCYY-MM-DD, which is the cause of the incorrect inferences (inconsistencies of the data).

Datatype (Case 1) | Valid | Invalid | Ambiguous | Precision (%) | Recall (%) | F-score (%) |
---|---|---|---|---|---|---|
integer | 13,567 | 424 | 1311 | 96.37 | 91.72 | 93.99 |
gYear | 5067 | 1 | 0 | 99.98 | 100 | 99.99 |
date | 16,446 | 7 | 0 | 99.91 | 100 | 99.98 |
gMonthDay | 459 | 0 | 0 | 100 | 100 | 100 |
float | 0 | 142 | 0 | 0 | NaN | NaN |
double | 266 | 1 | 0 | 100 | 99.63 | 99.81 |
nonNegativeInteger | 77 | 0 | 0 | 100 | 100 | 100 |
decimal | 0 | 0 | 1 | NaN | 0 | NaN |
Complex | 250 | 273 | 0 | 47.80 | 100 | 64.68 |
Total | 36,132 | 934 | 1226 | 97.71 | 96.50 | 97.10 |
dbo:deathDate should have the datatype property date, but in the queried datasets, it was set as gYear).

Work | Precision (%) | Recall (%) | F-score (%) |
---|---|---|---|
Xstruct | 83.28 | 100 | 90.88 |
XMLgrid | 83.61 | 100 | 91.07 |
FreeFormatted | 43.32 | 100 | 60.45 |
XMLMicrosoft | 43.23 | 100 | 60.36 |
Four-step process | 97.71 | 96.50 | 97.10 |
Availability of datatypes (%) | Precision (%) | Recall (%) | F-score (%) |
---|---|---|---|
0 | 97.71 | 96.50 | 97.10 |
25 | 97.78 | 96.47 | 97.12 |
50 | 97.66 | 96.66 | 97.16 |
75 | 97.64 | 96.91 | 97.27 |
7.1.2 Non-ambiguous Lexical-Space-Matching Process
date was considered as string in 7 cases, where the lexical space representation did not match with the current W3C lexical spaces. No complex datatypes that are also present in Case 1 were inferred. The total Precision, Recall, and F-score values are 99.98%, 98.98%, and 99.30%, respectively. Comparing the obtained values with the four-step process, we can observe a better accuracy, i.e., 97.27% for the four-step inference process and 99.98% for the non-ambiguous lexical-space-matching process.

Datatype (Case 1) | Valid | Invalid | Ambiguous | Precision (%) | Recall (%) | F-score (%) |
---|---|---|---|---|---|---|
integer | 15,302 | 0 | 0 | 100.00 | 100.00 | 100.00 |
gYear | 5068 | 0 | 0 | 100.00 | 100.00 | 100.00 |
date | 16,446 | 7 | 0 | 99.91 | 100.00 | 99.98 |
gMonthDay | 459 | 0 | 0 | 100 | 100 | 100 |
float | 142 | 0 | 0 | 100.00 | 100.00 | 100.00 |
double | 267 | 0 | 0 | 100 | 99.63 | 99.81 |
nonNegativeInteger | 77 | 0 | 0 | 100.00 | 100.00 | 100 |
decimal | 1 | 0 | 0 | 100.00 | 100.00 | 100.00 |
Complex | 0 | 0 | 523 | 47.80 | 100 | 64.68 |
Total | 36,132 | 934 | 1226 | 99.98 | 98.98 | 99.30 |
7.2 Performance Evaluation
7.2.1 Four-Step Inference Process
Four-step inference process | Execution time (s) | Cache building time (s) |
---|---|---|
Case 1: Step 1 | 31.336 | 11.582 |
Case 1: Step 2 | 15.939 | 15.939 |
Case 1: Step 3 | 243.826 | 40.764 |
Case 1: Step 4 | 17.879 | 17.879 |
Case 1: Step 1 + Step 2 | 33.216 | 13.966 |
Case 1: whole approach | 53.247 | 14.236 |
Case 2: whole approach | – | 59.282 |
7.2.2 Non-ambiguous Lexical-Space-Matching Process
Non-ambiguous LS matching | Jena sources (s) | Modified Jena sources (s) |
---|---|---|
Case 1 | 11.192 | 11.955 |
7.3 Discussion and Comparison
Inference process | Precision (%) | Recall (%) | F-score (%) | Performance evaluation (s) |
---|---|---|---|---|
Four-step [10] | 97.71 | 96.50 | 97.10 | 14.236 (cache building) |
Non-ambiguous LS matching | 99.98 | 98.98 | 99.30 | 11.955 |
Inference process | Method | Simple datatypes | Local information | External information | XML/XSD | RDF–OWL |
---|---|---|---|---|---|---|
Four-step [10] | IRI information, Lexical space, Semantic analysis, Generalization | Only primitive | ✓ | ✓ | X | ✓ |
Non-ambiguous LS matching | Lexical space | Primitive and derived | ✓ | X | ✓ | ✓ |
float and the values themselves are equal. With this process, we demonstrate that an appropriate approach is feasible when non-ambiguous lexical spaces are available. We proposed simple lexical space modifications, but more sophisticated proposals need to be devised.