1 Introduction
-
The terminological heterogeneity problem occurs when the same entities in different ontologies are represented differently, e.g., openingDate vs. establishedDate and shortDescription vs. abstract.
-
The conceptual heterogeneity problem, which is also called semantic heterogeneity in [9] and logical mismatch in [18], occurs when different axioms are used to define concepts or when totally different concepts are used. For example, most airport instances are described with the type db-onto:Airport, which is a subClass of db-onto:Infrastructure (itself a subClass of db-onto:ArchitecturalStructure). However, some airports are described using db-onto:Building, which is a direct subClass of db-onto:ArchitecturalStructure.
-
Top-level class: If the data set is ontology-based, top-level classes are all the direct subClasses of owl:Thing. Otherwise, we use the top categories as top-level classes. For example, db-onto:Agent and db-onto:Place are top-level classes in DBpedia, and nyt:nytd_geo and nyt:nytd_org are top-level classes in NYTimes.
-
Frequent core property: The frequently used properties describing instances in the data sets are considered as frequent core properties. For example, the properties db-onto:kingdom, db-onto:class, and db-onto:family are frequently used to describe instances defined with the class of db-onto:Species.
2 Related Work
3 Ontology Integration Framework
3.1 Graph-Based Ontology Integration
3.1.1 Graph Pattern Extraction
3.1.2 \(<\)Predicate, Object\(>\) Collection
-
Number: The value consists entirely of numeric characters.
-
URI: Starts with “http://”.
-
String: All other values that cannot be classified.
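The three rules listed above can be sketched as a small classifier. This is only an illustration under stated assumptions, not the implementation used in this work: the function name `classify_value` is hypothetical, "consists entirely of numeric characters" is read as ASCII digits only, and the rules for the Date and Class types are not given in this excerpt, so they are not handled here.

```python
def classify_value(value: str) -> str:
    """Classify an object value as "Number", "URI", or "String".

    Hypothetical helper: only the three rules listed in the text are
    covered. Date and Class values are not detected and fall through to
    "String", because their rules are not stated in this excerpt.
    """
    text = value.strip()
    # Number: the value consists entirely of digits (assumed: ASCII digits).
    if text.isascii() and text.isdigit():
        return "Number"
    # URI: the value starts with "http://".
    if text.startswith("http://"):
        return "URI"
    # String: everything that the rules above do not classify.
    return "String"
```

For example, `classify_value("65447374")` falls under the Number rule, while a language-tagged literal such as `"France"@en` falls through to String.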
Type | Built-in data types |
---|---|
String | |
Date | |
Number | |
URI | |
Predicate | Object | Type |
---|---|---|
rdf:type | db-onto:Country | Class |
rdfs:label | “France”@en | String |
foaf:name | “France”@en | String |
foaf:name | “République française”@en | String |
db-onto:wikiPageExternalLink | | URI |
db-prop:populationEstimate | 65447374 | Number |
...... | ...... | ...... |
geo-onto:name | France | String |
geo-onto:alternateName | “France”@en | String |
geo-onto:featureCode | geo-onto:A.PCLI | Class |
geo-onto:population | 64768389 | Number |
...... | ...... | ...... |
rdf:type | mdb:country | Class |
mdb:country_name | France | String |
mdb:country_population | 64094000 | Number |
rdfs:label | France (Country) | String |
...... | ...... | ...... |
rdf:type | skos:Concept | Class |
skos:inScheme | nyt:nytd_geo | Class |
skos:prefLabel | “France”@en | String |
nyt-prop:first_use | 2004-09-01 | Date |
3.1.3 Related Class and Property Grouping
3.1.4 Aggregation of All Integrated Classes and Properties
3.1.5 Manual Revision
3.2 Machine-Learning-Based Approach
3.2.1 Decision Table
3.2.2 Apriori
3.3 Integrated Ontology Constructor
3.3.1 Ontology Enrichment
-
Annotation: We collect all the default annotation definitions of the classes and properties from the data sets. In this process, for each group of classes and properties, we simply remove the duplicated annotations and the simple annotations that are included in the more comprehensive ones.
-
Domain: The domain information of a property should be included in the integrated ontology because it indicates the relation between a property and a class, which helps users understand which properties can be used with a specific class. To retrieve the domain of a property, we randomly select \(m\) sample instances that have the property and collect all of their class information; we repeat this sampling process \(n\) times. The class information can be collected by following properties such as rdf:type and skos:inScheme. We then analyze the collected class information and choose the most frequently appearing classes as the domains of the property, that is, the classes that appear in almost every sample instance. However, we observed that some classes are frequently used but missing in a few instances. Hence, we set the frequency threshold for domain retrieval to \(0.95 \times Freq_{top}\), where \(Freq_{top}\) is the highest frequency of a class. If we cannot retrieve frequent class information or a default domain definition, we set owl:Thing as the domain.
-
Range: The range information of a property is also important for users when they create SPARQL queries or publish data sets. However, most range definitions are missing, and sometimes the values of a property are published with various ranges. To retrieve the range information, we use the same sample instances described above and analyze the values of the property in them. We can retrieve the built-in data types by tracking the symbol “^^”. Values whose data types are not given explicitly are classified into two types: Resource and String. If the value contains resource information, we classify it as a Resource; otherwise, we consider it a String.
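The domain and range retrieval steps described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in this work: the function names, the `classes_of` mapping (instance to set of classes, as would be gathered through rdf:type or skos:inScheme), the fixed random seed, and the test for "resource information" (a value starting with "http://") are all hypothetical choices.

```python
import random
from collections import Counter


def retrieve_domains(instances, classes_of, m, n, ratio=0.95, seed=0):
    """Return classes whose frequency reaches ratio * Freq_top.

    instances:  list of instance identifiers that have the property
    classes_of: dict mapping an instance to the set of its classes
                (hypothetical input, e.g. collected via rdf:type)
    m, n:       sample size and number of sampling rounds
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    counts = Counter()
    for _ in range(n):
        sample = rng.sample(instances, min(m, len(instances)))
        for instance in sample:
            counts.update(classes_of.get(instance, ()))
    if not counts:
        # No class information found: fall back to owl:Thing.
        return ["owl:Thing"]
    freq_top = max(counts.values())
    threshold = ratio * freq_top
    return sorted(c for c, freq in counts.items() if freq >= threshold)


def classify_range(value: str) -> str:
    """Return the declared datatype, or "Resource" / "String" otherwise."""
    if "^^" in value:
        # Typed literal: the datatype follows the "^^" symbol.
        return value.rsplit("^^", 1)[1]
    if value.startswith("http://"):  # assumed test for resource information
        return "Resource"
    return "String"
```

With 100 sample instances in which one class appears 100 times and another 97 times, both pass the threshold of 95, while a class that appears only 10 times is discarded.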
3.3.2 Ontology Merger
-
Class: Related classes are collected from the graph-based ontology integration component, and the top-level classes in each data set are collected from the machine-learning-based ontology entity extraction component.
1. Groups of classes from the graph-based ontology integration: Related classes from different data sets are extracted by analyzing SameAs graph patterns and then grouped into \(cgroup_{1}, cgroup_{2},\ldots , cgroup_{z}\). We define \({ex{-}onto{:}ClassTerm}_{k}\) for each group \(cgroup_{k}\), where \(ClassTerm_{k}\) is the most frequent term in the group. For all \(c_{i} \in cgroup_{k}\), the triple \({<}{ex{-}onto{:}ClassTerm}_{k},\ {ex{-}prop{:}hasMemberClasses},\ c_{i}{>}\) is added automatically.
2. Classes from the machine-learning-based approach: Top-level classes in each data set are added to the integrated ontology. If a top-level class \(c_{i} \not \in cgroup_{k}\ (1 \le k \le z)\), we create a new group \(cgroup_{z+1}\) for each class \(c_{i}\) and create a new term \({ex{-}onto{:}ClassTerm}_{z+1}\) for the new group. Then we add the triple \({<}{ex{-}onto{:}ClassTerm}_{z+1},\ {ex{-}prop{:}hasMemberClasses},\ c_{i}{>}\).
-
Property: The properties extracted by the two components are merged according to the following rules. First, we extract the existing property type and the domain information of each property from the data sets. The property type is mainly one of rdf:Property, owl:DatatypeProperty, or owl:ObjectProperty. If the type is not clearly defined, we set the type as rdf:Property.
1. Groups of properties from the graph-based ontology integration: Related properties from various data sets are extracted by analyzing the SameAs graph patterns and then grouped into \(pgroup_{1}, pgroup_{2},\ldots , pgroup_{p}\). For each group \(pgroup_{t}\), we choose the most frequent term \({ex{-}onto{:}propTerm}_{t}\). Next, for each property \(prop_{i} \in pgroup_{t}\ (1 \le t \le p)\), we add the triple \({<}{ex{-}onto{:}propTerm}_{t},\ {ex{-}prop{:}hasMemberProperties},\ prop_{i}{>}\) and the triple \({<}{ex{-}onto{:}propTerm}_{t},\ {rdfs{:}domain},\ dInfo{>}\), where \(dInfo\) is the domain information of \(prop_{i}\) retrieved in the ontology enrichment process.
2. Properties from the machine-learning-based approach: We automatically add domain information for the properties retrieved by the Apriori method. For each property \(prop\) extracted from the instances of class \(c\), the triple \({<}prop,\ {rdfs{:}domain},\ c{>}\) is added automatically if it is not already defined in the data set.
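The class half of the merging rules above can be sketched as follows; the property half follows the same pattern with ex-prop:hasMemberProperties. This is a minimal illustration, not the implementation used in this work. In particular, how the "most frequent term" of a group is computed is not specified in the text, so the sketch assumes it is the most common lowercased local name (the part after the prefix); the function names are hypothetical.

```python
from collections import Counter


def local_name(term: str) -> str:
    """Return the part of a prefixed name after the first colon."""
    return term.split(":", 1)[-1]


def most_frequent_term(group):
    """Assumed rule: most common lowercased local name in the group."""
    counts = Counter(local_name(term).lower() for term in group)
    return counts.most_common(1)[0][0]


def merge_class_groups(groups, top_level_classes):
    """Build <ex-onto:Term, ex-prop:hasMemberClasses, c> triples.

    groups:            class groups from the graph-based component
    top_level_classes: classes from the machine-learning-based component
    """
    all_groups = [list(group) for group in groups]
    grouped = {c for group in all_groups for c in group}
    # A top-level class that belongs to no existing group gets its own group.
    for c in top_level_classes:
        if c not in grouped:
            all_groups.append([c])
            grouped.add(c)
    triples = []
    for group in all_groups:
        term = "ex-onto:" + most_frequent_term(group)
        for c in group:
            triples.append((term, "ex-prop:hasMemberClasses", c))
    return triples
```

For a group containing db-onto:Country, mdb:country, and geo-onto:A.PCLI, the assumed rule picks the term "country", and an ungrouped top-level class such as db-onto:Agent ends up in a new single-member group.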
3.3.3 Naming Validator
4 Experiments
4.1 Data Sets
Data set | Instances | Selected instances | Classes | Top-level classes | Properties | Selected properties |
---|---|---|---|---|---|---|
DBpedia | 3,708,696 | 64,460 | 241 | 28 | 1,385 | 840 |
Geonames | 7,480,462 | 45,000 | 428 | 9 | 31 | 21 |
NYTimes | 10,441 | 10,441 | 5 | 4 | 8 | 7 |
LinkedMDB | 694,400 | 50,000 | 53 | 10 | 107 | 60 |
4.2 Decision Table
-
DBpedia. The Decision Table algorithm retrieved 53 DBpedia properties from 840 selected properties. For example, the properties db-onto:formationYear, db-prop:city, db-prop:debut, and db-prop:stateName are extracted from DBpedia instances. The precision, recall, and \(F\)-measure on DBpedia are 0.892, 0.821, and 0.837, respectively.
-
Geonames. We retrieved 10 properties from 21 selected properties, such as geo-onto:alternateName, geo-onto:countryCode, and wgs84_pos:alt. Since all the instances of Geonames are from the geographic domain, the Decision Table algorithm cannot distinguish the different classes well using these commonly used properties. Hence, the evaluation results on Geonames are low, with 0.472 precision, 0.4 recall, and 0.324 \(F\)-measure.
-
NYTimes. Among the 7 properties used in the data set, 5 are retrieved using the Decision Table algorithm: skos:scopeNote, nyt:latest_use, nyt:topicPage, skos:definition, and wgs84_pos:long. In NYTimes, only a few properties describe news articles and most of them are commonly used in every instance. The cross-validation test results with NYTimes are 0.795 precision, 0.792 recall, and 0.785 \(F\)-measure.
-
LinkedMDB. The algorithm correctly classifies all the instances in LinkedMDB with 11 properties retrieved from 60 properties. In addition to commonly used properties such as foaf:page and rdfs:label, we also extracted some unique properties such as mdb:director_directorid, mdb:writer_writerid, and mdb:performance_performanceid.
Data set | Average precision | Average recall | Average \(F\)-measure | Retrieved properties |
---|---|---|---|---|
DBpedia | 0.892 | 0.821 | 0.837 | 53 |
Geonames | 0.472 | 0.4 | 0.324 | 10 |
NYTimes | 0.795 | 0.792 | 0.785 | 5 |
LinkedMDB | 1 | 1 | 1 | 11 |
4.3 Apriori
Data set | Class | Properties |
---|---|---|
DBpedia | db:Event | db-onto:place, db-prop:date, db-onto:related/geo |
 | db:Species | db-onto:kingdom, db-onto:class, db-onto:family |
 | db:Person | foaf:givenName, foaf:surname, db-onto:birthDate |
Geonames | geo-onto:P | geo-onto:alternateName, geo-onto:countryCode |
 | geo-onto:R | wgs84_pos:alt, geo-onto:name, geo-onto:countryCode |
NYTimes | nyt:nytd_geo | wgs84_pos:long |
 | nyt:nytd_des | skos:scopeNote |
LinkedMDB | mdb:actor | mdb:performance, mdb:actor_name, mdb:actor_netflix_id |
 | mdb:film | mdb:director, mdb:performance, mdb:actor, dc:date |
4.4 Comparison with Other Ontology Matching Tools
4.4.1 Comparison with DB-Geo Alignments
DBpedia-Geonames | AROMA | FITON |
---|---|---|
Precision | 0.18 | 0.64 |
Recall | 0.04 | 0.37 |
\(F\)-measure | 0.07 | 0.47 |
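
The \(F\)-measure values in the table above are consistent with the usual harmonic mean of precision and recall. The formula is not stated in this excerpt, so the sketch below assumes that standard definition; the function name `f_measure` is hypothetical.

```python
def f_measure(precision: float, recall: float):
    """Harmonic mean of precision and recall (assumed standard definition).

    Returns None when both inputs are zero, where the measure is undefined.
    """
    if precision + recall == 0:
        return None
    return 2 * precision * recall / (precision + recall)


# Values taken from the DBpedia-Geonames comparison table.
fiton = round(f_measure(0.64, 0.37), 2)
aroma = round(f_measure(0.18, 0.04), 2)
```

Under this definition the two table columns round to 0.47 and 0.07, and a pair with zero precision and zero recall has no defined \(F\)-measure.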
4.4.2 Comparison with BLOOMS Alignments
Geonames-DBpedia | Alignments from BLOOMS | | | |
---|---|---|---|---|
 | FITON | BLOOMS | RiMOM | S-Match |
Precision | 0.65 | 0 | err | 0.23 |
Recall | 0.26 | 0 | err | 1 |
\(F\)-measure | 0.37 | N/A | N/A | 0.37 |
4.5 Evaluation of the Integrated Ontology
4.5.1 Evaluation with OOPS! validator
4.5.2 Evaluation with Ontology Reference Alignments
Data pair | Precision | Recall | \(F\)-measure |
---|---|---|---|
DBpedia-Geonames | 0.64 | 0.37 | 0.47 |
DBpedia-LinkedMDB | 1 | 0.1 | 0.2 |
DBpedia-NYTimes | 0.93 | 0.02 | 0.04 |
LinkedMDB-NYTimes | 1 | 0.07 | 0.13 |
LinkedMDB-Geonames | 0 | 0 | n/a |
Geonames-NYTimes | 1 | 0.04 | 0.08 |
Property | Number of instances | rdfs:domain |
---|---|---|
db-onto:birthDate | 287,327 | db-onto:Person |
db-prop:datebirth | 1,675 | N/A |
db-prop:dateofbirth | 87,364 | N/A |
db-prop:dateOfBirth | 163,876 | N/A |
db-prop:born | 34,832 | N/A |
db-prop:birthdate | 70,630 | N/A |
db-prop:birthDate | 101,121 | N/A |
5 Discussion
5.1 Discovering Missing Links with Graph Patterns
5.2 Discovering Missing Links with Integrated Ontology
Example 1: Link Islands |
---|
SELECT DISTINCT ?geo ?db ?string |
WHERE { |
?geo geo-onto:featureCode geo-onto:T.ISL. |
?geo ?gname ?string. |
ex-onto:name ex-prop:hasMemberProperties ?gname. |
?db rdf:type db-onto:Island. |
ex-onto:name ex-prop:hasMemberProperties ?dname. |
?db ?dname ?string. } |
Example 2: Link Countries |
SELECT DISTINCT ?geo ?db |
WHERE { |
ex-onto:name ex-prop:hasMemberProperties ?gname. |
{ ?geo geo-onto:featureCode geo-onto:A.PCLI. } |
UNION |
{ ?geo geo-onto:featureCode geo-onto:A.PCLD. } |
?geo ?gname ?string. |
?db rdf:type db-onto:Country. |
ex-onto:name ex-prop:hasMemberProperties ?dname. |
?db ?dname ?string. } |
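
The join performed by the two example queries can be sketched over in-memory triples: two instances are linked when they share a value through any property that belongs to the same integrated group. This is only an illustration under stated assumptions; the function name `find_links`, the toy identifiers, and the toy triples are hypothetical, and the inputs are assumed to be pre-filtered by class (for example, islands only).

```python
def find_links(triples_a, triples_b, member_props):
    """Return (a, b) pairs that share a value via a member property.

    triples_a, triples_b: iterables of (subject, predicate, object)
    member_props:         properties grouped under one integrated term,
                          e.g. the members of ex-onto:name
    """
    # Index the second data set by value for the grouped properties.
    by_value = {}
    for s, p, o in triples_b:
        if p in member_props:
            by_value.setdefault(o, set()).add(s)
    links = set()
    for s, p, o in triples_a:
        if p in member_props:
            for target in by_value.get(o, ()):
                links.add((s, target))
    return links


# Hypothetical toy data, not taken from the real data sets.
name_props = {"geo-onto:name", "rdfs:label", "foaf:name"}
geo = [("geo:g1", "geo-onto:name", "Example Island"),
       ("geo:g2", "geo-onto:name", "Other Island")]
db = [("db:Example_Island", "rdfs:label", "Example Island"),
      ("db:Example_Island", "db-onto:areaTotal", "12")]
links = find_links(geo, db, name_props)
```

In this toy setting only geo:g1 and db:Example_Island are linked, because they share the same name value through two different properties of the same group.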
5.3 More Answers with the Integrated Ontology
Standard Query | Query with the Integrated Ontology |
---|---|
Give me all the cities with more than 10,000,000 inhabitants | |
SELECT DISTINCT ?uri ?string | SELECT DISTINCT ?uri ?string |
WHERE { | WHERE { |
?uri rdf:type db-onto:City. | ?uri rdf:type db-onto:City. |
 | ex-onto:population ex-prop:hasMemberProperties ?prop. |
?uri db-prop:populationTotal ?inhabitants. | ?uri ?prop ?inhabitants. |
FILTER (?inhabitants > 10000000). | FILTER (?inhabitants > 10000000). |
OPTIONAL { ?uri rdfs:label ?string. | OPTIONAL { ?uri rdfs:label ?string. |
FILTER (lang(?string) = 'en') }} | FILTER (lang(?string) = 'en') }} |
How tall is Claudia Schiffer? | |
SELECT DISTINCT ?height | SELECT DISTINCT ?height |
WHERE { | WHERE { |
res:Claudia_Schiffer db-onto:height ?height. | ex-onto:height ex-prop:hasMemberProperties ?hprop. |
} | res:Claudia_Schiffer ?hprop ?height. } |
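
The difference between the two query styles can be sketched over in-memory triples: the standard query fixes a single predicate, whereas the integrated query accepts any member of the property group. This is a minimal illustration, not an evaluation of real data; the function names, the resource `res:Example_Person`, and all triples below are hypothetical.

```python
def member_properties(ontology, term):
    """Collect the properties grouped under an integrated term."""
    return {o for s, p, o in ontology
            if s == term and p == "ex-prop:hasMemberProperties"}


def values(data, subject, predicates):
    """Return the sorted distinct values of a subject for the predicates."""
    return sorted({o for s, p, o in data
                   if s == subject and p in predicates})


# Hypothetical toy ontology and data.
ontology = [
    ("ex-onto:height", "ex-prop:hasMemberProperties", "db-onto:height"),
    ("ex-onto:height", "ex-prop:hasMemberProperties", "db-prop:height"),
]
data = [("res:Example_Person", "db-prop:height", "1.80")]

# Standard query: one fixed predicate.
standard = values(data, "res:Example_Person", {"db-onto:height"})
# Integrated query: any property grouped under ex-onto:height.
integrated = values(data, "res:Example_Person",
                    member_properties(ontology, "ex-onto:height"))
```

Here the standard lookup finds nothing because the value is published under a different property, while the integrated lookup returns it through the group membership.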