Active learning of expressive linkage rules using genetic programming
Introduction
The goal of the Linked Data movement is to extend the Web with a global data space by making data sets accessible according to a set of best practices and by setting RDF links between data sources [1]. While the amount of data that is accessible as Linked Data has grown significantly over the last years, most data sources are still not sufficiently interlinked. Out of the over 31 billion RDF statements published as Linked Data less than 500 million represent RDF links between data sources [2]. Analysis of the Linking Open Data cloud confirms that it represents a weakly connected graph with most publishers only linking to one other data source [2].
A number of link discovery tools have been developed, which generate RDF links between entities in different data sets that represent the same real-world object. Unfortunately, fully automatic link discovery tools do not achieve a satisfying accuracy on many data sets [3]. For this reason, several semi-automatic link discovery tools–such as Silk [4] or LIMES [5]–have been developed. These tools compare entities in different Linked Data sources based on user-provided linkage rules which specify the conditions that must hold true for two entities in order to be interlinked.
Writing good linkage rules by hand is a non-trivial problem as the rule author needs to have a detailed knowledge about the structure of the data sets: first of all, the author needs to choose discriminative properties of the entities to be interlinked together with a distance measure and an appropriate distance threshold. For data sets which are noisy or use different data formats, the property values need to be normalized by employing data transformations prior to comparison. As comparing entities by a single property usually is not sufficient to decide whether both entities describe the same real-world object, the linkage rule has to aggregate the similarity of multiple property comparisons using appropriate aggregation functions.
We illustrate this point with the example of a data set about movies: even within this simple example the linkage rule author faces a couple of challenges. First of all, a comparison solely by film title fails for cases when movies with the same title actually represent distinct movies that have been released in different years. Therefore, the linkage rule needs to compare, at the very least, the titles of the movies as well as their release dates and combine both similarities with an appropriate aggregation function. As data sets can be noisy (e.g., the release dates might be off by a couple of days), the rule author also needs to choose suitable distance measures together with appropriate distance thresholds. Linkage rules must also cover data heterogeneities. For instance, a data source may contain some person names that are formatted as “first namelast name” while others are formatted as “last name, first name”. Finding such heterogeneities and adding the specific data transformations to avoid mismatches are often very tedious. Thus, writing a linkage rule is not only a cumbersome but also a time consuming task.
Supervised learning. A way to reduce this effort is to use supervised learning to generate links from the existing reference links, which contain pairs of entities that have been labeled as matches or non-matches. Creating such reference links is much easier than to write linkage rules as it requires no previous knowledge about similarity computation techniques or the specific linkage rule format used by the system. Usually, reference links are created by domain experts who confirm or reject the equivalence of a number of entity pairs from the data sets. For instance, reference links for locations in a geographical data set can be created by labeling pairs of locations as correct or incorrect. Fig. 1 shows an example of an entity pair. In the given example, the pair is to be declined as both entities represent different real-world locations.
Active learning. In order for the supervised learning algorithms to perform well on unknown data, reference links need to include all relevant corner cases. We illustrate this point by having a second look at the example in Fig. 1: while for many cities a comparison by label is sufficient to determine if two entities represent the same real-world city, the given example shows the corner case of distinct cities sharing the same name. If the entity pairs to be labeled by the user are just selected randomly from the data sets, the user has to label a very large number of pairs in order to include these rare corner cases reliably. As manually labeling link candidates is time-consuming, methods to reduce the number of candidates which need to be labeled are desirable.
The fundamental idea of active learning in the context of entity matching is to reduce the number of link candidates which need to be labeled by actively selecting the most informative candidate for being labeled by the user.
Our contribution. In this article, we present ActiveGenLink, an algorithm for learning linkage rules interactively using active learning and genetic programming. ActiveGenLink learns a linkage rule by asking the user to confirm or reject a number of link candidates which are actively selected by the algorithm. Compared to writing linkage rules by hand, ActiveGenLink lowers the required level of expertise as the task of generating linkage rules is automated by the genetic programming algorithm while the user only has to verify a set of link candidates. The employed query strategy for selecting link candidates minimizes user involvement by selecting the links with the highest information gain for manual verification. Within our experiments, ActiveGenLink outperformed state-of-the-art unsupervised approaches after manually labeling a few link candidates (less than 5 within our experiments). In addition, ActiveGenLink is capable of generating linkage rules with a comparable performance than the supervised GenLink algorithm [6] by labeling a much smaller number of link candidates (between 15 and 50 within our experiments). ActiveGenLink chooses which properties to compare, it chooses appropriate distance measures, aggregation functions, and thresholds, as well as data transformations that are applied to normalize data prior to comparison.
This article makes the following contributions:
- 1.
we propose the ActiveGenLink algorithm which applies genetic programming and active learning to the problem of learning linkage rules for generating RDF links in the context of the Web of Data.
- 2.
the learned rules are more expressive than the linkage rules learned in the previous work on active learning of linkage rules as our algorithm combines different similarity measures non-linearly and also determines the data transformations that should be employed to normalize data prior to comparison.
- 3.
we propose a query strategy that minimizes the number of links that need to be labeled by the user and outperforms the query-by-vote-entropy strategy, which has been used in previous work.
- 4.
we have implemented the proposed approach in the Silk Workbench, a web application which can be used by Linked Data publishers and consumers to set RDF links. The Silk Workbench is part of the Silk Link Discovery Framework and is available for download under the terms of the Apache License.
This article builds on our previous work presented at two occasions:
- •
in [6], we present the GenLink algorithm for learning expressive linkage rules from a set of existing reference links using genetic programming.
- •
in [7], we present an approach which combines genetic programming and active learning to generate expressive linkage rules interactively.
Outline. This article is organized as follows: Section 2 formalizes the entity matching problem. Section 3 introduces our linkage rule representation. Based on that, Section 4 describes the ActiveGenLink workflow. Section 5 describes the genetic programming approach used for learning linkage rules from existing training data. Section 6 describes the proposed query strategy for selecting link candidates in detail. Section 7 introduces our approach of building the initial pool of unlabeled links. Section 8 presents the results of our experimental evaluation. Section 9 discusses related work. Section 10 presents the implementation of ActiveGenLink in the Silk Workbench.
Section snippets
Problem definition
We consider the problem of matching entities between two data sources and . The objective is to determine which entities in and identify the same real-world object.
The general problem of entity matching can be formalized as follows [8]:
Definition 1 Entity Matching Given two data sources and , find the subset of all pairs of entities for which a relation holds: Similarly, we define the set of all pairs for which does not hold:
The purpose of relation is to relate all
Linkage rule representation
In the context of this work, we represent a linkage rule as a tree which is built from four types of operators. We group the operators into similarity operators, which return a similarity score for two given entities, and value operators, which return a set of values for a given entity. First we introduce the value operators:
- property operator:
retrieves all values of a specific property of each entity, such as its label property.
- transformation operator:
transforms the values of a set of property
Active learning workflow
The main idea of ActiveGenLink is to evolve a population of candidate solutions iteratively while building a set of reference links. The algorithm starts with a random population of linkage rules and an initially empty set of reference links. In each iteration, it selects a link candidate for which the current population of linkage rules is uncertain from a pool of unlabeled links. After the link has been labeled by an human expert, it evolves the population of linkage rules based on the
Evolving linkage rules
For evolving the linkage rules in the first step of the active learning workflow, ActiveGenLink employs the supervised GenLink algorithm for learning linkage rules. This section gives an overview of the GenLink algorithm. A more detailed description of GenLink is found in [6].
On starting the execution of the active learning workflow, GenLink generates an initial population of candidate linkage rules according to the method described in Section 5.1.
After a user labeled a new link, GenLink
Query strategy
The goal of the query strategy is to reduce the number of links that need to be labeled. While many query strategies, such as uncertainty sampling, assume that a single model is being trained by the learning method, genetic algorithms train a population of alternative models at the same time. In order to take advantage of this set of competing models, we introduce version space reduction strategies known as query-by-committee. Query-by-committee methods select the query based on the voting of a
Building the unlabeled pool
The overall goal of the active learning algorithm is to create a linkage rule which is able to label all possible entity pairs as matches or non-matches with high confidence. The number of possible entity pairs can be very high for large data sets and usually far exceeds the number of actual matches. For this reason, we use an indexing approach to build a sample which does not include definitive non-matches.
Given two data sets and , the initial unlabeled pool is built according to the
Evaluation
In this section, we evaluate ActiveGenLink experimentally: at first, Section 8.1 introduces the data sets that we used for the evaluation. Section 8.2 describes our experimental setup. Section 8.3 evaluates if by labeling a small number of links, the proposed active learning algorithm is capable of learning linkage rules with a similar accuracy than the supervised learning algorithm GenLink [6] on a larger set of reference links. We evaluated the scalability of the active learning algorithm
Related work
There is a large body of work on unsupervised entity matching as well as on supervised learning of linkage rules.
Unsupervised learning. In the recent years, a number of approaches have been proposed for unsupervised entity matching. ObjectCoref [21], [24] is a self-training interlinking approach, which starts with a kernel that consists of known equivalences and iteratively extends this kernel with discriminative property-value pairs. RiMOM [22] and AgreementMaker [25] are approaches for
Implementation
This section gives an overview of the implementation of ActiveGenLink in the Silk Workbench which is part of the Silk Link Discovery Framework. The Silk Workbench is a web application which guides the user through the process of interlinking different data sources. The Silk Link Discovery Framework is available for download9 under the terms of the Apache License and all experiments that are presented in this paper can thus be repeated by the interested reader.
The Silk
Conclusion and future work
In this article, we presented the ActiveGenLink algorithm, which combines genetic programming and active learning in order to learn expressive linkage rules which include data transformations and combine different similarity measures non-linearly. We reduce the manual effort of interlinking data sources by automating the generation of linkage rules. The user is only required to perform the much simpler task of confirming or declining a set of link candidates which are actively selected by the
Acknowledgment
This work was supported in part by the EU FP7 project LOD2 - Creating Knowledge out of Interlinked Data10 (Ref. No. 257943).
References (38)
Comparison of the predicted and observed secondary structure of t4 phage lysozyme
Biochimica et Biophysica Acta (BBA) — Protein Structure
(1975)- et al.
Learning object identification rules for information integration
Information Systems
(2001) - et al.
Linked Data — the story so far
International Journal on Semantic Web and Information Systems
(2009) - et al.
State of the LOD cloud. Technical Report, Freie Universität Berlin
(2011) - J. Euzenat, A. Ferrara, W.R. van Hage, L. Hollink, C. Meilicke, A. Nikolov, D. Ritze, F. Scharffe, P. Shvaiko, H....
- R. Isele, A. Jentzsch, C. Bizer, Silk Server — adding missing links while consuming Linked Data. in: Proceedings of the...
- A.-C. Ngonga Ngomo, S. Auer, LIMES — a time-efficient approach for large-scale link discovery on the web of data. in:...
- et al.
Learning expressive linkage rules using genetic programming
Proceedings of the VLDB Endowment
(2012) - R. Isele, A. Jentzsch, C. Bizer, Active learning of expressive linkage rules for the web of data. in: Proceedings of...
- et al.
A theory for record linkage
Journal of the American Statistical Association
(1969)
Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection
Active learning literature survey. Computer Sciences Technical Report, University of Wisconsin–Madison
Genetic Programming IV: Routine Human-Competitive Machine Intelligence
Cited by (123)
Discovery of link keys in resource description framework datasets based on pattern structures
2023, International Journal of Approximate ReasoningScaling up knowledge graph creation to large and heterogeneous data sources
2023, Journal of Web SemanticsActive deep learning on entity resolution by risk sampling
2022, Knowledge-Based SystemsConfigurable assembly of classification rules for enhancing entity resolution results
2020, Information Processing and ManagementCitation Excerpt :More recently, a number of state-of-the-art works have focused on the automatic generation of classification rules in the context of ER. In (De Freitas et al., 2010; Isele & Bizer, 2013; Tejada, Knoblock, & Minton, 2001), the authors explore genetic programming (De Freitas et al., 2010; Isele & Bizer, 2013) and decision trees (Tejada et al., 2001) to generate ER classification rules. These strategies employ Query-by-Committee (QBC), which aims to classify duplicate entities based on a voting scheme promoted amongst multiple classifiers.
Link key candidate extraction with relational concept analysis
2020, Discrete Applied MathematicsCitation Excerpt :Various methods have been designed for extracting numerical specifications based on spatial techniques [43], probabilities [46], or genetic programming and active learning [36]. Such numerical specifications may be extracted by using machine learning [27,44]. A larger selection of methods is available in [34].
Helio: A framework for implementing the life cycle of knowledge graphs
2024, Semantic Web