Elsevier

Journal of Web Semantics

Volume 23, December 2013, Pages 2-15
Journal of Web Semantics

Active learning of expressive linkage rules using genetic programming

https://doi.org/10.1016/j.websem.2013.06.001Get rights and content

Abstract

A central problem in the context of the Web of Linked Data as well as in data integration in general is to identify entities in different data sources that describe the same real-world object. Many existing methods for matching entities rely on explicit linkage rules, which specify the conditions which must hold true for two entities in order to be interlinked. As writing good linkage rules by hand is a non-trivial problem, the burden to generate links between data sources is still high. In order to reduce the effort and expertise required to write linkage rules, we present the ActiveGenLink algorithm which combines genetic programming and active learning to generate expressive linkage rules interactively. The ActiveGenLink algorithm automates the generation of linkage rules and only requires the user to confirm or decline a number of link candidates. ActiveGenLink uses a query strategy which minimizes user involvement by selecting link candidates which yield a high information gain. Our evaluation shows that ActiveGenLink is capable of generating high quality linkage rules based on labeling a small number of candidate links and that our query strategy for selecting the link candidates outperforms the query-by-vote-entropy baseline.

Introduction

The goal of the Linked Data movement is to extend the Web with a global data space by making data sets accessible according to a set of best practices and by setting RDF links between data sources  [1]. While the amount of data that is accessible as Linked Data has grown significantly over the last years, most data sources are still not sufficiently interlinked. Out of the over 31 billion RDF statements published as Linked Data less than 500 million represent RDF links between data sources  [2]. Analysis of the Linking Open Data cloud confirms that it represents a weakly connected graph with most publishers only linking to one other data source  [2].

A number of link discovery tools have been developed, which generate RDF links between entities in different data sets that represent the same real-world object. Unfortunately, fully automatic link discovery tools do not achieve a satisfying accuracy on many data sets  [3]. For this reason, several semi-automatic link discovery tools–such as Silk  [4] or LIMES  [5]–have been developed. These tools compare entities in different Linked Data sources based on user-provided linkage rules which specify the conditions that must hold true for two entities in order to be interlinked.

Writing good linkage rules by hand is a non-trivial problem as the rule author needs to have a detailed knowledge about the structure of the data sets: first of all, the author needs to choose discriminative properties of the entities to be interlinked together with a distance measure and an appropriate distance threshold. For data sets which are noisy or use different data formats, the property values need to be normalized by employing data transformations prior to comparison. As comparing entities by a single property usually is not sufficient to decide whether both entities describe the same real-world object, the linkage rule has to aggregate the similarity of multiple property comparisons using appropriate aggregation functions.

We illustrate this point with the example of a data set about movies: even within this simple example the linkage rule author faces a couple of challenges. First of all, a comparison solely by film title fails for cases when movies with the same title actually represent distinct movies that have been released in different years. Therefore, the linkage rule needs to compare, at the very least, the titles of the movies as well as their release dates and combine both similarities with an appropriate aggregation function. As data sets can be noisy (e.g., the release dates might be off by a couple of days), the rule author also needs to choose suitable distance measures together with appropriate distance thresholds. Linkage rules must also cover data heterogeneities. For instance, a data source may contain some person names that are formatted as “<first name><last name>” while others are formatted as “<last name>, <first name>”. Finding such heterogeneities and adding the specific data transformations to avoid mismatches are often very tedious. Thus, writing a linkage rule is not only a cumbersome but also a time consuming task.

Supervised learning. A way to reduce this effort is to use supervised learning to generate links from the existing reference links, which contain pairs of entities that have been labeled as matches or non-matches. Creating such reference links is much easier than to write linkage rules as it requires no previous knowledge about similarity computation techniques or the specific linkage rule format used by the system. Usually, reference links are created by domain experts who confirm or reject the equivalence of a number of entity pairs from the data sets. For instance, reference links for locations in a geographical data set can be created by labeling pairs of locations as correct or incorrect. Fig. 1 shows an example of an entity pair. In the given example, the pair is to be declined as both entities represent different real-world locations.

Active learning. In order for the supervised learning algorithms to perform well on unknown data, reference links need to include all relevant corner cases. We illustrate this point by having a second look at the example in Fig. 1: while for many cities a comparison by label is sufficient to determine if two entities represent the same real-world city, the given example shows the corner case of distinct cities sharing the same name. If the entity pairs to be labeled by the user are just selected randomly from the data sets, the user has to label a very large number of pairs in order to include these rare corner cases reliably. As manually labeling link candidates is time-consuming, methods to reduce the number of candidates which need to be labeled are desirable.

The fundamental idea of active learning in the context of entity matching is to reduce the number of link candidates which need to be labeled by actively selecting the most informative candidate for being labeled by the user.

Our contribution. In this article, we present ActiveGenLink, an algorithm for learning linkage rules interactively using active learning and genetic programming. ActiveGenLink learns a linkage rule by asking the user to confirm or reject a number of link candidates which are actively selected by the algorithm. Compared to writing linkage rules by hand, ActiveGenLink lowers the required level of expertise as the task of generating linkage rules is automated by the genetic programming algorithm while the user only has to verify a set of link candidates. The employed query strategy for selecting link candidates minimizes user involvement by selecting the links with the highest information gain for manual verification. Within our experiments, ActiveGenLink outperformed state-of-the-art unsupervised approaches after manually labeling a few link candidates (less than 5 within our experiments). In addition, ActiveGenLink is capable of generating linkage rules with a comparable performance than the supervised GenLink algorithm  [6] by labeling a much smaller number of link candidates (between 15 and 50 within our experiments). ActiveGenLink chooses which properties to compare, it chooses appropriate distance measures, aggregation functions, and thresholds, as well as data transformations that are applied to normalize data prior to comparison.

This article makes the following contributions:

  • 1.

    we propose the ActiveGenLink algorithm which applies genetic programming and active learning to the problem of learning linkage rules for generating RDF links in the context of the Web of Data.

  • 2.

    the learned rules are more expressive than the linkage rules learned in the previous work on active learning of linkage rules as our algorithm combines different similarity measures non-linearly and also determines the data transformations that should be employed to normalize data prior to comparison.

  • 3.

    we propose a query strategy that minimizes the number of links that need to be labeled by the user and outperforms the query-by-vote-entropy strategy, which has been used in previous work.

  • 4.

    we have implemented the proposed approach in the Silk Workbench, a web application which can be used by Linked Data publishers and consumers to set RDF links. The Silk Workbench is part of the Silk Link Discovery Framework and is available for download under the terms of the Apache License.

This article builds on our previous work presented at two occasions:

  • in  [6], we present the GenLink algorithm for learning expressive linkage rules from a set of existing reference links using genetic programming.

  • in  [7], we present an approach which combines genetic programming and active learning to generate expressive linkage rules interactively.

This article extends the previously presented active learning approach with a novel query strategy, which further minimizes the number of links that need to be labeled by the user as well as a more extensive experimental evaluation. Although in this paper we focus on interlinking data sources in the context of the Web of Linked Data, our approach is not limited to this use case and can be applied to entity matching in other areas–such as in the context of relational databases–as well.

Outline. This article is organized as follows: Section  2 formalizes the entity matching problem. Section  3 introduces our linkage rule representation. Based on that, Section  4 describes the ActiveGenLink workflow. Section  5 describes the genetic programming approach used for learning linkage rules from existing training data. Section  6 describes the proposed query strategy for selecting link candidates in detail. Section  7 introduces our approach of building the initial pool of unlabeled links. Section  8 presents the results of our experimental evaluation. Section  9 discusses related work. Section  10 presents the implementation of ActiveGenLink in the Silk Workbench.

Section snippets

Problem definition

We consider the problem of matching entities between two data sources A and B. The objective is to determine which entities in A and B identify the same real-world object.

The general problem of entity matching can be formalized as follows  [8]:

Definition 1 Entity Matching

Given two data sources A and B, find the subset of all pairs of entities for which a relation R holds: M={(a,b);aRb,aA,bB}. Similarly, we define the set of all pairs for which R does not hold: U=(A×B)M.

The purpose of relation R is to relate all

Linkage rule representation

In the context of this work, we represent a linkage rule as a tree which is built from four types of operators. We group the operators into similarity operators, which return a similarity score for two given entities, and value operators, which return a set of values for a given entity. First we introduce the value operators:

    property operator:

    retrieves all values of a specific property of each entity, such as its label property.

    transformation operator:

    transforms the values of a set of property

Active learning workflow

The main idea of ActiveGenLink is to evolve a population of candidate solutions iteratively while building a set of reference links. The algorithm starts with a random population of linkage rules and an initially empty set of reference links. In each iteration, it selects a link candidate for which the current population of linkage rules is uncertain from a pool of unlabeled links. After the link has been labeled by an human expert, it evolves the population of linkage rules based on the

Evolving linkage rules

For evolving the linkage rules in the first step of the active learning workflow, ActiveGenLink employs the supervised GenLink algorithm for learning linkage rules. This section gives an overview of the GenLink algorithm. A more detailed description of GenLink is found in  [6].

On starting the execution of the active learning workflow, GenLink generates an initial population of candidate linkage rules according to the method described in Section  5.1.

After a user labeled a new link, GenLink

Query strategy

The goal of the query strategy is to reduce the number of links that need to be labeled. While many query strategies, such as uncertainty sampling, assume that a single model is being trained by the learning method, genetic algorithms train a population of alternative models at the same time. In order to take advantage of this set of competing models, we introduce version space reduction strategies known as query-by-committee. Query-by-committee methods select the query based on the voting of a

Building the unlabeled pool

The overall goal of the active learning algorithm is to create a linkage rule which is able to label all possible entity pairs as matches or non-matches with high confidence. The number of possible entity pairs can be very high for large data sets and usually far exceeds the number of actual matches. For this reason, we use an indexing approach to build a sample which does not include definitive non-matches.

Given two data sets A and B, the initial unlabeled pool UA×B is built according to the

Evaluation

In this section, we evaluate ActiveGenLink experimentally: at first, Section  8.1 introduces the data sets that we used for the evaluation. Section  8.2 describes our experimental setup. Section  8.3 evaluates if by labeling a small number of links, the proposed active learning algorithm is capable of learning linkage rules with a similar accuracy than the supervised learning algorithm GenLink  [6] on a larger set of reference links. We evaluated the scalability of the active learning algorithm

Related work

There is a large body of work on unsupervised entity matching as well as on supervised learning of linkage rules.

Unsupervised learning. In the recent years, a number of approaches have been proposed for unsupervised entity matching. ObjectCoref  [21], [24] is a self-training interlinking approach, which starts with a kernel that consists of known equivalences and iteratively extends this kernel with discriminative property-value pairs. RiMOM  [22] and AgreementMaker  [25] are approaches for

Implementation

This section gives an overview of the implementation of ActiveGenLink in the Silk Workbench which is part of the Silk Link Discovery Framework. The Silk Workbench is a web application which guides the user through the process of interlinking different data sources. The Silk Link Discovery Framework is available for download9 under the terms of the Apache License and all experiments that are presented in this paper can thus be repeated by the interested reader.

The Silk

Conclusion and future work

In this article, we presented the ActiveGenLink algorithm, which combines genetic programming and active learning in order to learn expressive linkage rules which include data transformations and combine different similarity measures non-linearly. We reduce the manual effort of interlinking data sources by automating the generation of linkage rules. The user is only required to perform the much simpler task of confirming or declining a set of link candidates which are actively selected by the

Acknowledgment

This work was supported in part by the EU FP7 project LOD2 - Creating Knowledge out of Interlinked Data10 (Ref. No. 257943).

References (38)

  • B. Matthews

    Comparison of the predicted and observed secondary structure of t4 phage lysozyme

    Biochimica et Biophysica Acta (BBA) — Protein Structure

    (1975)
  • S. Tejada et al.

    Learning object identification rules for information integration

    Information Systems

    (2001)
  • C. Bizer et al.

    Linked Data — the story so far

    International Journal on Semantic Web and Information Systems

    (2009)
  • C. Bizer et al.

    State of the LOD cloud. Technical Report, Freie Universität Berlin

    (2011)
  • J. Euzenat, A. Ferrara, W.R. van Hage, L. Hollink, C. Meilicke, A. Nikolov, D. Ritze, F. Scharffe, P. Shvaiko, H....
  • R. Isele, A. Jentzsch, C. Bizer, Silk Server — adding missing links while consuming Linked Data. in: Proceedings of the...
  • A.-C. Ngonga Ngomo, S. Auer, LIMES — a time-efficient approach for large-scale link discovery on the web of data. in:...
  • R. Isele et al.

    Learning expressive linkage rules using genetic programming

    Proceedings of the VLDB Endowment

    (2012)
  • R. Isele, A. Jentzsch, C. Bizer, Active learning of expressive linkage rules for the web of data. in: Proceedings of...
  • I.P. Fellegi et al.

    A theory for record linkage

    Journal of the American Statistical Association

    (1969)
  • P. Christen

    Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection

    (2012)
  • D. Brickley, L. Miller, FOAF vocabulary specification, 2005....
  • A. Arasu, M. Götz, R. Kaushik, On active learning of record matching packages. in: Proceedings of the 16th ACM SIGMOD...
  • B. Settles

    Active learning literature survey. Computer Sciences Technical Report, University of Wisconsin–Madison

    (2009)
  • H.S. Seung, M. Opper, H. Sompolinsky, Query by committee. in: Proceedings of the Fifth Annual Workshop on Computational...
  • M. Carvalho, A. Laender, M. Gonçalves, A. da Silva, Replica identification using genetic programming. in: Proceedings...
  • J. Koza et al.

    Genetic Programming IV: Routine Human-Competitive Machine Intelligence

    (2005)
  • K.D. Jong, An analysis of the behavior of a class of genetic adaptive systems, Ph.D. Thesis, University of Michigan,...
  • T. Jones, Crossover, macromutation, and population-based search. in: Proceedings of the Sixth International Conference...
  • Cited by (123)

    • Configurable assembly of classification rules for enhancing entity resolution results

      2020, Information Processing and Management
      Citation Excerpt :

      More recently, a number of state-of-the-art works have focused on the automatic generation of classification rules in the context of ER. In (De Freitas et al., 2010; Isele & Bizer, 2013; Tejada, Knoblock, & Minton, 2001), the authors explore genetic programming (De Freitas et al., 2010; Isele & Bizer, 2013) and decision trees (Tejada et al., 2001) to generate ER classification rules. These strategies employ Query-by-Committee (QBC), which aims to classify duplicate entities based on a voting scheme promoted amongst multiple classifiers.

    • Link key candidate extraction with relational concept analysis

      2020, Discrete Applied Mathematics
      Citation Excerpt :

      Various methods have been designed for extracting numerical specifications based on spatial techniques [43], probabilities [46], or genetic programming and active learning [36]. Such numerical specifications may be extracted by using machine learning [27,44]. A larger selection of methods is available in [34].

    View all citing articles on Scopus
    View full text