research-article

Active sampling for entity matching

Authors:
Kedar Bellare

Yahoo! Research, Sunnyvale, CA, USA

Yahoo! Research, Sunnyvale, CA, USA
View Profile

,
Suresh Iyengar

Yahoo! Labs, Bangalore, India

Yahoo! Labs, Bangalore, India
View Profile

,
Aditya G. Parameswaran

Stanford University, Stanford, CA, USA

Stanford University, Stanford, CA, USA
View Profile

,
Vibhor Rastogi

Yahoo! Research, Sunnyvale, CA, USA

Yahoo! Research, Sunnyvale, CA, USA
View Profile

KDD '12: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data miningAugust 2012Pages 1131–1139https://doi.org/10.1145/2339530.2339707

Published:12 August 2012Publication History

KDD '12: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 1131–1139

ABSTRACT

In entity matching, a fundamental issue while training a classifier to label pairs of entities as either duplicates or non-duplicates is the one of selecting informative training examples. Although active learning presents an attractive solution to this problem, previous approaches minimize the misclassification rate (0-1 loss) of the classifier, which is an unsuitable metric for entity matching due to class imbalance (i.e., many more non-duplicate pairs than duplicate pairs). To address this, a recent paper [1] proposes to maximize recall of the classifier under the constraint that its precision should be greater than a specified threshold. However, the proposed technique requires the labels of all n input pairs in the worst-case.

Our main result is an active learning algorithm that approximately maximizes recall of the classifier while respecting a precision constraint with provably sub-linear label complexity (under certain distributional assumptions). Our algorithm uses as a black-box any active learning module that minimizes 0-1 loss. We show that label complexity of our algorithm is at most log n times the label complexity of the black-box, and also bound the difference in the recall of classifier learnt by our algorithm and the recall of the optimal classifier satisfying the precision constraint. We provide an empirical evaluation of our algorithm on several real-world matching data sets that demonstrates the effectiveness of our approach.

Supplemental Material

311b_t_talk_13.mp4

mp4

324 MB

Download

References

Arvind Arasu, Michaela Götz, and Raghav Kaushik. On active learning of record matching packages. In SIGMOD Conference, pages 783--794, 2010. Google ScholarDigital Library
Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. J. Comput. Syst. Sci., 75(1):78--89, 2009. Google ScholarDigital Library
Maria-Florina Balcan, Andrei Z. Broder, and Tong Zhang. Margin based active learning. In COLT, pages 35--50, 2007. Google ScholarDigital Library
Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In ICML, page 7, 2009. Google ScholarDigital Library
Alina Beygelzimer, Daniel Hsu, John Langford, and Tong Zhang. Agnostic active learning without constraints. In NIPS, pages 199--207, 2010.Google ScholarDigital Library
Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '03, pages 39--48, New York, NY, USA, 2003. ACM. Google ScholarDigital Library
Mikhail Bilenko and Raymond J. Mooney. On evaluation and training-set construction for duplicate detection. In Proceedings of the KDD-2003 Workshop on Data Cleaning, Record Linkage and Object Consolidation, Washington DC, pages 7--12, 2003.Google Scholar
Surajit Chaudhuri, Bee-Chung Chen, Venkatesh Ganti, and Raghav Kaushik. Example-driven design of efficient record matching queries. In Proceedings of the 33rd international conference on Very large data bases, VLDB '07, pages 327--338. VLDB Endowment, 2007. Google ScholarDigital Library
Peter Christen. Probabilistic data generation for deduplication and data linkage. In IDEAL, pages 109--116, 2005. Google ScholarDigital Library
David Cohn, Les Atlas, and Richard Ladner. Improving Generalization with Active Learning. Mach. Learn., 15(2):201--221, May 1994. Google ScholarDigital Library
Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In ICML, pages 208--215, 2008. Google ScholarDigital Library
Sanjoy Dasgupta, Daniel Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. In NIPS, 2007.Google ScholarDigital Library
Sanjoy Dasgupta and John Langford. Tutorial summary: Active learning. In ICML, page 178, 2009. Google ScholarDigital Library
Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Trans. on Knowl. and Data Eng., 19:1--16, January 2007. Google ScholarDigital Library
A. Frank and A. Asuncion. UCI machine learning repository, 2010.Google Scholar
Steve Hanneke. A bound on the label complexity of agnostic active learning. In ICML, pages 353--360, 2007. Google ScholarDigital Library
Steve Hanneke. Adaptive rates of convergence in active learning. In COLT, 2009.Google Scholar
Sheng-Jun Huang, Rong Jin, and Zhi-Hua Zhou. Active learning by querying informative and representative examples. In NIPS, pages 892--900, 2010.Google ScholarDigital Library
Nikos Karampatziakis and John Langford. Online importance weight aware updates. In UAI, pages 392--399, 2011.Google ScholarDigital Library
Hanna Köpcke and Erhard Rahm. Training selection for tuning entity matching. In QDB/MUD, pages 3--12, 2008.Google Scholar
Hanna Köpcke and Erhard Rahm. Frameworks for entity matching: A comparison. Data Knowl. Eng., 69:197--210, February 2010. Google ScholarDigital Library
Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '00, pages 169--178, New York, NY, USA, 2000. ACM. Google ScholarDigital Library
Paul Mineiro. Cost-Sensitive Binary Classification and Active Learning, 2003.Google Scholar
Franco P. Preparata and Michael I. Shamos. Computational geometry: an introduction. Springer-Verlag New York, Inc., New York, NY, USA, 1985. Google ScholarDigital Library
Steffen Rendle and Lars Schmidt-Thieme. Active learning of equivalence relations by minimizing the expected loss using constraint inference. In ICDM, pages 1001--1006, 2008. Google ScholarDigital Library
Sunita Sarawagi and Anuradha Bhamidipaty. Interactive deduplication using active learning. In KDD, pages 269--278, 2002. Google ScholarDigital Library
Burr Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison, 2009.Google Scholar
Sheila Tejada, Craig A. Knoblock, and Steven Minton. Learning object identification rules for information integration. Inf. Syst., 26:607--633, December 2001. Google ScholarDigital Library
Wei Wang and Zhi-Hua Zhou. Multi-view active learning in the non-realizable case. In NIPS, pages 2388--2396, 2010.Google Scholar
Steven Euijong Whang, David Menestrina, Georgia Koutrika, Martin Theobald, and Hector Garcia-Molina. Entity resolution with iterative blocking. In Proceedings of the 35th SIGMOD international conference on Management of data, SIGMOD '09, pages 219--232, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
William E. Winkler. The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Census Bureau, 1999.Google Scholar
Bianca Zadrozny, John Langford, and Naoki Abe. Cost-Sensitive Learning by Cost-Proportionate Example Weighting, 2012.Google Scholar

Index Terms

Active sampling for entity matching
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Recommendations

Active Sampling for Entity Matching with Guarantees
Special Issue on ACM SIGKDD 2012

In entity matching, a fundamental issue while training a classifier to label pairs of entities as either duplicates or nonduplicates is the one of selecting informative training examples. Although active learning presents an attractive solution to this ...
Read More
Entity Matching with Active Monotone Classification
PODS '18: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Given two sets of entities X and Y, entity matching aims to decide whether x and y represent the same entity for each pair (x, y) ın X x Y. As the last resort, human experts can be called upon to inspect every (x, y), but this is expensive because the ...
Read More
Ground Truth Inference for Weakly Supervised Entity Matching
PACMMOD

Entity matching (EM) refers to the problem of identifying pairs of data records in one or more relational tables that refer to the same entity in the real world. Supervised machine learning (ML) models currently achieve state-of-the-art matching ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '12: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2012
1616 pages
ISBN:9781450314626
DOI:10.1145/2339530
General Chair:
Qiang Yang
Hong Kong University of Science and Technology
,
Program Chairs:
Deepak Agarwal
LinkedIn
,
Jian Pei
Simon Fraser University
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 12 August 2012
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
active learning
deduplication
entity matching
imbalanced data
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate1,133of8,635submissions,13%
Upcoming Conference
KDD '24

Sponsor:

sigkdd

sigkdd

The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 51
  Total Citations
  View Citations
- 932
  Total Downloads
- Downloads (Last 12 months)23
- Downloads (Last 6 weeks)4
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Active sampling for entity matching

KDD '12: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

Active Sampling for Entity Matching with Guarantees

Entity Matching with Active Monotone Classification

Ground Truth Inference for Weakly Supervised Entity Matching