research-article

Active Sampling for Entity Matching with Guarantees

Authors:
Kedar Bellare

Yahoo Research and Facebook Inc.

Yahoo Research and Facebook Inc.
View Profile

,
Suresh Iyengar

Yahoo Research and Microsoft Research Lab India

Yahoo Research and Microsoft Research Lab India
View Profile

,
Aditya Parameswaran

Yahoo Research and Stanford University

Yahoo Research and Stanford University
View Profile

,
Vibhor Rastogi

Yahoo Research and Google Inc.

Yahoo Research and Google Inc.
View Profile

Authors Info & Claims

ACM Transactions on Knowledge Discovery from Data Volume 7 Issue 3Article No.: 12pp 1–24https://doi.org/10.1145/2500490

Published:01 September 2013Publication History

ACM Transactions on Knowledge Discovery from Data

Abstract

In entity matching, a fundamental issue while training a classifier to label pairs of entities as either duplicates or nonduplicates is the one of selecting informative training examples. Although active learning presents an attractive solution to this problem, previous approaches minimize the misclassification rate (0--1 loss) of the classifier, which is an unsuitable metric for entity matching due to class imbalance (i.e., many more nonduplicate pairs than duplicate pairs). To address this, a recent paper [Arasu et al. 2010] proposes to maximize recall of the classifier under the constraint that its precision should be greater than a specified threshold. However, the proposed technique requires the labels of all n input pairs in the worst case.

Our main result is an active learning algorithm that approximately maximizes recall of the classifier while respecting a precision constraint with provably sublinear label complexity (under certain distributional assumptions). Our algorithm uses as a black box any active learning module that minimizes 0--1 loss. We show that label complexity of our algorithm is at most log n times the label complexity of the black box, and also bound the difference in the recall of classifier learnt by our algorithm and the recall of the optimal classifier satisfying the precision constraint. We provide an empirical evaluation of our algorithm on several real-world matching data sets that demonstrates the effectiveness of our approach.

References

Arasu, A., Götz, M., and Kaushik, R. 2010. On active learning of record matching packages. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 783--794. Google ScholarDigital Library
Balcan, M.-F., Broder, A. Z., and Zhang, T. 2007. Margin based active learning. In Proceedings of the 20th Annual Conference on Learning Theory. 35--50. Google ScholarDigital Library
Balcan, M.-F., Beygelzimer, A., and Langford, J. 2009. Agnostic active learning. J. Comput. Syst. Sci. 75, 1, 78--89. Google ScholarDigital Library
Beygelzimer, A., Dasgupta, S., and Langford, J. 2009. Importance weighted active learning. In Proceedings of the International Conference on Machine Learning. 7. Google ScholarDigital Library
Beygelzimer, A., Hsu, D., Langford, J., and Zhang, T. 2010. Agnostic active learning without constraints. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 199--207.Google Scholar
Bilenko, M. and Mooney, R. J. 2003a. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’03). ACM, New York, 39--48. Google ScholarDigital Library
Bilenko, M. and Mooney, R. J. 2003b. On evaluation and training-set construction for duplicate detection. In Proceedings of the KDD’03 Workshop on Data Cleaning, Record Linkage and Object Consolidation. 7--12.Google Scholar
Chaudhuri, S., Chen, B.-C., Ganti, V., and Kaushik, R. 2007. Example-driven design of efficient record matching queries. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB’07). VLDB Endowment, 327--338. Google ScholarDigital Library
Christen, P. 2005. Probabilistic data generation for deduplication and data linkage. In Proceedings of the 6th International Conference on Intelligent Data Engineering and Automated Learning. 109--116. Google ScholarDigital Library
Christen, P. 2012. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 24, 9, 1537--1555. Google ScholarDigital Library
Cohn, D., Atlas, L., and Ladner, R. 1994. Improving generalization with active learning. Mach. Learn. 15, 2, 201--221. Google ScholarDigital Library
Dasgupta, S. and Hsu, D. 2008. Hierarchical sampling for active learning. In Proceedings of the International Conference on Machine Learning. 208--215. Google ScholarDigital Library
Dasgupta, S. and Langford, J. 2009. Tutorial summary: Active learning. In Proceedings of the International Conference on Machine Learning. 178. Google ScholarDigital Library
Dasgupta, S., Hsu, D., and Monteleoni, C. 2007. A general agnostic active learning algorithm. In Proceedings of the Conference on Advances in Neural Information Processing Systems.Google Scholar
Elmagarmid, A. K., Ipeirotis, P. G., and Verykios, V. S. 2007. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng. 19, 1--16. Google ScholarDigital Library
Frank, A. and Asuncion, A. 2010. UCI machine learning repository. http://archive.ics.uci.edu/ml/.Google Scholar
Hanneke, S. 2007. A bound on the label complexity of agnostic active learning. In Proceedings of the International Conference on Machine Learning. 353--360. Google ScholarDigital Library
Hanneke, S. 2009. Adaptive rates of convergence in active learning. In Proceedings of the 20th Annual Conference on Learning Theory.Google Scholar
Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58, 301, 13--30.Google ScholarCross Ref
Huang, S.-J., Jin, R., and Zhou, Z.-H. 2010. Active learning by querying informative and representative examples. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 892--900.Google Scholar
Karampatziakis, N. and Langford, J. 2011. Online importance weight aware updates. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. 392--399.Google Scholar
Köpcke, H. and Rahm, E. 2008. Training selection for tuning entity matching. In Proceedings of the International Workshop on Quality in Databases and Management of Uncertain Data. 3--12.Google Scholar
Köpcke, H. and Rahm, 2010. Frameworks for entity matching: A comparison. Data Knowl. Eng. 69, 197--210. Google ScholarDigital Library
McCallum, A., Nigam, K., and Ungar, L. H. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’00). ACM, New York, 169--178. Google ScholarDigital Library
McNeill, N., Kardes, H., and Borthwick, A. 2012. Dynamic record blocking: Efficient linking of massive databases in MapReduce. In Proceedings of the 10th International Workshop on Quality in Databases (QDB) in conjunction with VLDB’12.Google Scholar
Mineiro, P. 2003. Cost-sensitive binary classification and active learning. http://www.machinedlearnings.com/2012/01/cost-sensitive-binary-classification.html.Google Scholar
Preparata, F. P. and Shamos, M. I. 1985. Computational Geometry: An Introduction. Springer. Google ScholarDigital Library
Rendle, S. and Schmidt-Thieme, L. 2008. Active learning of equivalence relations by minimizing the expected loss using constraint inference. In Proceedings of the IEEE International Conference on Data Mining. 1001--1006. Google ScholarDigital Library
Sarawagi, S. and Bhamidipaty, A. 2002. Interactive deduplication using active learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 269--278. Google ScholarDigital Library
Settles, B. 2009. Active learning literature survey. Computer Sciences Tech. rep. 1648, University of Wisconsin--Madison.Google Scholar
Tejada, S., Knoblock, C. A., and Minton, S. 2001. Learning object identification rules for information integration. Inf. Syst. 26, 607--633. Google ScholarDigital Library
Wang, W. and Zhou, Z.-H. 2010. Multi-view active learning in the non-realizable case. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 2388--2396.Google Scholar
Whang, S. E., Menestrina, D., Koutrika, G., Theobald, M., and Garcia-Molina, H. 2009. Entity resolution with iterative blocking. In Proceedings of the 35th SIGMOD International Conference on Management of Data (SIGMOD’09). ACM, 219--232. Google ScholarDigital Library
Winkler, W. E. 1999. The state of record linkage and current research problems. Tech. rep., Statistical Research Division, U.S. Census Bureau.Google Scholar
Zadrozny, B., Langford, J., and Abe, N. 2012. Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the IEEE International Conference on Data Mining. Google ScholarDigital Library

Index Terms

Active Sampling for Entity Matching with Guarantees
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Recommendations

Active sampling for entity matching
KDD '12: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining

In entity matching, a fundamental issue while training a classifier to label pairs of entities as either duplicates or non-duplicates is the one of selecting informative training examples. Although active learning presents an attractive solution to this ...
Read More
Entity Matching with Active Monotone Classification
PODS '18: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Given two sets of entities X and Y, entity matching aims to decide whether x and y represent the same entity for each pair (x, y) ın X x Y. As the last resort, human experts can be called upon to inspect every (x, y), but this is expensive because the ...
Read More
Ground Truth Inference for Weakly Supervised Entity Matching
PACMMOD

Entity matching (EM) refers to the problem of identifying pairs of data records in one or more relational tables that refer to the same entity in the real world. Supervised machine learning (ML) models currently achieve state-of-the-art matching ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Knowledge Discovery from Data Volume 7, Issue 3
Special Issue on ACM SIGKDD 2012
September 2013
156 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/2513092
Issue’s Table of Contents

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 September 2013
- Accepted: 1 February 2013
- Revised: 1 December 2012
- Received: 1 September 2012
Published in tkdd Volume 7, Issue 3

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Entity matching
active learning
deduplication
imbalanced data
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 6
  Total Citations
  View Citations
- 392
  Total Downloads
- Downloads (Last 12 months)6
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Active Sampling for Entity Matching with Guarantees

ACM Transactions on Knowledge Discovery from Data

Abstract

References

Cited By

Index Terms

Recommendations

Active sampling for entity matching

Entity Matching with Active Monotone Classification

Ground Truth Inference for Weakly Supervised Entity Matching

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Active Sampling for Entity Matching with Guarantees

ACM Transactions on Knowledge Discovery from Data

Abstract

References

Cited By

Index Terms

Recommendations

Active sampling for entity matching

Entity Matching with Active Monotone Classification

Ground Truth Inference for Weakly Supervised Entity Matching

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media