ABSTRACT
In entity matching, a fundamental issue while training a classifier to label pairs of entities as either duplicates or non-duplicates is the one of selecting informative training examples. Although active learning presents an attractive solution to this problem, previous approaches minimize the misclassification rate (0-1 loss) of the classifier, which is an unsuitable metric for entity matching due to class imbalance (i.e., many more non-duplicate pairs than duplicate pairs). To address this, a recent paper [1] proposes to maximize recall of the classifier under the constraint that its precision should be greater than a specified threshold. However, the proposed technique requires the labels of all n input pairs in the worst-case.
Our main result is an active learning algorithm that approximately maximizes recall of the classifier while respecting a precision constraint with provably sub-linear label complexity (under certain distributional assumptions). Our algorithm uses as a black-box any active learning module that minimizes 0-1 loss. We show that label complexity of our algorithm is at most log n times the label complexity of the black-box, and also bound the difference in the recall of classifier learnt by our algorithm and the recall of the optimal classifier satisfying the precision constraint. We provide an empirical evaluation of our algorithm on several real-world matching data sets that demonstrates the effectiveness of our approach.
Supplemental Material
- Arvind Arasu, Michaela Götz, and Raghav Kaushik. On active learning of record matching packages. In SIGMOD Conference, pages 783--794, 2010. Google ScholarDigital Library
- Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. J. Comput. Syst. Sci., 75(1):78--89, 2009. Google ScholarDigital Library
- Maria-Florina Balcan, Andrei Z. Broder, and Tong Zhang. Margin based active learning. In COLT, pages 35--50, 2007. Google ScholarDigital Library
- Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In ICML, page 7, 2009. Google ScholarDigital Library
- Alina Beygelzimer, Daniel Hsu, John Langford, and Tong Zhang. Agnostic active learning without constraints. In NIPS, pages 199--207, 2010.Google ScholarDigital Library
- Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '03, pages 39--48, New York, NY, USA, 2003. ACM. Google ScholarDigital Library
- Mikhail Bilenko and Raymond J. Mooney. On evaluation and training-set construction for duplicate detection. In Proceedings of the KDD-2003 Workshop on Data Cleaning, Record Linkage and Object Consolidation, Washington DC, pages 7--12, 2003.Google Scholar
- Surajit Chaudhuri, Bee-Chung Chen, Venkatesh Ganti, and Raghav Kaushik. Example-driven design of efficient record matching queries. In Proceedings of the 33rd international conference on Very large data bases, VLDB '07, pages 327--338. VLDB Endowment, 2007. Google ScholarDigital Library
- Peter Christen. Probabilistic data generation for deduplication and data linkage. In IDEAL, pages 109--116, 2005. Google ScholarDigital Library
- David Cohn, Les Atlas, and Richard Ladner. Improving Generalization with Active Learning. Mach. Learn., 15(2):201--221, May 1994. Google ScholarDigital Library
- Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In ICML, pages 208--215, 2008. Google ScholarDigital Library
- Sanjoy Dasgupta, Daniel Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. In NIPS, 2007.Google ScholarDigital Library
- Sanjoy Dasgupta and John Langford. Tutorial summary: Active learning. In ICML, page 178, 2009. Google ScholarDigital Library
- Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Trans. on Knowl. and Data Eng., 19:1--16, January 2007. Google ScholarDigital Library
- A. Frank and A. Asuncion. UCI machine learning repository, 2010.Google Scholar
- Steve Hanneke. A bound on the label complexity of agnostic active learning. In ICML, pages 353--360, 2007. Google ScholarDigital Library
- Steve Hanneke. Adaptive rates of convergence in active learning. In COLT, 2009.Google Scholar
- Sheng-Jun Huang, Rong Jin, and Zhi-Hua Zhou. Active learning by querying informative and representative examples. In NIPS, pages 892--900, 2010.Google ScholarDigital Library
- Nikos Karampatziakis and John Langford. Online importance weight aware updates. In UAI, pages 392--399, 2011.Google ScholarDigital Library
- Hanna Köpcke and Erhard Rahm. Training selection for tuning entity matching. In QDB/MUD, pages 3--12, 2008.Google Scholar
- Hanna Köpcke and Erhard Rahm. Frameworks for entity matching: A comparison. Data Knowl. Eng., 69:197--210, February 2010. Google ScholarDigital Library
- Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '00, pages 169--178, New York, NY, USA, 2000. ACM. Google ScholarDigital Library
- Paul Mineiro. Cost-Sensitive Binary Classification and Active Learning, 2003.Google Scholar
- Franco P. Preparata and Michael I. Shamos. Computational geometry: an introduction. Springer-Verlag New York, Inc., New York, NY, USA, 1985. Google ScholarDigital Library
- Steffen Rendle and Lars Schmidt-Thieme. Active learning of equivalence relations by minimizing the expected loss using constraint inference. In ICDM, pages 1001--1006, 2008. Google ScholarDigital Library
- Sunita Sarawagi and Anuradha Bhamidipaty. Interactive deduplication using active learning. In KDD, pages 269--278, 2002. Google ScholarDigital Library
- Burr Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison, 2009.Google Scholar
- Sheila Tejada, Craig A. Knoblock, and Steven Minton. Learning object identification rules for information integration. Inf. Syst., 26:607--633, December 2001. Google ScholarDigital Library
- Wei Wang and Zhi-Hua Zhou. Multi-view active learning in the non-realizable case. In NIPS, pages 2388--2396, 2010.Google Scholar
- Steven Euijong Whang, David Menestrina, Georgia Koutrika, Martin Theobald, and Hector Garcia-Molina. Entity resolution with iterative blocking. In Proceedings of the 35th SIGMOD international conference on Management of data, SIGMOD '09, pages 219--232, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
- William E. Winkler. The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Census Bureau, 1999.Google Scholar
- Bianca Zadrozny, John Langford, and Naoki Abe. Cost-Sensitive Learning by Cost-Proportionate Example Weighting, 2012.Google Scholar
Index Terms
- Active sampling for entity matching
Recommendations
Active Sampling for Entity Matching with Guarantees
Special Issue on ACM SIGKDD 2012In entity matching, a fundamental issue while training a classifier to label pairs of entities as either duplicates or nonduplicates is the one of selecting informative training examples. Although active learning presents an attractive solution to this ...
Entity Matching with Active Monotone Classification
PODS '18: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database SystemsGiven two sets of entities X and Y, entity matching aims to decide whether x and y represent the same entity for each pair (x, y) ın X x Y. As the last resort, human experts can be called upon to inspect every (x, y), but this is expensive because the ...
Ground Truth Inference for Weakly Supervised Entity Matching
PACMMODEntity matching (EM) refers to the problem of identifying pairs of data records in one or more relational tables that refer to the same entity in the real world. Supervised machine learning (ML) models currently achieve state-of-the-art matching ...
Comments