Abstract
In entity matching, a fundamental issue while training a classifier to label pairs of entities as either duplicates or nonduplicates is the one of selecting informative training examples. Although active learning presents an attractive solution to this problem, previous approaches minimize the misclassification rate (0--1 loss) of the classifier, which is an unsuitable metric for entity matching due to class imbalance (i.e., many more nonduplicate pairs than duplicate pairs). To address this, a recent paper [Arasu et al. 2010] proposes to maximize recall of the classifier under the constraint that its precision should be greater than a specified threshold. However, the proposed technique requires the labels of all n input pairs in the worst case.
Our main result is an active learning algorithm that approximately maximizes recall of the classifier while respecting a precision constraint with provably sublinear label complexity (under certain distributional assumptions). Our algorithm uses as a black box any active learning module that minimizes 0--1 loss. We show that label complexity of our algorithm is at most log n times the label complexity of the black box, and also bound the difference in the recall of classifier learnt by our algorithm and the recall of the optimal classifier satisfying the precision constraint. We provide an empirical evaluation of our algorithm on several real-world matching data sets that demonstrates the effectiveness of our approach.
- Arasu, A., Götz, M., and Kaushik, R. 2010. On active learning of record matching packages. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 783--794. Google ScholarDigital Library
- Balcan, M.-F., Broder, A. Z., and Zhang, T. 2007. Margin based active learning. In Proceedings of the 20th Annual Conference on Learning Theory. 35--50. Google ScholarDigital Library
- Balcan, M.-F., Beygelzimer, A., and Langford, J. 2009. Agnostic active learning. J. Comput. Syst. Sci. 75, 1, 78--89. Google ScholarDigital Library
- Beygelzimer, A., Dasgupta, S., and Langford, J. 2009. Importance weighted active learning. In Proceedings of the International Conference on Machine Learning. 7. Google ScholarDigital Library
- Beygelzimer, A., Hsu, D., Langford, J., and Zhang, T. 2010. Agnostic active learning without constraints. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 199--207.Google Scholar
- Bilenko, M. and Mooney, R. J. 2003a. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’03). ACM, New York, 39--48. Google ScholarDigital Library
- Bilenko, M. and Mooney, R. J. 2003b. On evaluation and training-set construction for duplicate detection. In Proceedings of the KDD’03 Workshop on Data Cleaning, Record Linkage and Object Consolidation. 7--12.Google Scholar
- Chaudhuri, S., Chen, B.-C., Ganti, V., and Kaushik, R. 2007. Example-driven design of efficient record matching queries. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB’07). VLDB Endowment, 327--338. Google ScholarDigital Library
- Christen, P. 2005. Probabilistic data generation for deduplication and data linkage. In Proceedings of the 6th International Conference on Intelligent Data Engineering and Automated Learning. 109--116. Google ScholarDigital Library
- Christen, P. 2012. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 24, 9, 1537--1555. Google ScholarDigital Library
- Cohn, D., Atlas, L., and Ladner, R. 1994. Improving generalization with active learning. Mach. Learn. 15, 2, 201--221. Google ScholarDigital Library
- Dasgupta, S. and Hsu, D. 2008. Hierarchical sampling for active learning. In Proceedings of the International Conference on Machine Learning. 208--215. Google ScholarDigital Library
- Dasgupta, S. and Langford, J. 2009. Tutorial summary: Active learning. In Proceedings of the International Conference on Machine Learning. 178. Google ScholarDigital Library
- Dasgupta, S., Hsu, D., and Monteleoni, C. 2007. A general agnostic active learning algorithm. In Proceedings of the Conference on Advances in Neural Information Processing Systems.Google Scholar
- Elmagarmid, A. K., Ipeirotis, P. G., and Verykios, V. S. 2007. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng. 19, 1--16. Google ScholarDigital Library
- Frank, A. and Asuncion, A. 2010. UCI machine learning repository. http://archive.ics.uci.edu/ml/.Google Scholar
- Hanneke, S. 2007. A bound on the label complexity of agnostic active learning. In Proceedings of the International Conference on Machine Learning. 353--360. Google ScholarDigital Library
- Hanneke, S. 2009. Adaptive rates of convergence in active learning. In Proceedings of the 20th Annual Conference on Learning Theory.Google Scholar
- Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58, 301, 13--30.Google ScholarCross Ref
- Huang, S.-J., Jin, R., and Zhou, Z.-H. 2010. Active learning by querying informative and representative examples. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 892--900.Google Scholar
- Karampatziakis, N. and Langford, J. 2011. Online importance weight aware updates. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. 392--399.Google Scholar
- Köpcke, H. and Rahm, E. 2008. Training selection for tuning entity matching. In Proceedings of the International Workshop on Quality in Databases and Management of Uncertain Data. 3--12.Google Scholar
- Köpcke, H. and Rahm, 2010. Frameworks for entity matching: A comparison. Data Knowl. Eng. 69, 197--210. Google ScholarDigital Library
- McCallum, A., Nigam, K., and Ungar, L. H. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’00). ACM, New York, 169--178. Google ScholarDigital Library
- McNeill, N., Kardes, H., and Borthwick, A. 2012. Dynamic record blocking: Efficient linking of massive databases in MapReduce. In Proceedings of the 10th International Workshop on Quality in Databases (QDB) in conjunction with VLDB’12.Google Scholar
- Mineiro, P. 2003. Cost-sensitive binary classification and active learning. http://www.machinedlearnings.com/2012/01/cost-sensitive-binary-classification.html.Google Scholar
- Preparata, F. P. and Shamos, M. I. 1985. Computational Geometry: An Introduction. Springer. Google ScholarDigital Library
- Rendle, S. and Schmidt-Thieme, L. 2008. Active learning of equivalence relations by minimizing the expected loss using constraint inference. In Proceedings of the IEEE International Conference on Data Mining. 1001--1006. Google ScholarDigital Library
- Sarawagi, S. and Bhamidipaty, A. 2002. Interactive deduplication using active learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 269--278. Google ScholarDigital Library
- Settles, B. 2009. Active learning literature survey. Computer Sciences Tech. rep. 1648, University of Wisconsin--Madison.Google Scholar
- Tejada, S., Knoblock, C. A., and Minton, S. 2001. Learning object identification rules for information integration. Inf. Syst. 26, 607--633. Google ScholarDigital Library
- Wang, W. and Zhou, Z.-H. 2010. Multi-view active learning in the non-realizable case. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 2388--2396.Google Scholar
- Whang, S. E., Menestrina, D., Koutrika, G., Theobald, M., and Garcia-Molina, H. 2009. Entity resolution with iterative blocking. In Proceedings of the 35th SIGMOD International Conference on Management of Data (SIGMOD’09). ACM, 219--232. Google ScholarDigital Library
- Winkler, W. E. 1999. The state of record linkage and current research problems. Tech. rep., Statistical Research Division, U.S. Census Bureau.Google Scholar
- Zadrozny, B., Langford, J., and Abe, N. 2012. Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the IEEE International Conference on Data Mining. Google ScholarDigital Library
Index Terms
- Active Sampling for Entity Matching with Guarantees
Recommendations
Active sampling for entity matching
KDD '12: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data miningIn entity matching, a fundamental issue while training a classifier to label pairs of entities as either duplicates or non-duplicates is the one of selecting informative training examples. Although active learning presents an attractive solution to this ...
Entity Matching with Active Monotone Classification
PODS '18: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database SystemsGiven two sets of entities X and Y, entity matching aims to decide whether x and y represent the same entity for each pair (x, y) ın X x Y. As the last resort, human experts can be called upon to inspect every (x, y), but this is expensive because the ...
Ground Truth Inference for Weakly Supervised Entity Matching
PACMMODEntity matching (EM) refers to the problem of identifying pairs of data records in one or more relational tables that refer to the same entity in the real world. Supervised machine learning (ML) models currently achieve state-of-the-art matching ...
Comments