skip to main content
research-article

Active Sampling for Entity Matching with Guarantees

Published:01 September 2013Publication History
Skip Abstract Section

Abstract

In entity matching, a fundamental issue while training a classifier to label pairs of entities as either duplicates or nonduplicates is the one of selecting informative training examples. Although active learning presents an attractive solution to this problem, previous approaches minimize the misclassification rate (0--1 loss) of the classifier, which is an unsuitable metric for entity matching due to class imbalance (i.e., many more nonduplicate pairs than duplicate pairs). To address this, a recent paper [Arasu et al. 2010] proposes to maximize recall of the classifier under the constraint that its precision should be greater than a specified threshold. However, the proposed technique requires the labels of all n input pairs in the worst case.

Our main result is an active learning algorithm that approximately maximizes recall of the classifier while respecting a precision constraint with provably sublinear label complexity (under certain distributional assumptions). Our algorithm uses as a black box any active learning module that minimizes 0--1 loss. We show that label complexity of our algorithm is at most log n times the label complexity of the black box, and also bound the difference in the recall of classifier learnt by our algorithm and the recall of the optimal classifier satisfying the precision constraint. We provide an empirical evaluation of our algorithm on several real-world matching data sets that demonstrates the effectiveness of our approach.

References

  1. Arasu, A., Götz, M., and Kaushik, R. 2010. On active learning of record matching packages. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 783--794. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Balcan, M.-F., Broder, A. Z., and Zhang, T. 2007. Margin based active learning. In Proceedings of the 20th Annual Conference on Learning Theory. 35--50. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Balcan, M.-F., Beygelzimer, A., and Langford, J. 2009. Agnostic active learning. J. Comput. Syst. Sci. 75, 1, 78--89. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Beygelzimer, A., Dasgupta, S., and Langford, J. 2009. Importance weighted active learning. In Proceedings of the International Conference on Machine Learning. 7. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Beygelzimer, A., Hsu, D., Langford, J., and Zhang, T. 2010. Agnostic active learning without constraints. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 199--207.Google ScholarGoogle Scholar
  6. Bilenko, M. and Mooney, R. J. 2003a. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’03). ACM, New York, 39--48. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Bilenko, M. and Mooney, R. J. 2003b. On evaluation and training-set construction for duplicate detection. In Proceedings of the KDD’03 Workshop on Data Cleaning, Record Linkage and Object Consolidation. 7--12.Google ScholarGoogle Scholar
  8. Chaudhuri, S., Chen, B.-C., Ganti, V., and Kaushik, R. 2007. Example-driven design of efficient record matching queries. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB’07). VLDB Endowment, 327--338. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Christen, P. 2005. Probabilistic data generation for deduplication and data linkage. In Proceedings of the 6th International Conference on Intelligent Data Engineering and Automated Learning. 109--116. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Christen, P. 2012. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 24, 9, 1537--1555. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Cohn, D., Atlas, L., and Ladner, R. 1994. Improving generalization with active learning. Mach. Learn. 15, 2, 201--221. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Dasgupta, S. and Hsu, D. 2008. Hierarchical sampling for active learning. In Proceedings of the International Conference on Machine Learning. 208--215. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Dasgupta, S. and Langford, J. 2009. Tutorial summary: Active learning. In Proceedings of the International Conference on Machine Learning. 178. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Dasgupta, S., Hsu, D., and Monteleoni, C. 2007. A general agnostic active learning algorithm. In Proceedings of the Conference on Advances in Neural Information Processing Systems.Google ScholarGoogle Scholar
  15. Elmagarmid, A. K., Ipeirotis, P. G., and Verykios, V. S. 2007. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng. 19, 1--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Frank, A. and Asuncion, A. 2010. UCI machine learning repository. http://archive.ics.uci.edu/ml/.Google ScholarGoogle Scholar
  17. Hanneke, S. 2007. A bound on the label complexity of agnostic active learning. In Proceedings of the International Conference on Machine Learning. 353--360. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Hanneke, S. 2009. Adaptive rates of convergence in active learning. In Proceedings of the 20th Annual Conference on Learning Theory.Google ScholarGoogle Scholar
  19. Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58, 301, 13--30.Google ScholarGoogle ScholarCross RefCross Ref
  20. Huang, S.-J., Jin, R., and Zhou, Z.-H. 2010. Active learning by querying informative and representative examples. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 892--900.Google ScholarGoogle Scholar
  21. Karampatziakis, N. and Langford, J. 2011. Online importance weight aware updates. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. 392--399.Google ScholarGoogle Scholar
  22. Köpcke, H. and Rahm, E. 2008. Training selection for tuning entity matching. In Proceedings of the International Workshop on Quality in Databases and Management of Uncertain Data. 3--12.Google ScholarGoogle Scholar
  23. Köpcke, H. and Rahm, 2010. Frameworks for entity matching: A comparison. Data Knowl. Eng. 69, 197--210. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. McCallum, A., Nigam, K., and Ungar, L. H. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’00). ACM, New York, 169--178. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. McNeill, N., Kardes, H., and Borthwick, A. 2012. Dynamic record blocking: Efficient linking of massive databases in MapReduce. In Proceedings of the 10th International Workshop on Quality in Databases (QDB) in conjunction with VLDB’12.Google ScholarGoogle Scholar
  26. Mineiro, P. 2003. Cost-sensitive binary classification and active learning. http://www.machinedlearnings.com/2012/01/cost-sensitive-binary-classification.html.Google ScholarGoogle Scholar
  27. Preparata, F. P. and Shamos, M. I. 1985. Computational Geometry: An Introduction. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Rendle, S. and Schmidt-Thieme, L. 2008. Active learning of equivalence relations by minimizing the expected loss using constraint inference. In Proceedings of the IEEE International Conference on Data Mining. 1001--1006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Sarawagi, S. and Bhamidipaty, A. 2002. Interactive deduplication using active learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 269--278. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Settles, B. 2009. Active learning literature survey. Computer Sciences Tech. rep. 1648, University of Wisconsin--Madison.Google ScholarGoogle Scholar
  31. Tejada, S., Knoblock, C. A., and Minton, S. 2001. Learning object identification rules for information integration. Inf. Syst. 26, 607--633. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Wang, W. and Zhou, Z.-H. 2010. Multi-view active learning in the non-realizable case. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 2388--2396.Google ScholarGoogle Scholar
  33. Whang, S. E., Menestrina, D., Koutrika, G., Theobald, M., and Garcia-Molina, H. 2009. Entity resolution with iterative blocking. In Proceedings of the 35th SIGMOD International Conference on Management of Data (SIGMOD’09). ACM, 219--232. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Winkler, W. E. 1999. The state of record linkage and current research problems. Tech. rep., Statistical Research Division, U.S. Census Bureau.Google ScholarGoogle Scholar
  35. Zadrozny, B., Langford, J., and Abe, N. 2012. Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the IEEE International Conference on Data Mining. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Active Sampling for Entity Matching with Guarantees

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM Transactions on Knowledge Discovery from Data
        ACM Transactions on Knowledge Discovery from Data  Volume 7, Issue 3
        Special Issue on ACM SIGKDD 2012
        September 2013
        156 pages
        ISSN:1556-4681
        EISSN:1556-472X
        DOI:10.1145/2513092
        Issue’s Table of Contents

        Copyright © 2013 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 1 September 2013
        • Accepted: 1 February 2013
        • Revised: 1 December 2012
        • Received: 1 September 2012
        Published in tkdd Volume 7, Issue 3

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article
        • Research
        • Refereed

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader