skip to main content
10.1145/2339530.2339707acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

Active sampling for entity matching

Published:12 August 2012Publication History

ABSTRACT

In entity matching, a fundamental issue while training a classifier to label pairs of entities as either duplicates or non-duplicates is the one of selecting informative training examples. Although active learning presents an attractive solution to this problem, previous approaches minimize the misclassification rate (0-1 loss) of the classifier, which is an unsuitable metric for entity matching due to class imbalance (i.e., many more non-duplicate pairs than duplicate pairs). To address this, a recent paper [1] proposes to maximize recall of the classifier under the constraint that its precision should be greater than a specified threshold. However, the proposed technique requires the labels of all n input pairs in the worst-case.

Our main result is an active learning algorithm that approximately maximizes recall of the classifier while respecting a precision constraint with provably sub-linear label complexity (under certain distributional assumptions). Our algorithm uses as a black-box any active learning module that minimizes 0-1 loss. We show that label complexity of our algorithm is at most log n times the label complexity of the black-box, and also bound the difference in the recall of classifier learnt by our algorithm and the recall of the optimal classifier satisfying the precision constraint. We provide an empirical evaluation of our algorithm on several real-world matching data sets that demonstrates the effectiveness of our approach.

Skip Supplemental Material Section

Supplemental Material

311b_t_talk_13.mp4

mp4

324 MB

References

  1. Arvind Arasu, Michaela Götz, and Raghav Kaushik. On active learning of record matching packages. In SIGMOD Conference, pages 783--794, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. J. Comput. Syst. Sci., 75(1):78--89, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Maria-Florina Balcan, Andrei Z. Broder, and Tong Zhang. Margin based active learning. In COLT, pages 35--50, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In ICML, page 7, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Alina Beygelzimer, Daniel Hsu, John Langford, and Tong Zhang. Agnostic active learning without constraints. In NIPS, pages 199--207, 2010.Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '03, pages 39--48, New York, NY, USA, 2003. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Mikhail Bilenko and Raymond J. Mooney. On evaluation and training-set construction for duplicate detection. In Proceedings of the KDD-2003 Workshop on Data Cleaning, Record Linkage and Object Consolidation, Washington DC, pages 7--12, 2003.Google ScholarGoogle Scholar
  8. Surajit Chaudhuri, Bee-Chung Chen, Venkatesh Ganti, and Raghav Kaushik. Example-driven design of efficient record matching queries. In Proceedings of the 33rd international conference on Very large data bases, VLDB '07, pages 327--338. VLDB Endowment, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Peter Christen. Probabilistic data generation for deduplication and data linkage. In IDEAL, pages 109--116, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. David Cohn, Les Atlas, and Richard Ladner. Improving Generalization with Active Learning. Mach. Learn., 15(2):201--221, May 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In ICML, pages 208--215, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Sanjoy Dasgupta, Daniel Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. In NIPS, 2007.Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Sanjoy Dasgupta and John Langford. Tutorial summary: Active learning. In ICML, page 178, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Trans. on Knowl. and Data Eng., 19:1--16, January 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. A. Frank and A. Asuncion. UCI machine learning repository, 2010.Google ScholarGoogle Scholar
  16. Steve Hanneke. A bound on the label complexity of agnostic active learning. In ICML, pages 353--360, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Steve Hanneke. Adaptive rates of convergence in active learning. In COLT, 2009.Google ScholarGoogle Scholar
  18. Sheng-Jun Huang, Rong Jin, and Zhi-Hua Zhou. Active learning by querying informative and representative examples. In NIPS, pages 892--900, 2010.Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Nikos Karampatziakis and John Langford. Online importance weight aware updates. In UAI, pages 392--399, 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Hanna Köpcke and Erhard Rahm. Training selection for tuning entity matching. In QDB/MUD, pages 3--12, 2008.Google ScholarGoogle Scholar
  21. Hanna Köpcke and Erhard Rahm. Frameworks for entity matching: A comparison. Data Knowl. Eng., 69:197--210, February 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '00, pages 169--178, New York, NY, USA, 2000. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Paul Mineiro. Cost-Sensitive Binary Classification and Active Learning, 2003.Google ScholarGoogle Scholar
  24. Franco P. Preparata and Michael I. Shamos. Computational geometry: an introduction. Springer-Verlag New York, Inc., New York, NY, USA, 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Steffen Rendle and Lars Schmidt-Thieme. Active learning of equivalence relations by minimizing the expected loss using constraint inference. In ICDM, pages 1001--1006, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Sunita Sarawagi and Anuradha Bhamidipaty. Interactive deduplication using active learning. In KDD, pages 269--278, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Burr Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison, 2009.Google ScholarGoogle Scholar
  28. Sheila Tejada, Craig A. Knoblock, and Steven Minton. Learning object identification rules for information integration. Inf. Syst., 26:607--633, December 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Wei Wang and Zhi-Hua Zhou. Multi-view active learning in the non-realizable case. In NIPS, pages 2388--2396, 2010.Google ScholarGoogle Scholar
  30. Steven Euijong Whang, David Menestrina, Georgia Koutrika, Martin Theobald, and Hector Garcia-Molina. Entity resolution with iterative blocking. In Proceedings of the 35th SIGMOD international conference on Management of data, SIGMOD '09, pages 219--232, New York, NY, USA, 2009. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. William E. Winkler. The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Census Bureau, 1999.Google ScholarGoogle Scholar
  32. Bianca Zadrozny, John Langford, and Naoki Abe. Cost-Sensitive Learning by Cost-Proportionate Example Weighting, 2012.Google ScholarGoogle Scholar

Index Terms

  1. Active sampling for entity matching

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        KDD '12: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
        August 2012
        1616 pages
        ISBN:9781450314626
        DOI:10.1145/2339530

        Copyright © 2012 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 12 August 2012

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate1,133of8,635submissions,13%

        Upcoming Conference

        KDD '24

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader