DOI: 10.1145/2064085.2064087
research-article

Learning-based entity resolution with MapReduce

Published: 28 October 2011

ABSTRACT

Entity resolution is a crucial step for data quality and data integration. Learning-based approaches show high effectiveness at the expense of poor efficiency. To reduce the typically high execution times, we investigate how learning-based entity resolution can be realized in a cloud infrastructure using MapReduce. We propose and evaluate two efficient MapReduce-based strategies for pair-wise similarity computation and classifier application on the Cartesian product of two input sources. Our evaluation is based on real-world datasets and shows the high efficiency and effectiveness of the proposed approaches.
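
To make the idea concrete, the following is a minimal single-process sketch of how pair-wise similarity computation and classifier application over the Cartesian product of two sources might be phrased as map and reduce steps. It is not one of the two strategies proposed in the paper. The block partitioning scheme, the token-based Jaccard similarity, the threshold "classifier", and all function and field names (`map_phase`, `reduce_phase`, `resolve`, `id`, `title`) are assumptions chosen for illustration only.

```python
from collections import defaultdict
from itertools import product


def jaccard(a, b):
    """Token-set Jaccard similarity of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def classify(features, threshold=0.5):
    """Hypothetical stand-in for a trained classifier:
    declares a match if the mean similarity reaches the threshold."""
    return sum(features) / len(features) >= threshold


def map_phase(source_r, source_s, r_blocks, s_blocks):
    """Assign each record to a block and emit it once for every
    block pair (i, j) in which it has to be compared."""
    for idx, rec in enumerate(source_r):
        i = idx % r_blocks
        for j in range(s_blocks):
            yield (i, j), ("R", rec)
    for idx, rec in enumerate(source_s):
        j = idx % s_blocks
        for i in range(r_blocks):
            yield (i, j), ("S", rec)


def shuffle(pairs):
    """Group emitted values by key, as the MapReduce framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values, attributes):
    """Compare every R record with every S record of one block pair
    and emit the id pairs that the classifier accepts."""
    r_recs = [rec for tag, rec in values if tag == "R"]
    s_recs = [rec for tag, rec in values if tag == "S"]
    for r, s in product(r_recs, s_recs):
        features = [jaccard(r[a], s[a]) for a in attributes]
        if classify(features):
            yield r["id"], s["id"]


def resolve(source_r, source_s, attributes, r_blocks=2, s_blocks=2):
    """Run map, shuffle, and reduce in one process and collect matches."""
    groups = shuffle(map_phase(source_r, source_s, r_blocks, s_blocks))
    matches = set()
    for values in groups.values():
        matches.update(reduce_phase(values, attributes))
    return matches


if __name__ == "__main__":
    R = [{"id": "r1", "title": "Learning-based entity resolution with MapReduce"},
         {"id": "r2", "title": "Parallel linkage"}]
    S = [{"id": "s1", "title": "learning-based entity resolution with mapreduce"},
         {"id": "s2", "title": "The weka data mining software"}]
    print(resolve(R, S, ["title"]))
```

Because each record of the first source belongs to exactly one row block and each record of the second source to exactly one column block, every record pair meets in exactly one reduce group, so no comparison is duplicated. The price is that each record is replicated once per block of the other source, which is the kind of trade-off a real MapReduce strategy has to balance.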


Published in

CloudDB '11: Proceedings of the third international workshop on Cloud data management
October 2011
56 pages
ISBN: 9781450309561
DOI: 10.1145/2064085
General Chair: Xiaofeng Meng
Program Chairs: Zhiming Ding, Haibo Hu

Copyright © 2011 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 12 of 17 submissions, 71%
