skip to main content
10.1145/1982185.1982411acmconferencesArticle/Chapter ViewAbstractPublication PagessacConference Proceedingsconference-collections
research-article

A fast approach for parallel deduplication on multicore processors

Published:21 March 2011Publication History

ABSTRACT

In this paper, we propose a fast approach that parallelizes the deduplication process on multicore processors. Our approach, named MD-Approach, combines an efficient blocking method with a robust data parallel programming model. The blocking phase is composed of two steps. The first step generates large blocks by grouping records with low degree of similarity. The second step segments large blocks, that may result in unbalanced load, in more precise sub-blocks. A parallel data programming model is used to implement our approach in a sequence of both map and reduce operations. An empirical evaluation has shown that our deduplication approach is almost twice faster than BTO-BK, that is a scalable parallel deduplication solution in distributed environment. To the best of our knowledge, MD-Approach is the first to focus on multicore processors for parallel dedu-plication.

References

  1. A. Aizawa and K. Oyama. A fast linkage detection scheme for multi-source information integration. In Proceedings of the International Workshop on Challenges in Web Information Retrieval and Integration, pages 30--39, Washington, DC, USA, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. R. Baxter, P. Christen, and C. F. Epidemiology. A comparison of fast blocking methods for record linkage. In Proceedings of Workshop Data Cleaning, Record Linkage, and Object Consolidation, 2003.Google ScholarGoogle Scholar
  3. O. Benjelloun, H. Garcia-Molina, H. Gong, H. Kawai, T. E. Larson, D. Menestrina, and S. Thavisomboon. D-swoosh: A family of algorithms for generic, distributed entity resolution. Distributed Computing Systems, International Conference on, 0: 37, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. P. Christen. Probabilistic data generation for deduplication and data linkage. In Intelligent Data Engineering and Automated Learning - IDEAL 2005, pages 109--116. Springer Berlin/Heidelberg, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. P. Christen, T. Churches, and M. Hegland. Febrl - a parallel open source data linkage system. In PAKDD, pages 638--647, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  6. T. de Vries, H. Ke, S. Chawla, and P. Christen. Robust record linkage blocking using suffix arrays. In CIKM '09: Proceeding of the 18th ACM conference on Information and knowledge management, pages 305--314, New York, NY, USA, 2009. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. In Proceedings of the 6th conference on Symposium on Operating Systems Design and Implementation, Berkeley, CA, USA, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1): 107--113, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. A. K. Elmagarmid, P. G. Ipeirotis, Vassilios, and S. Verykios. Duplicate record detection: A survey. Transactions on Knowledge and Data Engineering. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. H. Kim and D. Lee. Parallel linkage. In CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD '00: ACM SIGKDD. ACM, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. C. Peter. Performance and scalability of fast blocking techniques for deduplication and data linkage. Proc. VLDB Endow., 1(2): 1253--1264, 2007.Google ScholarGoogle Scholar
  13. C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating mapreduce for multi-core and multiprocessor systems. In HPCA '07, pages 13--24, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. W. Santos, T. Teixeira, C. Machado, W. M. Jr., R. Ferreira, D. Guedes, and A. S. D. Silva. A scalable parallel deduplication algorithm. Computer Architecture and High Performance Computing, Symposium on, 0: 79--86, 2007.Google ScholarGoogle Scholar
  15. R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using mapreduce. In SIGMOD '10: Proceedings of the 2010 international conference on Management of data, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. A fast approach for parallel deduplication on multicore processors

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SAC '11: Proceedings of the 2011 ACM Symposium on Applied Computing
      March 2011
      1868 pages
      ISBN:9781450301138
      DOI:10.1145/1982185

      Copyright © 2011 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 21 March 2011

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate1,650of6,669submissions,25%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader