ABSTRACT
In this paper, we propose a fast approach that parallelizes the deduplication process on multicore processors. Our approach, named MD-Approach, combines an efficient blocking method with a robust data-parallel programming model. The blocking phase consists of two steps. The first step generates large blocks by grouping records with a low degree of similarity. The second step segments large blocks, which could otherwise cause load imbalance, into more precise sub-blocks. A data-parallel programming model is used to implement our approach as a sequence of map and reduce operations. An empirical evaluation has shown that our deduplication approach is almost twice as fast as BTO-BK, a scalable parallel deduplication solution for distributed environments. To the best of our knowledge, MD-Approach is the first approach to focus on multicore processors for parallel deduplication.