ABSTRACT
In this paper, we propose a fast approach that parallelizes the deduplication process on multicore processors. Our approach, named MD-Approach, combines an efficient blocking method with a robust data-parallel programming model. The blocking phase consists of two steps. The first step generates large blocks by grouping records with a low degree of similarity. The second step segments large blocks, which could otherwise cause load imbalance, into more precise sub-blocks. A data-parallel programming model is used to implement our approach as a sequence of map and reduce operations. An empirical evaluation has shown that our deduplication approach is almost twice as fast as BTO-BK, a scalable parallel deduplication solution for distributed environments. To the best of our knowledge, MD-Approach is the first approach to focus on multicore processors for parallel deduplication.