ABSTRACT
Entity resolution is a crucial step for data quality and data integration. Learning-based approaches achieve high effectiveness, but at the expense of efficiency. To reduce their typically long execution times, we investigate how learning-based entity resolution can be realized in a cloud infrastructure using MapReduce. We propose and evaluate two efficient MapReduce-based strategies for pair-wise similarity computation and classifier application on the Cartesian product of two input sources. Our evaluation is based on real-world datasets and shows the high efficiency and effectiveness of the proposed approaches.
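The abstract does not spell out the two strategies, so the following is only a minimal single-process sketch of the general idea it names: covering the Cartesian product of two sources with independent tasks, each of which computes pair-wise similarities and applies a classifier. The partitioning scheme, the attribute names (`title`, `maker`), the `difflib`-based similarity, and the fixed-threshold `classify` function standing in for a trained model are all assumptions made for illustration, not the authors' method.

```python
from difflib import SequenceMatcher
from itertools import product

ATTRIBUTES = ("title", "maker")  # hypothetical attribute names


def similarity_features(a, b):
    """Feature vector for one candidate pair: one string similarity per attribute."""
    return [
        SequenceMatcher(None, a[k].lower(), b[k].lower()).ratio()
        for k in ATTRIBUTES
    ]


def classify(features, threshold=0.8):
    """Stand-in for a trained classifier: match if the mean similarity is high."""
    return sum(features) / len(features) >= threshold


def partition(records, n):
    """Split records into n round-robin partitions (some may be empty)."""
    return [records[i::n] for i in range(n)]


def map_task(part_r, part_s):
    """One task: compare every pair in part_r x part_s and emit matching id pairs."""
    for a, b in product(part_r, part_s):
        if classify(similarity_features(a, b)):
            yield (a["id"], b["id"])


def resolve(source_r, source_s, n_r=2, n_s=2):
    """Cover the full Cartesian product with n_r * n_s independent tasks.

    In a real MapReduce job each (part_r, part_s) combination would run on a
    separate worker; here the tasks simply run one after another.
    """
    matches = []
    for part_r in partition(source_r, n_r):
        for part_s in partition(source_s, n_s):
            matches.extend(map_task(part_r, part_s))
    return sorted(matches)


# Toy input, invented for this example.
R = [
    {"id": "r1", "title": "Canon PowerShot A95", "maker": "Canon"},
    {"id": "r2", "title": "Nikon Coolpix 4300", "maker": "Nikon"},
]
S = [
    {"id": "s1", "title": "canon powershot a95", "maker": "canon"},
    {"id": "s2", "title": "Sony Cyber-shot P100", "maker": "Sony"},
]

if __name__ == "__main__":
    print(resolve(R, S))
```

Because every pair of partitions is processed independently, the result does not depend on how many partitions are chosen; only the amount of work per task changes, which is what makes this kind of decomposition a natural fit for parallel execution.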
Index Terms
- Learning-based entity resolution with MapReduce