Learning-based entity resolution with MapReduce

Authors:
Lars Kolb, University of Leipzig, Leipzig, Germany
Hanna Köpcke, University of Leipzig, Leipzig, Germany
Andreas Thor, University of Leipzig, Leipzig, Germany
Erhard Rahm, University of Leipzig, Leipzig, Germany

CloudDB '11: Proceedings of the third international workshop on Cloud data management, October 2011, Pages 1–6, https://doi.org/10.1145/2064085.2064087

Published: 28 October 2011

ABSTRACT

Entity resolution is a crucial step for data quality and data integration. Learning-based approaches show high effectiveness at the expense of poor efficiency. To reduce the typically high execution times, we investigate how learning-based entity resolution can be realized in a cloud infrastructure using MapReduce. We propose and evaluate two efficient MapReduce-based strategies for pair-wise similarity computation and classifier application on the Cartesian product of two input sources. Our evaluation is based on real-world datasets and shows the high efficiency and effectiveness of the proposed approaches.
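
The idea of distributing pair-wise similarity computation and classifier application over the Cartesian product of two sources can be illustrated with a minimal, single-process sketch. It is not the authors' implementation, and the abstract does not describe their two strategies in detail. The sketch assumes a simple block-pair partitioning: each source is split into blocks, the map step replicates every record to all block pairs it participates in, and each reduce call compares the records of one block pair. The record layout (`id`, `title`, `maker`), the `difflib`-based similarity, and the fixed-weight threshold classifier are hypothetical stand-ins for learned components.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import product


def map_phase(records, source, num_own_blocks, num_other_blocks):
    """Emit ((r_block, s_block), (source, record)) for every block pair a record joins."""
    for idx, rec in enumerate(records):
        own = idx % num_own_blocks  # assumed round-robin block assignment
        for other in range(num_other_blocks):
            key = (own, other) if source == "R" else (other, own)
            yield key, (source, rec)


def similarity(a, b):
    """Feature vector for a record pair: one string similarity per attribute."""
    return [
        SequenceMatcher(None, a[attr].lower(), b[attr].lower()).ratio()
        for attr in ("title", "maker")
    ]


def classify(features, weights=(0.7, 0.3), threshold=0.8):
    """Stand-in for a trained classifier: weighted score against a threshold."""
    return sum(w * f for w, f in zip(weights, features)) >= threshold


def reduce_phase(values):
    """Compare all R x S pairs of one block pair and emit the predicted matches."""
    r_recs = [rec for src, rec in values if src == "R"]
    s_recs = [rec for src, rec in values if src == "S"]
    for r, s in product(r_recs, s_recs):
        if classify(similarity(r, s)):
            yield r["id"], s["id"]


def resolve(R, S, r_blocks=2, s_blocks=2):
    """Simulate map, shuffle, and reduce in one process."""
    groups = defaultdict(list)  # plays the role of the shuffle step
    for key, value in map_phase(R, "R", r_blocks, s_blocks):
        groups[key].append(value)
    for key, value in map_phase(S, "S", s_blocks, r_blocks):
        groups[key].append(value)
    matches = set()
    for values in groups.values():
        matches.update(reduce_phase(values))
    return matches


# Hypothetical example data, not taken from the paper's datasets.
R = [
    {"id": "r1", "title": "Canon PowerShot A590", "maker": "Canon"},
    {"id": "r2", "title": "Nikon Coolpix S210", "maker": "Nikon"},
    {"id": "r3", "title": "Sony Cyber-shot W120", "maker": "Sony"},
]
S = [
    {"id": "s1", "title": "canon powershot a590", "maker": "canon"},
    {"id": "s2", "title": "Nikon Coolpix S210", "maker": "Nikon"},
]

print(sorted(resolve(R, S)))
```

Because every record pair meets in exactly one block pair, the set of predicted matches does not depend on the number of blocks chosen; only the degree of parallelism and the amount of record replication change.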

Index Terms

Information systems

Published in
CloudDB '11: Proceedings of the third international workshop on Cloud data management
October 2011
56 pages
ISBN: 9781450309561
DOI: 10.1145/2064085
General Chair: Xiaofeng Meng, Renmin University of China, China
Program Chairs: Zhiming Ding, Institute of Software, Chinese Academy of Sciences, China; Haibo Hu, Hong Kong Baptist University, Hong Kong, China
Copyright © 2011 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cartesian product
entity resolution
machine learning
mapreduce
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate: 12 of 17 submissions, 71%