Abstract
This tutorial brings together perspectives on entity resolution (ER) from a variety of fields, including databases, machine learning, natural language processing, and information retrieval, to survey a large body of work in one setting. We discuss both the practical aspects and the theoretical underpinnings of ER. We describe existing solutions, current challenges, and open research problems.
Index Terms
- Entity resolution: theory, practice & open challenges