Abstract
Despite more than 70 years of work on all aspects of entity resolution (ER), there is still a high demand for democratizing ER by reducing the heavy human involvement in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. Building on recent advances in deep learning, in particular distributed representations of words (a.k.a. word embeddings), we present a novel ER system, called DeepER, that achieves good accuracy, high efficiency, and ease of use (i.e., much less human effort). We use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short-term memory (LSTM) hidden units, to convert each tuple to a distributed representation (i.e., a vector), which can in turn be used to effectively capture similarities between tuples. We consider both the case where pre-trained word embeddings are available and the case where they are not; we present ways to learn and tune distributed representations that are customized for a specific ER task under different scenarios. We propose a locality-sensitive hashing (LSH) based blocking approach that takes all attributes of a tuple into consideration and produces much smaller blocks than traditional methods that consider only a few attributes. We evaluate our algorithms on multiple datasets (including benchmarks, biomedical data, and multi-lingual data), and the extensive experimental results show that DeepER outperforms existing solutions.
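To make the idea of LSH-based blocking over tuple vectors concrete, the following is a minimal sketch and not DeepER's implementation. It rests on several assumptions that the abstract does not state: tuple vectors are composed by simple averaging of word vectors (a stand-in for the LSTM-based composition), the LSH family is random-hyperplane hashing for cosine similarity, and all function names (`average_embedding`, `make_hyperplanes`, `lsh_signature`, `block_tuples`) are hypothetical.

```python
import random
from collections import defaultdict


def average_embedding(tokens, embeddings, dim):
    """Compose a tuple vector by averaging its word vectors.

    Assumption: averaging is a simple stand-in for the LSTM-based
    composition described in the abstract. Unknown tokens are skipped.
    """
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes (assumed LSH family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the hyperplane vec falls on."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )


def block_tuples(tuple_vectors, hyperplanes):
    """Group tuple ids by signature; each group is a candidate block."""
    blocks = defaultdict(list)
    for tuple_id, vec in tuple_vectors.items():
        blocks[lsh_signature(vec, hyperplanes)].append(tuple_id)
    return dict(blocks)
```

Because the vector is built from every token of the tuple, all attributes influence which block a tuple falls into. Only tuples that share a signature would then be compared pairwise; using more bits yields smaller blocks at the cost of a higher chance of separating true matches.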