Distributed representations of tuples for entity resolution

Authors:
Muhammad Ebraheem (Qatar Computing Research Institute, HBKU, Qatar),
Saravanan Thirumuruganathan (Qatar Computing Research Institute, HBKU, Qatar),
Shafiq Joty (Nanyang Technological University, Singapore),
Mourad Ouzzani (Qatar Computing Research Institute, HBKU, Qatar),
Nan Tang (Qatar Computing Research Institute, HBKU, Qatar)

Proceedings of the VLDB Endowment, Volume 11, Issue 11, pp. 1454–1467. https://doi.org/10.14778/3236187.3236198

Published: 01 July 2018

Abstract

Despite more than 70 years of effort on all aspects of entity resolution (ER), there is still a high demand for democratizing ER by reducing the heavy human involvement in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. Building on recent advances in deep learning, in particular distributed representations of words (a.k.a. word embeddings), we present a novel ER system, called DeepER, that achieves good accuracy, high efficiency, and ease of use (i.e., much less human effort). We use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short-term memory (LSTM) hidden units, to convert each tuple to a distributed representation (i.e., a vector), which can in turn be used to effectively capture similarities between tuples. We consider both the case where pre-trained word embeddings are available and the case where they are not; we present ways to learn and tune distributed representations that are customized for a specific ER task under different scenarios. We propose a locality sensitive hashing (LSH) based blocking approach that takes all attributes of a tuple into consideration and produces much smaller blocks than traditional methods that consider only a few attributes. We evaluate our algorithms on multiple datasets (including benchmarks, biomedical data, and multi-lingual data), and the extensive experimental results show that DeepER outperforms existing solutions.
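To make the two central ideas concrete (a tuple mapped to a single vector, and LSH blocking over that vector), here is a minimal Python sketch. It is not the DeepER implementation. It swaps the LSTM-based composition for plain averaging of word vectors, uses random toy embeddings in place of pre-trained ones, and assumes a random-hyperplane LSH family, which the abstract does not specify. All names (`embed_word`, `embed_tuple`, `lsh_signature`, `block`, `DIM`, `NUM_BITS`) are hypothetical.

```python
import math
import random
from collections import defaultdict

DIM = 8        # toy embedding size (assumption; real embeddings are larger)
NUM_BITS = 4   # number of random hyperplanes, i.e. bits per LSH signature

rng = random.Random(0)

# Random hyperplanes through the origin; each one contributes one signature bit.
hyperplanes = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_BITS)]

_vocab = {}

def embed_word(word):
    """Toy stand-in for a pre-trained embedding lookup (random but stable per word)."""
    if word not in _vocab:
        _vocab[word] = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    return _vocab[word]

def embed_tuple(record):
    """Average the word vectors of every token in every attribute value.

    Averaging is a simplification used here instead of an LSTM composition.
    """
    tokens = [tok for value in record.values() for tok in str(value).lower().split()]
    if not tokens:
        raise ValueError("record has no tokens to embed")
    vectors = [embed_word(tok) for tok in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def lsh_signature(vec):
    """One bit per hyperplane: which side of the hyperplane the vector lies on."""
    return tuple(
        int(sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0.0) for h in hyperplanes
    )

def block(records):
    """Group record ids by LSH signature; only ids in the same block get compared."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[lsh_signature(embed_tuple(record))].append(record_id)
    return dict(blocks)
```

Because the tuple vector is built from all attribute values, two records that split the same tokens differently across attributes still get the same vector and fall into the same block. With more hyperplanes the blocks become smaller, at the cost of a higher chance that true matches are separated.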

Index Terms

Distributed representations of tuples for entity resolution
1. Information systems
  1. Information retrieval

Index terms have been assigned to the content through auto-classification.

Published in

Proceedings of the VLDB Endowment, Volume 11, Issue 11 (July 2018), 507 pages
ISSN: 2150-8097
Editors: Sihem Amer-Yahia (University of Grenoble Alpes, CNRS), Jian Pei (Simon Fraser University)
Publisher
VLDB Endowment