research-article

Distributed representations of tuples for entity resolution

Published: 01 July 2018

Abstract

Despite more than 70 years of effort on all aspects of entity resolution (ER), there is still high demand for democratizing ER by reducing the heavy human involvement in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. Building on recent advances in deep learning, in particular distributed representations of words (a.k.a. word embeddings), we present a novel ER system, called DeepER, that achieves good accuracy, high efficiency, and ease of use (i.e., much less human effort). We use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short-term memory (LSTM) hidden units, to convert each tuple to a distributed representation (i.e., a vector), which can in turn be used to effectively capture similarities between tuples. We consider both the case where pre-trained word embeddings are available and the case where they are not; we present ways to learn and tune the distributed representations so that they are customized for a specific ER task under different scenarios. We propose a locality sensitive hashing (LSH) based blocking approach that takes all attributes of a tuple into consideration and produces much smaller blocks than traditional methods, which consider only a few attributes. We evaluate our algorithms on multiple datasets (including benchmarks, biomedical data, and multi-lingual data), and the extensive experimental results show that DeepER outperforms existing solutions.
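To make the blocking idea concrete, the following is a minimal sketch of LSH-based blocking over tuple vectors. The abstract does not say which LSH family DeepER uses, so random-hyperplane hashing (a common choice for cosine similarity) is assumed here. The tuple vectors are hand-written placeholders rather than the output of an LSTM, and all function names and parameters (`make_hyperplanes`, `lsh_blocks`, `num_tables`, `bits_per_table`) are hypothetical.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two tuple vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def make_hyperplanes(num_tables, bits_per_table, dim, seed=0):
    """Draw Gaussian hyperplanes: one group of `bits_per_table` planes per hash table."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits_per_table)]
        for _ in range(num_tables)
    ]


def signature(vector, planes):
    """One bit per hyperplane: the side of the plane on which the vector falls."""
    return tuple(
        1 if sum(v * p for v, p in zip(vector, plane)) >= 0.0 else 0
        for plane in planes
    )


def lsh_blocks(tuple_vectors, tables):
    """Return candidate pairs: tuple ids sharing a signature in at least one table."""
    candidates = set()
    for planes in tables:
        buckets = defaultdict(list)
        for tuple_id, vector in tuple_vectors.items():
            buckets[signature(vector, planes)].append(tuple_id)
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates


# Placeholder tuple vectors (assumption: written by hand, not learned).
tuple_vectors = {
    "a1": [0.90, 0.10, 0.05, 0.20],
    "a2": [0.88, 0.12, 0.04, 0.21],  # intended as a near-duplicate of a1
    "b1": [-0.70, 0.60, -0.30, 0.10],
}

tables = make_hyperplanes(num_tables=4, bits_per_table=3, dim=4, seed=42)
pairs = lsh_blocks(tuple_vectors, tables)
```

Only the pairs returned by `lsh_blocks` would then be passed to the more expensive matching step. Because each signature is computed from the whole tuple vector, every attribute that contributed to the vector influences the block assignment. Raising `bits_per_table` tends to shrink blocks, while raising `num_tables` tends to recover pairs that a single table would miss.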


Published in: Proceedings of the VLDB Endowment, Volume 11, Issue 11, July 2018, 507 pages. ISSN: 2150-8097. Publisher: VLDB Endowment.
