research-article

Effective and efficient retrieval of structured entities

Published: 01 February 2020

Abstract

Structured entities are commonly abstracted from sources such as XML, RDF, or hidden-web databases. Direct retrieval of various structured entities is in high demand in data lakes, e.g., given a JSON object, finding the XML entities that denote the same real-world object. Existing approaches to evaluating structured entity similarity place too much emphasis on structural inconsistency. Indeed, entities from heterogeneous sources can have very distinct structures, owing to differing information representation conventions. We argue that retrieval could be more tolerant of structural differences and focus more on the contents of the entities. In this paper, we first identify the unique challenge of parent-child (containment) relationships among structured entities, which unfortunately prevent the retrieval of the proper entities (parents or children are returned instead). To solve the problem, a novel hierarchy smooth function is proposed to combine the term scores in different nodes of a structured entity. Entities sharing the same structure, namely an entity family, are employed to learn the coefficient used in aggregating the scores, and thus to distinguish and prune the parent or child entities. Remarkably, the proposed method can cooperate with both the bag-of-words (BOW) and word embedding models, which have been successful in retrieving unstructured documents, for querying structured entities. Extensive experiments on real datasets demonstrate that our proposal is effective and efficient.
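
The abstract does not state the form of the hierarchy smooth function, so the following is only a hypothetical illustration of the general idea of combining per-node term scores across a tree-shaped entity. The `Node` class, the overlap-based `term_score`, the recursive blending rule, and the fixed coefficient `alpha` are all assumptions made for this sketch; in the paper the coefficient is learned from an entity family rather than set by hand.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a tree-shaped entity (e.g. an XML element or a JSON field)."""
    text: str = ""
    children: list[Node] = field(default_factory=list)


def term_score(query_terms: set[str], text: str) -> float:
    """Bag-of-words overlap: fraction of query terms found in the node's own text."""
    if not query_terms:
        return 0.0
    words = set(text.lower().split())
    return len(query_terms & words) / len(query_terms)


def smoothed_score(node: Node, query_terms: set[str], alpha: float) -> float:
    """Blend a node's own score with the mean smoothed score of its children.

    alpha is a hypothetical smoothing coefficient in [0, 1]:
    0 ignores the children entirely, 1 ignores the node's own text.
    """
    own = term_score(query_terms, node.text)
    if not node.children:
        return own
    child_mean = sum(
        smoothed_score(child, query_terms, alpha) for child in node.children
    ) / len(node.children)
    return (1 - alpha) * own + alpha * child_mean


# A small entity: a root node with two leaf fields.
movie = Node("movie", [Node("the matrix"), Node("1999")])
query = {"matrix", "1999"}
score = smoothed_score(movie, query, alpha=0.5)
```

With this blending rule, varying `alpha` shifts weight between an entity's own content and the content of its descendants, which is the kind of knob that could help separate a parent entity from its children when both match the query.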

Published in

Proceedings of the VLDB Endowment, Volume 13, Issue 6
February 2020
170 pages
ISSN: 2150-8097

Publisher

VLDB Endowment