research-article

Effective and efficient retrieval of structured entities

Authors:
Ruihong Huang, Tsinghua University, Beijing, China
Shaoxu Song, Tsinghua University, Beijing, China
Yunsu Lee, Samsung Research, Seoul, South Korea
Jungho Park, Samsung Research, Seoul, South Korea
Soo-Hyung Kim, Samsung Research, Seoul, South Korea
Sungmin Yi, Samsung Research, Seoul, South Korea

Proceedings of the VLDB Endowment, Volume 13, Issue 6, pp. 826–839
https://doi.org/10.14778/3380750.3380754

Published: 01 February 2020

Abstract

Structured entities are commonly abstracted from sources such as XML, RDF, or hidden-web databases. Direct retrieval of heterogeneous structured entities is in high demand in data lakes, e.g., given a JSON object, finding the XML entities that denote the same real-world object. Existing approaches to evaluating structured entity similarity place too much emphasis on structural inconsistency. Indeed, entities from heterogeneous sources can have very distinct structures, owing to differing conventions for representing information. We argue that retrieval should be more tolerant of structural differences and focus more on the contents of the entities. In this paper, we first identify the unique challenge of parent-child (containment) relationships among structured entities, which unfortunately prevent the retrieval of the proper entities (parents or children are returned instead). To solve the problem, we propose a novel hierarchy smooth function that combines the term scores in different nodes of a structured entity. Entities sharing the same structure, namely an entity family, are used to learn the coefficient for aggregating the scores, and thus to distinguish and prune the parent or child entities. Notably, when querying structured entities, the proposed method works with both the bag-of-words (BOW) and word embedding models, which have been successful in retrieving unstructured documents. Extensive experiments on real datasets demonstrate that our proposal is effective and efficient.
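
To make the idea of combining term scores across the nodes of an entity more concrete, the following is a minimal illustrative sketch, not the paper's actual hierarchy smooth function, whose exact form is not given in the abstract. It assumes a simple geometric discount `alpha ** depth` (a hypothetical, fixed coefficient here, whereas the paper learns its coefficient from entity families), nested Python dicts and lists as stand-ins for JSON/XML entities, and plain cosine similarity over the resulting bag-of-words scores. All names and example entities are invented for illustration.

```python
import math
from collections import Counter

def term_scores(node, alpha=0.5, depth=0, scores=None):
    """Accumulate depth-discounted term counts over a nested entity.

    node:  a str (leaf text), a dict (named children) or a list (repeated children).
    alpha: hypothetical smoothing coefficient; deeper nodes contribute less.
    """
    if scores is None:
        scores = Counter()
    if isinstance(node, str):
        for term in node.lower().split():
            scores[term] += alpha ** depth
    elif isinstance(node, dict):
        for child in node.values():
            term_scores(child, alpha, depth + 1, scores)
    elif isinstance(node, list):
        for child in node:
            term_scores(child, alpha, depth + 1, scores)
    return scores

def cosine(a, b):
    """Cosine similarity between two term-score Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical entities: the same paper in two layouts, plus a parent
# (a proceedings volume) that contains the paper among other children.
paper = {"title": "keyword search", "authors": ["alice", "bob"]}
other_layout = {"meta": {"title": "keyword search"}, "author": "alice bob"}
proceedings = {
    "name": "vldb",
    "papers": [paper, {"title": "graph matching", "authors": ["carol"]}],
}

sim_same = cosine(term_scores(paper), term_scores(other_layout))
sim_parent = cosine(term_scores(paper), term_scores(proceedings))
print(sim_same > sim_parent)
```

In this toy setting, the structurally different copy of the paper scores higher than the containing proceedings entity, because the terms of the nested paper are discounted by depth and diluted by the parent's other content. A fixed `alpha` is only a placeholder; choosing it per entity family is the part the abstract attributes to learning.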

Index Terms

Effective and efficient retrieval of structured entities
1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information retrieval
    1. Document representation

Index terms have been assigned to the content through auto-classification.

Published in
Proceedings of the VLDB Endowment, Volume 13, Issue 6
February 2020
170 pages
ISSN: 2150-8097
Editors: Magdalena Balazinska (University of Washington), Xiaofang Zhou (University of Queensland, Australia)
Publisher: VLDB Endowment
Qualifiers: research-article
