Abstract
Structured entities are commonly abstracted from sources such as XML, RDF, or hidden-web databases. Direct retrieval of various structured entities is in high demand in data lakes, e.g., given a JSON object, finding the XML entities that denote the same real-world object. Existing approaches to evaluating structured entity similarity place too much emphasis on structural inconsistency. Indeed, entities from heterogeneous sources can have very distinct structures, owing to differing conventions for representing information. We argue that retrieval could be more tolerant of structural differences and focus more on the contents of the entities. In this paper, we first identify the unique challenge of parent-child (containment) relationships among structured entities, which unfortunately prevent the retrieval of the proper entities (parents or children are returned instead). To solve the problem, we propose a novel hierarchy smooth function that combines the term scores in different nodes of a structured entity. Entities sharing the same structure, namely an entity family, are employed to learn the coefficient used in aggregating the scores, and thus to distinguish and prune parent or child entities. Notably, the proposed method works with both the bag-of-words (BOW) and word embedding models, which have been successful in retrieving unstructured documents, for querying structured entities. Extensive experiments on real datasets demonstrate that our proposal is effective and efficient.
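The abstract does not give the form of the hierarchy smooth function, so the following is only an illustrative sketch of the general idea of combining per-node term scores across the levels of a structured entity. Everything here is an assumption made for illustration: the names `iter_nodes`, `node_score`, and `entity_score`, the geometric depth weight `alpha ** depth`, and the simple bag-of-words overlap score. In the paper the aggregation coefficient is learned from an entity family; here it is a fixed parameter.

```python
from typing import Any, Iterator


def iter_nodes(entity: Any, depth: int = 0) -> Iterator[tuple[int, str]]:
    """Yield (depth, text) for every leaf value of a nested dict/list entity."""
    if isinstance(entity, dict):
        for value in entity.values():
            yield from iter_nodes(value, depth + 1)
    elif isinstance(entity, list):
        for value in entity:
            yield from iter_nodes(value, depth + 1)
    else:
        yield depth, str(entity)


def node_score(query_terms: set[str], text: str) -> float:
    """Bag-of-words overlap: fraction of query terms found in this node."""
    if not query_terms:
        return 0.0
    tokens = set(text.lower().split())
    return len(query_terms & tokens) / len(query_terms)


def entity_score(query: str, entity: Any, alpha: float = 0.5) -> float:
    """Sum per-node scores, weighting a node at depth d by alpha ** d.

    The geometric weight is a hypothetical stand-in for the learned
    coefficient described in the abstract.
    """
    terms = set(query.lower().split())
    return sum(
        alpha ** depth * node_score(terms, text)
        for depth, text in iter_nodes(entity)
    )


# A JSON-like entity with one nested child entity.
paper = {"title": "graph search", "author": {"name": "alice"}}
score = entity_score("graph alice", paper, alpha=0.5)
```

With `alpha` below 1, terms matched deep inside a contained child contribute less than terms matched near the root, which is one plausible way to keep a parent entity from outranking the child that actually matches the query.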
Index Terms
- Effective and efficient retrieval of structured entities