Abstract
Natural language text corpora are often available as sets of syntactically parsed trees. A wide range of expressive tree queries are possible over such parsed trees that open a new avenue in searching over natural language text. They not only allow for querying roles and relationships within sentences, but also improve search effectiveness compared to flat keyword queries. One major drawback of current systems supporting querying over parsed text is the performance of evaluating queries over large data. In this paper we propose a novel indexing scheme over unique subtrees as index keys. We also propose a novel root-split coding scheme that stores subtree structural information only partially, thus reducing index size and improving querying performance. Our extensive set of experiments show that root-split coding reduces the index size of any interval coding which stores individual node numbers by a factor of 50% to 80%, depending on the sizes of subtrees indexed. Moreover, We show that our index using root-split coding, outperforms previous approaches by at least an order of magnitude in terms of the response time of queries.
- S. Al-Khalifa, H. Jagadish, N. Koudas, J. Patel, D. Srivastava, and Y. Wu. Structural Joins: A Primitive for Efficient XML Query Pattern Matching. In ICDE, pages 141--152, 2002. Google ScholarDigital Library
- The aquaint corpus of english news text, 2002. http://www.ldc.upenn.edu/Catalog/docs/LDC2002T31/.Google Scholar
- M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open Information Extraction from the Web. In IJCAI, pages 2670--2676, 2007. Google ScholarDigital Library
- S. Bird, Y. Chen, S. Davidson, H. Lee, and Y. Zheng. Designing and Evaluating an XPath Dialect for Linguistic Queries. In ICDE, pages 52--52, 2006. Google ScholarDigital Library
- N. Bruno, N. Koudas, and D. Srivastava. Holistic Twig Joins: Optimal XML Pattern Matching. In SIGMOD, pages 310--321, 2002. Google ScholarDigital Library
- J. Christensen, Mausam, S. Soderland, and E. Etzioni. Semantic Role Labeling for Open Information Extraction. In NAACL, pages 52--60, 2010. Google ScholarDigital Library
- P. Chubak. Indexing and Querying Natural Language Text. PhD thesis, University of Alberta, 2012.Google Scholar
- G. Gou and R. Chirkova. Efficiently Querying Large XML Data Repositories: A Survey. Transactions on Knowledge and Data Engineering, 19(10):1381--1403, 2007. Google ScholarDigital Library
- D. Klein and C. Manning. Accurate Unlexicalized Parsing. In ACL, pages 423--430, 2003. Google ScholarDigital Library
- C. Kwok, O. Etzioni, and D. Weld. Scaling Question Answering to the Web. Transactions on Information Systems, 19(3):242--262, 2001. Google ScholarDigital Library
- D. Moldovan, S. Harabagiu, R. Girju, P. Morarescu, F. Lacatusu, A. Novischi, A. Badulescu, and O. Bolohan. LCC Tools for Question Answering. In TREC, pages 388--397, 2002.Google Scholar
- G. Pass, A. Chowdhury, and C. Torgeson. A Picture of Search. In INFOSCALE, page 1, 2006. Google ScholarDigital Library
- B. Randall. CorpusSearch Users Manual. University of Pennsylvania, 2000. http://www.ling.upenn.edu/~dringe/CorpStuff/Manual/Contents.html.Google Scholar
- D. Rohde. TGrep2 User Manual version 1.15. Massachusetts Institute of Technology. http://tedlab.mit.edu/dr/Tgrep2, 2005.Google Scholar
- D. Shasha, J. Wang, and R. Giugno. Algorithmics and Applications of Tree and Graph Searching. In PODS, pages 39--52, 2002. Google ScholarDigital Library
- D. Shasha, J. Wang, H. Shan, and K. Zhang. Atreegrep: Approximate Searching in Unordered Trees. In SSDBM, pages 89--98, 2002. Google ScholarDigital Library
- D. Williams, J. Huan, and W. Wang. Graph Database Indexing Using Structured Graph Decomposition. In ICDE, pages 976--985, 2007.Google ScholarCross Ref
- F. Wu and D. Weld. Open Information Extraction Using Wikipedia. In ACL, pages 118--127, 2010. Google ScholarDigital Library
- X. Yan, P. Yu, and J. Han. Graph Indexing: A Frequent Structure-Based Approach. In SIGMOD, pages 335--346, 2004. Google ScholarDigital Library
- M. Yoshikawa, T. Amagasa, T. Shimura, and S. Uemura. XRel: a Path-Based Approach to Storage and Retrieval of XML Documents Using Relational Databases. Transactions on Internet Technology, 1(1):110--141, 2001. Google ScholarDigital Library
- M. Yue. A Simple Proof of the Inequality FFD(L) ≤ 11/9 OPT(L)+1, ∀ L for the FFD Bin-Packing Algorithm. Acta Mathematicae Applicatae Sinica (English Series), 7(4):321--331, 1991.Google Scholar
- C. Zhang, J. Naughton, D. DeWitt, Q. Luo, and G. Lohman. On Supporting Containment Queries in Relational Database Management Systems. In SIGMOD, pages 425--436, 2001. Google ScholarDigital Library
- S. Zhang, M. Hu, and J. Yang. TreePi: A Novel Graph Indexing Method. In ICDE, pages 966--975, 2007.Google ScholarCross Ref
Recommendations
A description language for syntactically annotated corpora
COLING '00: Proceedings of the 18th conference on Computational linguistics - Volume 2This paper introduces a description language for syntactically annotated corpora which allows for encoding both the syntactic annotation to a corpus and the queries to a syntactically annotated corpus.In terms of descriptive adequacy and computational ...
Syntactically Lexicalized Phrase-Based SMT
Until quite recently, extending phrase-based statistical machine translation (PBSMT) with syntactic knowledge caused system performance to deteriorate. The most recent successful enrichments of PBSMT with hierarchical structure either employ ...
System for querying syntactically annotated corpora
ACLDemos '09: Proceedings of the ACL-IJCNLP 2009 Software DemonstrationsThis paper presents a system for querying treebanks. The system consists of a powerful query language with natural support for cross-layer queries, a client interface with a graphical query builder and visualizer of the results, a command-line client ...
Comments