ABSTRACT
Many existing indexes on text work at the document granularity and are not effective in answering the class of queries where the desired answer is only a term or a phrase. In this paper, we study some of the index structures that are capable of answering the class of queries referred to here as wild card queries and perform an analysis of their performance. Our experimental results on a large class of queries from different sources (including query logs and parse trees) and with various datasets reveal some of the performance barriers of these indexes. We then present Word Permuterm Index (WPI) which is an adaptation of the permuterm index for natural language text applications and show that this index supports a wide range of wild card queries, is quick to construct and is highly scalable. Our experimental resultS comparing WPI to alternative methods on a wide range oF wild card queries show a few orders of magnitude performancE improvements for WPI while the memory usage is kept the same for all compared systems.
- Altavista. http://www.altavista.com.Google Scholar
- Apache lucene. http://lucene.apache.org/java/2_3_2/queryparsersyntax.html.Google Scholar
- The aquaint corpus of english news text. http://www.ldc.upenn.edu/Catalog/docs/LDC2002T31/.Google Scholar
- Google. http://www.google.com.Google Scholar
- Indri - language modeling meets inference networks. http://www.lemurproject.org/indri/.Google Scholar
- Minipar home page. http://webdocs.cs.ualberta.ca/~lindek/minipar.htm.Google Scholar
- Openephyra - ephyra question answering system. http://mu.lti.cs.cmu.edu/trac/Ephyra/wiki/OpenEphyra.Google Scholar
- Yahoo! search - web search. http://search.yahoo.com.Google Scholar
- Oracle text, an oracle technical white paper, 2005. http://www.oracle.com/technology/products/text/pdf/10gR2text_twp_f.pdf.Google Scholar
- A. Andersson, N. J. Larsson, and K. Swanson. Suffix trees on words. Algorithmica, 23(3):246--260, 1999.Google ScholarCross Ref
- D. Bahle, H. E. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. In Proceedings of the 25th ACM SIGIR conference on Research and development in information retrieval, pages 215--221. ACM, 2002. Google ScholarDigital Library
- N. R. Brisaboa, A. Fariña, G. Navarro, and J. R. Paramá. Lightweight natural language text compression. Information Retrieval, 10(1):1--33, 2007. Google ScholarDigital Library
- M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.Google Scholar
- M. J. Cafarella and O. Etzioni. A search engine for natural language applications. In Proceedings of the 14th World Wide Web Conference, 2005. Google ScholarDigital Library
- M. J. Cafarella, C. Re, D. Suciu, and O. Etzioni. Structured querying of web text data: A technical challenge. In Proceedings of 3rd Biennial Conference on Innovative Data Systems Research (CIDR), 2007.Google Scholar
- S. Chakrabarti, K. Puniyani, and S. Das. Optimizing scoring functions and indexes for proximity search in type-annotated corpora. In Proceedings of 15th International World Wide Web Conference (WWW), pages 717--726, 2006. Google ScholarDigital Library
- S. Chaudhuri, K. Church, A. C. Konig, and L. Sui. Heavy-tailed distributions and multi-keyword queries. In Proceedings of the 30th ACM SIGIR conference on Research and development in information retrieval. ACM, 2007. Google ScholarDigital Library
- O. Etzioni, M. J. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in knowitall: (preliminary results). In Proceedings of the 13th World Wide Web Conference (WWW), 2004. Google ScholarDigital Library
- P. Ferragina and J. Fischer. Suffix arrays on words. In Combinatorial Pattern Matching, pages 328--339. Springer, 2007. Google ScholarDigital Library
- P. Ferragina and G. Manzini. Indexing compressed text. Journal of the ACM (JACM), 52(4):581, 2005. Google ScholarDigital Library
- P. Ferragina and R. Venturini. Compressed permuterm index. In Proceedings of the 30th ACM SIGIR conference on Research and development in information retrieval, pages 535--542. ACM, 2007. Google ScholarDigital Library
- E. Garfield. The permuterm subject index: An autobiographical review. Journal of the ACM, 27:288--291, 1976.Google Scholar
- R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In Proceedings of the 14th ACM-SIAM symposium on Discrete algorithms, pages 841--850. Society for Industrial and Applied Mathematics, 2003. Google ScholarDigital Library
- A. Maier and H. J. Novak. Db2's full-text search products - white paper, 2006.Google Scholar
- C. D. Manning, P. Raghavan, and H. Schutze. An introduction to information retrieval. Google ScholarDigital Library
- G. Navarro and V. Makinen. Compressed full-text indexes. ACM Computing Surveys (CSUR), 39(1):2, 2007. Google ScholarDigital Library
- G. Pass, A. Chowdhury, and C. Torgeson. A picture of search. In Proceedings of the 1st international conference on Scalable information systems. Citeseer, 2006. Google ScholarDigital Library
- D. Rafiei and H. Li. Data extraction from the web using wild card queries. In Proceedings of the 18th Conference on Information and Knowledge Management (CIKM), pages 1939--1942, 2009. Google ScholarDigital Library
- R. Raman, V. Raman, and S. S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proceedings of the 13th ACM-SIAM symposium on Discrete algorithms, pages 233--242. ACM, 2002. Google ScholarDigital Library
- H. E. Williams, J. Zobel, and D. Bahle. Fast phrase querying with combined indexes. ACM Transactions on Information Systems (TOIS), 22(4):573--594, 2004. Google ScholarDigital Library
Index Terms
- Index structures for efficiently searching natural language text
Recommendations
Word-based self-indexes for natural language text
The inverted index supports efficient full-text searches on natural language text collections. It requires some extra space over the compressed text that can be traded for search speed. It is usually fast for single-word searches, yet phrase searches ...
An Efficiently Updatable Index Scheme for Structured Documents
DEXA '98: Proceedings of the 9th International Workshop on Database and Expert Systems ApplicationsWe propose an efficiently updatable index scheme for XML documents. This index scheme consists of four types of indices. Content index manages occurrence positions of words, element names, attribute names and attribute values. Local structure index ...
Comments