research-article

Index structures for efficiently searching natural language text

Authors:
Pirooz Chubak

University of Alberta, Edmonton, AB, Canada

University of Alberta, Edmonton, AB, Canada
View Profile

,
Davood Rafiei

University of Alberta, Edmonton, AB, Canada

University of Alberta, Edmonton, AB, Canada
View Profile

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge managementOctober 2010Pages 689–698https://doi.org/10.1145/1871437.1871527

Published:26 October 2010Publication History

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management

Pages 689–698

ABSTRACT

Many existing indexes on text work at the document granularity and are not effective in answering the class of queries where the desired answer is only a term or a phrase. In this paper, we study some of the index structures that are capable of answering the class of queries referred to here as wild card queries and perform an analysis of their performance. Our experimental results on a large class of queries from different sources (including query logs and parse trees) and with various datasets reveal some of the performance barriers of these indexes. We then present Word Permuterm Index (WPI) which is an adaptation of the permuterm index for natural language text applications and show that this index supports a wide range of wild card queries, is quick to construct and is highly scalable. Our experimental resultS comparing WPI to alternative methods on a wide range oF wild card queries show a few orders of magnitude performancE improvements for WPI while the memory usage is kept the same for all compared systems.

References

Altavista. http://www.altavista.com.Google Scholar
Apache lucene. http://lucene.apache.org/java/2_3_2/queryparsersyntax.html.Google Scholar
The aquaint corpus of english news text. http://www.ldc.upenn.edu/Catalog/docs/LDC2002T31/.Google Scholar
Google. http://www.google.com.Google Scholar
Indri - language modeling meets inference networks. http://www.lemurproject.org/indri/.Google Scholar
Minipar home page. http://webdocs.cs.ualberta.ca/~lindek/minipar.htm.Google Scholar
Openephyra - ephyra question answering system. http://mu.lti.cs.cmu.edu/trac/Ephyra/wiki/OpenEphyra.Google Scholar
Yahoo! search - web search. http://search.yahoo.com.Google Scholar
Oracle text, an oracle technical white paper, 2005. http://www.oracle.com/technology/products/text/pdf/10gR2text_twp_f.pdf.Google Scholar
A. Andersson, N. J. Larsson, and K. Swanson. Suffix trees on words. Algorithmica, 23(3):246--260, 1999.Google ScholarCross Ref
D. Bahle, H. E. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. In Proceedings of the 25th ACM SIGIR conference on Research and development in information retrieval, pages 215--221. ACM, 2002. Google ScholarDigital Library
N. R. Brisaboa, A. Fariña, G. Navarro, and J. R. Paramá. Lightweight natural language text compression. Information Retrieval, 10(1):1--33, 2007. Google ScholarDigital Library
M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.Google Scholar
M. J. Cafarella and O. Etzioni. A search engine for natural language applications. In Proceedings of the 14th World Wide Web Conference, 2005. Google ScholarDigital Library
M. J. Cafarella, C. Re, D. Suciu, and O. Etzioni. Structured querying of web text data: A technical challenge. In Proceedings of 3rd Biennial Conference on Innovative Data Systems Research (CIDR), 2007.Google Scholar
S. Chakrabarti, K. Puniyani, and S. Das. Optimizing scoring functions and indexes for proximity search in type-annotated corpora. In Proceedings of 15th International World Wide Web Conference (WWW), pages 717--726, 2006. Google ScholarDigital Library
S. Chaudhuri, K. Church, A. C. Konig, and L. Sui. Heavy-tailed distributions and multi-keyword queries. In Proceedings of the 30th ACM SIGIR conference on Research and development in information retrieval. ACM, 2007. Google ScholarDigital Library
O. Etzioni, M. J. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in knowitall: (preliminary results). In Proceedings of the 13th World Wide Web Conference (WWW), 2004. Google ScholarDigital Library
P. Ferragina and J. Fischer. Suffix arrays on words. In Combinatorial Pattern Matching, pages 328--339. Springer, 2007. Google ScholarDigital Library
P. Ferragina and G. Manzini. Indexing compressed text. Journal of the ACM (JACM), 52(4):581, 2005. Google ScholarDigital Library
P. Ferragina and R. Venturini. Compressed permuterm index. In Proceedings of the 30th ACM SIGIR conference on Research and development in information retrieval, pages 535--542. ACM, 2007. Google ScholarDigital Library
E. Garfield. The permuterm subject index: An autobiographical review. Journal of the ACM, 27:288--291, 1976.Google Scholar
R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In Proceedings of the 14th ACM-SIAM symposium on Discrete algorithms, pages 841--850. Society for Industrial and Applied Mathematics, 2003. Google ScholarDigital Library
A. Maier and H. J. Novak. Db2's full-text search products - white paper, 2006.Google Scholar
C. D. Manning, P. Raghavan, and H. Schutze. An introduction to information retrieval. Google ScholarDigital Library
G. Navarro and V. Makinen. Compressed full-text indexes. ACM Computing Surveys (CSUR), 39(1):2, 2007. Google ScholarDigital Library
G. Pass, A. Chowdhury, and C. Torgeson. A picture of search. In Proceedings of the 1st international conference on Scalable information systems. Citeseer, 2006. Google ScholarDigital Library
D. Rafiei and H. Li. Data extraction from the web using wild card queries. In Proceedings of the 18th Conference on Information and Knowledge Management (CIKM), pages 1939--1942, 2009. Google ScholarDigital Library
R. Raman, V. Raman, and S. S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proceedings of the 13th ACM-SIAM symposium on Discrete algorithms, pages 233--242. ACM, 2002. Google ScholarDigital Library
H. E. Williams, J. Zobel, and D. Bahle. Fast phrase querying with combined indexes. ACM Transactions on Information Systems (TOIS), 22(4):573--594, 2004. Google ScholarDigital Library

Index Terms

Index structures for efficiently searching natural language text

Recommendations

Indexing and querying natural language text
Read More
Word-based self-indexes for natural language text

The inverted index supports efficient full-text searches on natural language text collections. It requires some extra space over the compressed text that can be traded for search speed. It is usually fast for single-word searches, yet phrase searches ...
Read More
An Efficiently Updatable Index Scheme for Structured Documents
DEXA '98: Proceedings of the 9th International Workshop on Database and Expert Systems Applications

We propose an efficiently updatable index scheme for XML documents. This index scheme consists of four types of indices. Content index manages occurrence positions of words, element names, attribute names and attribute values. Local structure index ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
October 2010
2036 pages
ISBN:9781450300995
DOI:10.1145/1871437
General Chair:
Jimmy Huang
York University, Canada
,
Program Chairs:
Nick Koudas
University of Toronto, Canada
,
Gareth Jones
Dublin City University, Ireland
,
Xindong Wu
University of Vermont, USA
,
Kevyn Collins-Thompson
Microsoft Research, USA
,
Aijun An
York University, Canada
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 26 October 2010
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
indexing natural language text
querying performance
wild card queries
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate1,861of8,427submissions,22%
Upcoming Conference
CIKM '24

Sponsor:

sigir

sigir

The 33rd ACM International Conference on Information and Knowledge Management

October 21 - 25, 2024

Boise , ID , USA
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 3
  Total Citations
  View Citations
- 276
  Total Downloads
- Downloads (Last 12 months)2
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Index structures for efficiently searching natural language text

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management

ABSTRACT

References

Cited By

Index Terms

Recommendations

Indexing and querying natural language text

Word-based self-indexes for natural language text

An Efficiently Updatable Index Scheme for Structured Documents