skip to main content
10.1145/1871437.1871527acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

Index structures for efficiently searching natural language text

Published:26 October 2010Publication History

ABSTRACT

Many existing indexes on text work at the document granularity and are not effective in answering the class of queries where the desired answer is only a term or a phrase. In this paper, we study some of the index structures that are capable of answering the class of queries referred to here as wild card queries and perform an analysis of their performance. Our experimental results on a large class of queries from different sources (including query logs and parse trees) and with various datasets reveal some of the performance barriers of these indexes. We then present Word Permuterm Index (WPI) which is an adaptation of the permuterm index for natural language text applications and show that this index supports a wide range of wild card queries, is quick to construct and is highly scalable. Our experimental resultS comparing WPI to alternative methods on a wide range oF wild card queries show a few orders of magnitude performancE improvements for WPI while the memory usage is kept the same for all compared systems.

References

  1. Altavista. http://www.altavista.com.Google ScholarGoogle Scholar
  2. Apache lucene. http://lucene.apache.org/java/2_3_2/queryparsersyntax.html.Google ScholarGoogle Scholar
  3. The aquaint corpus of english news text. http://www.ldc.upenn.edu/Catalog/docs/LDC2002T31/.Google ScholarGoogle Scholar
  4. Google. http://www.google.com.Google ScholarGoogle Scholar
  5. Indri - language modeling meets inference networks. http://www.lemurproject.org/indri/.Google ScholarGoogle Scholar
  6. Minipar home page. http://webdocs.cs.ualberta.ca/~lindek/minipar.htm.Google ScholarGoogle Scholar
  7. Openephyra - ephyra question answering system. http://mu.lti.cs.cmu.edu/trac/Ephyra/wiki/OpenEphyra.Google ScholarGoogle Scholar
  8. Yahoo! search - web search. http://search.yahoo.com.Google ScholarGoogle Scholar
  9. Oracle text, an oracle technical white paper, 2005. http://www.oracle.com/technology/products/text/pdf/10gR2text_twp_f.pdf.Google ScholarGoogle Scholar
  10. A. Andersson, N. J. Larsson, and K. Swanson. Suffix trees on words. Algorithmica, 23(3):246--260, 1999.Google ScholarGoogle ScholarCross RefCross Ref
  11. D. Bahle, H. E. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. In Proceedings of the 25th ACM SIGIR conference on Research and development in information retrieval, pages 215--221. ACM, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. N. R. Brisaboa, A. Fariña, G. Navarro, and J. R. Paramá. Lightweight natural language text compression. Information Retrieval, 10(1):1--33, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.Google ScholarGoogle Scholar
  14. M. J. Cafarella and O. Etzioni. A search engine for natural language applications. In Proceedings of the 14th World Wide Web Conference, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. M. J. Cafarella, C. Re, D. Suciu, and O. Etzioni. Structured querying of web text data: A technical challenge. In Proceedings of 3rd Biennial Conference on Innovative Data Systems Research (CIDR), 2007.Google ScholarGoogle Scholar
  16. S. Chakrabarti, K. Puniyani, and S. Das. Optimizing scoring functions and indexes for proximity search in type-annotated corpora. In Proceedings of 15th International World Wide Web Conference (WWW), pages 717--726, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. S. Chaudhuri, K. Church, A. C. Konig, and L. Sui. Heavy-tailed distributions and multi-keyword queries. In Proceedings of the 30th ACM SIGIR conference on Research and development in information retrieval. ACM, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. O. Etzioni, M. J. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in knowitall: (preliminary results). In Proceedings of the 13th World Wide Web Conference (WWW), 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. P. Ferragina and J. Fischer. Suffix arrays on words. In Combinatorial Pattern Matching, pages 328--339. Springer, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. P. Ferragina and G. Manzini. Indexing compressed text. Journal of the ACM (JACM), 52(4):581, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. P. Ferragina and R. Venturini. Compressed permuterm index. In Proceedings of the 30th ACM SIGIR conference on Research and development in information retrieval, pages 535--542. ACM, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. E. Garfield. The permuterm subject index: An autobiographical review. Journal of the ACM, 27:288--291, 1976.Google ScholarGoogle Scholar
  23. R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In Proceedings of the 14th ACM-SIAM symposium on Discrete algorithms, pages 841--850. Society for Industrial and Applied Mathematics, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. A. Maier and H. J. Novak. Db2's full-text search products - white paper, 2006.Google ScholarGoogle Scholar
  25. C. D. Manning, P. Raghavan, and H. Schutze. An introduction to information retrieval. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. G. Navarro and V. Makinen. Compressed full-text indexes. ACM Computing Surveys (CSUR), 39(1):2, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. G. Pass, A. Chowdhury, and C. Torgeson. A picture of search. In Proceedings of the 1st international conference on Scalable information systems. Citeseer, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. D. Rafiei and H. Li. Data extraction from the web using wild card queries. In Proceedings of the 18th Conference on Information and Knowledge Management (CIKM), pages 1939--1942, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. R. Raman, V. Raman, and S. S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proceedings of the 13th ACM-SIAM symposium on Discrete algorithms, pages 233--242. ACM, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. H. E. Williams, J. Zobel, and D. Bahle. Fast phrase querying with combined indexes. ACM Transactions on Information Systems (TOIS), 22(4):573--594, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Index structures for efficiently searching natural language text

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
          October 2010
          2036 pages
          ISBN:9781450300995
          DOI:10.1145/1871437

          Copyright © 2010 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 26 October 2010

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article

          Acceptance Rates

          Overall Acceptance Rate1,861of8,427submissions,22%

          Upcoming Conference

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader