skip to main content
10.1145/383952.383958acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
Article

Static index pruning for information retrieval systems

Published:01 September 2001Publication History

ABSTRACT

We introduce static index pruning methods that significantly reduce the index size in information retrieval systems.We investigate uniform and term-based methods that each remove selected entries from the index and yet have only a minor effect on retrieval results. In uniform pruning, there is a fixed cutoff threshold, and all index entries whose contribution to relevance scores is bounded above by a given threshold are removed from the index. In term-based pruning, the cutoff threshold is determined for each term, and thus may vary from term to term. We give experimental evidence that for each level of compression, term-based pruning outperforms uniform pruning, under various measures of precision. We present theoretical and experimental evidence that under our term-based pruning scheme, it is possible to prune the index greatly and still get retrieval results that are almost as good as those based on the full index.

References

  1. 1.C.Buckley and A.F.Lewit.Optimization of inverted vector searches.In Proceedin s of the Ei hth International ACM-SIGIR Conference ,pages 97 -110, Montreal,Canada,June 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. 2.C.Buckley,A.Singhal,M.Mitra,and G.Salton.New retrieval approaches using SMART:TREC 4.In Proceedings of the Fourth Text REtrieval Conference (TREC-4),pages 25 -48,Gaithersberg,Maryland, November 1995.Google ScholarGoogle Scholar
  3. 3.S.Deerwester,S.Dumais,G.Furnas,T.Landauer, and R.Harshman.Indexing by latent semantic analysis.Journal of the American Society of Information Science ,41(6):391 -407,1990.Google ScholarGoogle ScholarCross RefCross Ref
  4. 4.R.Fagn,R.Kumar,andD.Svakumar.Top k orderings and near metrics.To appear,2001.Google ScholarGoogle Scholar
  5. 5.Pirate Search for Palm. http://www.alphaworks.ibm.com/tech/piratesearchGoogle ScholarGoogle Scholar
  6. 6.D.Harman and G.Candela.Retrieving records from a gigabyte of text on a minicomputer using statistical ranking.Journal of the American Society of Information Science ,41(8):581 -589,1990.Google ScholarGoogle ScholarCross RefCross Ref
  7. 7.M.Kendall and J.D.Gibbons.Rank correlation methods .Edward Arnold,London,5th edition,1990.Google ScholarGoogle Scholar
  8. 8.Y.Maarek and F.Smadja.Full text indexing based on lexical relations:An application:Software libraries.In Proceedin s of the Twelfth International ACM SIGIR Conference ,pages 198 -206,Cambridge,MA,June 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. 9.A.Mo .at and J.Zobel.Fast ranking n limited space. In Proceedin s of the 10th IEEE International Conference on Data Engineering,,pages 428 -4376, Houston,TX,February 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. 10.M.Persin.Document .ltering for fast ranking.In Proceedin s of the Seventeenth International ACM-SIGIR Conference ,pages 339 -348,Dublin, Ireland,July 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. 11.M.Persin,J.Zobel,and R.Sacks-Davis.Filtered document retrieval with frequency-sorted ndexes. Journal of the American Society of Information Science ,47(10):749 -764,October 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. 12.G.Salton and M.J.McGill.An Introduction to Modern Information Retrieval .McGraw-Hill,New York,1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. 13.H.Turtle and J.Flood.Query evaluation:Strategies and optimizations.Information Processing and Management ,31(6):831 -850,1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. 14.A.N.Vo and A.Mo .at.Compressed nverted .les with reduced decoding overheads.In Proceedin s of the 21st International ACM-SIGIR Conference ,pages 290 -297,Melbourne,Australia,August 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. 15.E.M.Voorhees and D.K.Harman.Overview of the Seventh Text REtrieval Conference (TREC-7).In Proceedin s of the Seventh Text REtrieval Conference (TREC-7).National Institute of Standards and Technology,http://trec.nist.gov/pubs.html 1999.Google ScholarGoogle Scholar
  16. 16.I.H.Witten,A.Mo .at,and T.C.Bell.Managing Gigabytes .Morgan Kaufman,San Francisco,1999.Google ScholarGoogle Scholar

Index Terms

  1. Static index pruning for information retrieval systems

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
            September 2001
            454 pages
            ISBN:1581133316
            DOI:10.1145/383952

            Copyright © 2001 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 1 September 2001

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • Article

            Acceptance Rates

            SIGIR '01 Paper Acceptance Rate47of201submissions,23%Overall Acceptance Rate792of3,983submissions,20%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader