ABSTRACT
We introduce static index pruning methods that significantly reduce the index size in information retrieval systems.We investigate uniform and term-based methods that each remove selected entries from the index and yet have only a minor effect on retrieval results. In uniform pruning, there is a fixed cutoff threshold, and all index entries whose contribution to relevance scores is bounded above by a given threshold are removed from the index. In term-based pruning, the cutoff threshold is determined for each term, and thus may vary from term to term. We give experimental evidence that for each level of compression, term-based pruning outperforms uniform pruning, under various measures of precision. We present theoretical and experimental evidence that under our term-based pruning scheme, it is possible to prune the index greatly and still get retrieval results that are almost as good as those based on the full index.
- 1.C.Buckley and A.F.Lewit.Optimization of inverted vector searches.In Proceedin s of the Ei hth International ACM-SIGIR Conference ,pages 97 -110, Montreal,Canada,June 1985. Google ScholarDigital Library
- 2.C.Buckley,A.Singhal,M.Mitra,and G.Salton.New retrieval approaches using SMART:TREC 4.In Proceedings of the Fourth Text REtrieval Conference (TREC-4),pages 25 -48,Gaithersberg,Maryland, November 1995.Google Scholar
- 3.S.Deerwester,S.Dumais,G.Furnas,T.Landauer, and R.Harshman.Indexing by latent semantic analysis.Journal of the American Society of Information Science ,41(6):391 -407,1990.Google ScholarCross Ref
- 4.R.Fagn,R.Kumar,andD.Svakumar.Top k orderings and near metrics.To appear,2001.Google Scholar
- 5.Pirate Search for Palm. http://www.alphaworks.ibm.com/tech/piratesearchGoogle Scholar
- 6.D.Harman and G.Candela.Retrieving records from a gigabyte of text on a minicomputer using statistical ranking.Journal of the American Society of Information Science ,41(8):581 -589,1990.Google ScholarCross Ref
- 7.M.Kendall and J.D.Gibbons.Rank correlation methods .Edward Arnold,London,5th edition,1990.Google Scholar
- 8.Y.Maarek and F.Smadja.Full text indexing based on lexical relations:An application:Software libraries.In Proceedin s of the Twelfth International ACM SIGIR Conference ,pages 198 -206,Cambridge,MA,June 1989. Google ScholarDigital Library
- 9.A.Mo .at and J.Zobel.Fast ranking n limited space. In Proceedin s of the 10th IEEE International Conference on Data Engineering,,pages 428 -4376, Houston,TX,February 1994. Google ScholarDigital Library
- 10.M.Persin.Document .ltering for fast ranking.In Proceedin s of the Seventeenth International ACM-SIGIR Conference ,pages 339 -348,Dublin, Ireland,July 1994. Google ScholarDigital Library
- 11.M.Persin,J.Zobel,and R.Sacks-Davis.Filtered document retrieval with frequency-sorted ndexes. Journal of the American Society of Information Science ,47(10):749 -764,October 1996. Google ScholarDigital Library
- 12.G.Salton and M.J.McGill.An Introduction to Modern Information Retrieval .McGraw-Hill,New York,1993. Google ScholarDigital Library
- 13.H.Turtle and J.Flood.Query evaluation:Strategies and optimizations.Information Processing and Management ,31(6):831 -850,1995. Google ScholarDigital Library
- 14.A.N.Vo and A.Mo .at.Compressed nverted .les with reduced decoding overheads.In Proceedin s of the 21st International ACM-SIGIR Conference ,pages 290 -297,Melbourne,Australia,August 1998. Google ScholarDigital Library
- 15.E.M.Voorhees and D.K.Harman.Overview of the Seventh Text REtrieval Conference (TREC-7).In Proceedin s of the Seventh Text REtrieval Conference (TREC-7).National Institute of Standards and Technology,http://trec.nist.gov/pubs.html 1999.Google Scholar
- 16.I.H.Witten,A.Mo .at,and T.C.Bell.Managing Gigabytes .Morgan Kaufman,San Francisco,1999.Google Scholar
Index Terms
- Static index pruning for information retrieval systems
Recommendations
A document-centric approach to static index pruning in text retrieval systems
CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge managementWe present a static index pruning method, to be used in ad-hoc document retrieval tasks, that follows a document-centric approach to decide whether a posting for a given term should remain in the index or not. The decision is made based on the term's ...
Probabilistic static pruning of inverted files
Information retrieval (IR) systems typically compress their indexes in order to increase their efficiency. Static pruning is a form of lossy data compression: it removes from the index, data that is estimated to be the least important to retrieval ...
A Fast Static Index Pruning Algorithm
ICCC '13: Proceedings of the Second International Conference on Innovative Computing and Cloud ComputingAs a query processing optimization technique over inverted index, static index pruning can significantly reduce index size and query processing time. A fast static index pruning algorithm is presented, which is a term-centric method and adopts BM25 ...
Comments