skip to main content
research-article

Static index pruning in web search engines: Combining term and document popularities with query views

Published:06 March 2012Publication History
Skip Abstract Section

Abstract

Static index pruning techniques permanently remove a presumably redundant part of an inverted file, to reduce the file size and query processing time. These techniques differ in deciding which parts of an index can be removed safely; that is, without changing the top-ranked query results. As defined in the literature, the query view of a document is the set of query terms that access to this particular document, that is, retrieves this document among its top results. In this paper, we first propose using query views to improve the quality of the top results compared against the original results. We incorporate query views in a number of static pruning strategies, namely term-centric, document-centric, term popularity based and document access popularity based approaches, and show that the new strategies considerably outperform their counterparts especially for the higher levels of pruning and for both disjunctive and conjunctive query processing. Additionally, we combine the notions of term and document access popularity to form new pruning strategies, and further extend these strategies with the query views. The new strategies improve the result quality especially for the conjunctive query processing, which is the default and most common search mode of a search engine.

References

  1. Altingovde, I. S., Demir, E., Can, F., and Ulusoy, Ö. 2008. Incremental cluster-based retrieval using compressed cluster-skipping inverted files. ACM Trans. Info. Syst. 26, 3, 1--36. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Altingovde, I. S., Ozcan, R., and Ulusoy, Ö. 2009a. Exploiting query views for static index pruning in Web search engines. In Proceedings of the 18th ACM Conference on Information and Knowledge Management. 1951--1954. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Altingovde, I. S., Ozcan, R., and Ulusoy, Ö. 2009b. A practitioner's guide for static index pruning. In Proceedings of the 34th European Conference on IR Research. M. Boughanem, C. Berrut, J. Mothe, and C. Soulé-Dupuy, Eds., Lecture Notes in Computer Science, vol. 5478, Springer, 675--679. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Altingovde, I. S., Ozcan, R., and Ulusoy, Ö. 2011. Combining term and document popularities with query views for static index pruning. Tech. rep. BU-CE-1103, Computer Engineering Department, Bilkent University.Google ScholarGoogle Scholar
  5. Anh, V. N., de Kretser, O., and Moffat, A. 2001. Vector-space ranking with effective early termination. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 35--42. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Antonellis, I., Garcia-Molina, H., and Karim, J. 2009. Tagging with queries: how and why? In WSDM (Late Breaking-Results). R. A. Baeza-Yates, P. Boldi, B. A. Ribeiro-Neto, and B. B. Cambazoglu, Eds.Google ScholarGoogle Scholar
  7. Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V., Plachouras, V., and Silvestri, F. 2007. The impact of caching on search engines. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 183--190. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Blanco, R. 2008. Index compression for information retrieval systems. Ph.D. thesis, University of A Coruna, Spain.Google ScholarGoogle Scholar
  9. Blanco, R. and Barreiro, A. 2007a. Boosting static pruning of inverted files. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 777--778. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Blanco, R. and Barreiro, A. 2007b. Static pruning of terms in inverted files. In Proceedings of the 29th European Conference on IR Research. 64--75. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Büttcher, S. and Clarke, C. L. A. 2006. A document-centric approach to static index pruning in text retrieval systems. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management. 182--189. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Carmel, D., Cohen, D., Fagin, R., Farchi, E., Herscovici, M., Maarek, Y. S., and Soffer, A. 2001. Static index pruning for information retrieval systems. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 43--50. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Castellanos, M. 2003. Hotminer: Discovering hot topics from dirty text. In Survey of Text Mining, M. W. Berry, Ed., Springer-Verlag, 223--233.Google ScholarGoogle Scholar
  14. Clarke, C. L., Craswell, N., and Soboroff, I. 2010. Overview of the trec 2009 web track. In Proceedings of the 18th Text Retrieval Conference.Google ScholarGoogle Scholar
  15. de Moura, E. S., dos Santos, C. F., Araujo, B., Silva, A. S., Calado, P., and Nascimento, M. A. 2008. Locality-based pruning methods for Web search. ACM Trans. Inf. Syst. 26, 2, 1--28. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. de Moura, E. S., dos Santos, C. F., Fernandes, D. R., Silva, A. S., Calado, P., and Nascimento, M. A. 2005. Improving Web search efficiency via a locality based static pruning method. In Proceedings of the 14th International Conference on World Wide Web. 235--244. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Garcia, S. 2007. Search engine optimization using past queries. Ph.D. thesis, RMIT University, Melbourne, Australia.Google ScholarGoogle Scholar
  18. Garcia, S. and Turpin, A. 2006. Efficient query evaluation through access-reordering. In Proceedings of the Asia Information Retrieval Symposium. H. T. Ng, M.-K. Leong, M.-Y. Kan, and D. Ji, Eds., Lecture Notes in Computer Science, vol. 4182, Springer, 106--118. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Garcia, S., Williams, H. E., and Cannane, A. 2004. Access-ordered indexes. In Proceedings of the 27th Australasian Conference on Computer Science (ACSC). 7--14. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Moffat, A. and Zobel, J. 1996. Self-indexing inverted files for fast text retrieval. ACM Trans. Inf. Syst. 14, 4, 349--379. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Ntoulas, A. and Cho, J. 2007. Pruning policies for two-tiered inverted index with correctness guarantee. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 191--198. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Pass, G., Chowdhury, A., and Torgeson, C. 2006. A picture of search. In Proceedings of the 1st International Conference on Scalable Information Systems. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Persin, M., Zobel, J., and Sacks-Davis, R. 1996. Filtered document retrieval with frequency-sorted indexes. J. Amer. Soc. Inf. Sci. 47, 10, 749--764. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Poblete, B. and Baeza-Yates, R. 2008. Query-sets: using implicit feedback and query patterns to organize Web documents. In Proceeding of the 17th International Conference on World Wide Web. 41--50. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Puppin, D., Perego, R., Silvestri, F., and Baeza-Yates, R. 2010. Tuning the capacity of search engines: Load-driven routing and incremental caching to reduce and balance the load. ACM Trans. Inf. Syst. 28, 2. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Puppin, D., Silvestri, F., and Laforenza, D. 2006. Query-driven document partitioning and collection selection. In Proceedings of the 1st International Conference on Scalable Information Systems. 34. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Skobeltsyn, G., Junqueira, F., Plachouras, V., and Baeza-Yates, R. 2008. ResIn: a combination of results caching and index pruning for high-performance Web search engines. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 131--138. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Tonellotto, N., Macdonald, C., and Ounis, I. 2010. Efficient dynamic pruning with proximity support. In Proceedings of LSDS-IR Workshop at SIGIR. 31--35.Google ScholarGoogle Scholar
  29. Zobel, J. and Moffat, A. 2006. Inverted files for text search engines. ACM Comput. Sur. 38, 2, 6. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Static index pruning in web search engines: Combining term and document popularities with query views

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image ACM Transactions on Information Systems
          ACM Transactions on Information Systems  Volume 30, Issue 1
          February 2012
          193 pages
          ISSN:1046-8188
          EISSN:1558-2868
          DOI:10.1145/2094072
          Issue’s Table of Contents

          Copyright © 2012 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 6 March 2012
          • Accepted: 1 September 2011
          • Revised: 1 March 2011
          • Received: 1 January 2010
          Published in tois Volume 30, Issue 1

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article
          • Research
          • Refereed

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader