Abstract
Static index pruning techniques permanently remove a presumably redundant part of an inverted file, to reduce the file size and query processing time. These techniques differ in deciding which parts of an index can be removed safely; that is, without changing the top-ranked query results. As defined in the literature, the query view of a document is the set of query terms that access to this particular document, that is, retrieves this document among its top results. In this paper, we first propose using query views to improve the quality of the top results compared against the original results. We incorporate query views in a number of static pruning strategies, namely term-centric, document-centric, term popularity based and document access popularity based approaches, and show that the new strategies considerably outperform their counterparts especially for the higher levels of pruning and for both disjunctive and conjunctive query processing. Additionally, we combine the notions of term and document access popularity to form new pruning strategies, and further extend these strategies with the query views. The new strategies improve the result quality especially for the conjunctive query processing, which is the default and most common search mode of a search engine.
- Altingovde, I. S., Demir, E., Can, F., and Ulusoy, Ö. 2008. Incremental cluster-based retrieval using compressed cluster-skipping inverted files. ACM Trans. Info. Syst. 26, 3, 1--36. Google ScholarDigital Library
- Altingovde, I. S., Ozcan, R., and Ulusoy, Ö. 2009a. Exploiting query views for static index pruning in Web search engines. In Proceedings of the 18th ACM Conference on Information and Knowledge Management. 1951--1954. Google ScholarDigital Library
- Altingovde, I. S., Ozcan, R., and Ulusoy, Ö. 2009b. A practitioner's guide for static index pruning. In Proceedings of the 34th European Conference on IR Research. M. Boughanem, C. Berrut, J. Mothe, and C. Soulé-Dupuy, Eds., Lecture Notes in Computer Science, vol. 5478, Springer, 675--679. Google ScholarDigital Library
- Altingovde, I. S., Ozcan, R., and Ulusoy, Ö. 2011. Combining term and document popularities with query views for static index pruning. Tech. rep. BU-CE-1103, Computer Engineering Department, Bilkent University.Google Scholar
- Anh, V. N., de Kretser, O., and Moffat, A. 2001. Vector-space ranking with effective early termination. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 35--42. Google ScholarDigital Library
- Antonellis, I., Garcia-Molina, H., and Karim, J. 2009. Tagging with queries: how and why? In WSDM (Late Breaking-Results). R. A. Baeza-Yates, P. Boldi, B. A. Ribeiro-Neto, and B. B. Cambazoglu, Eds.Google Scholar
- Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V., Plachouras, V., and Silvestri, F. 2007. The impact of caching on search engines. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 183--190. Google ScholarDigital Library
- Blanco, R. 2008. Index compression for information retrieval systems. Ph.D. thesis, University of A Coruna, Spain.Google Scholar
- Blanco, R. and Barreiro, A. 2007a. Boosting static pruning of inverted files. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 777--778. Google ScholarDigital Library
- Blanco, R. and Barreiro, A. 2007b. Static pruning of terms in inverted files. In Proceedings of the 29th European Conference on IR Research. 64--75. Google ScholarDigital Library
- Büttcher, S. and Clarke, C. L. A. 2006. A document-centric approach to static index pruning in text retrieval systems. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management. 182--189. Google ScholarDigital Library
- Carmel, D., Cohen, D., Fagin, R., Farchi, E., Herscovici, M., Maarek, Y. S., and Soffer, A. 2001. Static index pruning for information retrieval systems. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 43--50. Google ScholarDigital Library
- Castellanos, M. 2003. Hotminer: Discovering hot topics from dirty text. In Survey of Text Mining, M. W. Berry, Ed., Springer-Verlag, 223--233.Google Scholar
- Clarke, C. L., Craswell, N., and Soboroff, I. 2010. Overview of the trec 2009 web track. In Proceedings of the 18th Text Retrieval Conference.Google Scholar
- de Moura, E. S., dos Santos, C. F., Araujo, B., Silva, A. S., Calado, P., and Nascimento, M. A. 2008. Locality-based pruning methods for Web search. ACM Trans. Inf. Syst. 26, 2, 1--28. Google ScholarDigital Library
- de Moura, E. S., dos Santos, C. F., Fernandes, D. R., Silva, A. S., Calado, P., and Nascimento, M. A. 2005. Improving Web search efficiency via a locality based static pruning method. In Proceedings of the 14th International Conference on World Wide Web. 235--244. Google ScholarDigital Library
- Garcia, S. 2007. Search engine optimization using past queries. Ph.D. thesis, RMIT University, Melbourne, Australia.Google Scholar
- Garcia, S. and Turpin, A. 2006. Efficient query evaluation through access-reordering. In Proceedings of the Asia Information Retrieval Symposium. H. T. Ng, M.-K. Leong, M.-Y. Kan, and D. Ji, Eds., Lecture Notes in Computer Science, vol. 4182, Springer, 106--118. Google ScholarDigital Library
- Garcia, S., Williams, H. E., and Cannane, A. 2004. Access-ordered indexes. In Proceedings of the 27th Australasian Conference on Computer Science (ACSC). 7--14. Google ScholarDigital Library
- Moffat, A. and Zobel, J. 1996. Self-indexing inverted files for fast text retrieval. ACM Trans. Inf. Syst. 14, 4, 349--379. Google ScholarDigital Library
- Ntoulas, A. and Cho, J. 2007. Pruning policies for two-tiered inverted index with correctness guarantee. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 191--198. Google ScholarDigital Library
- Pass, G., Chowdhury, A., and Torgeson, C. 2006. A picture of search. In Proceedings of the 1st International Conference on Scalable Information Systems. Google ScholarDigital Library
- Persin, M., Zobel, J., and Sacks-Davis, R. 1996. Filtered document retrieval with frequency-sorted indexes. J. Amer. Soc. Inf. Sci. 47, 10, 749--764. Google ScholarDigital Library
- Poblete, B. and Baeza-Yates, R. 2008. Query-sets: using implicit feedback and query patterns to organize Web documents. In Proceeding of the 17th International Conference on World Wide Web. 41--50. Google ScholarDigital Library
- Puppin, D., Perego, R., Silvestri, F., and Baeza-Yates, R. 2010. Tuning the capacity of search engines: Load-driven routing and incremental caching to reduce and balance the load. ACM Trans. Inf. Syst. 28, 2. Google ScholarDigital Library
- Puppin, D., Silvestri, F., and Laforenza, D. 2006. Query-driven document partitioning and collection selection. In Proceedings of the 1st International Conference on Scalable Information Systems. 34. Google ScholarDigital Library
- Skobeltsyn, G., Junqueira, F., Plachouras, V., and Baeza-Yates, R. 2008. ResIn: a combination of results caching and index pruning for high-performance Web search engines. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 131--138. Google ScholarDigital Library
- Tonellotto, N., Macdonald, C., and Ounis, I. 2010. Efficient dynamic pruning with proximity support. In Proceedings of LSDS-IR Workshop at SIGIR. 31--35.Google Scholar
- Zobel, J. and Moffat, A. 2006. Inverted files for text search engines. ACM Comput. Sur. 38, 2, 6. Google ScholarDigital Library
Index Terms
- Static index pruning in web search engines: Combining term and document popularities with query views
Recommendations
Exploiting query views for static index pruning in web search engines
CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge managementWe propose incorporating query views in a number of static pruning strategies, namely term-centric, document-centric and access-based approaches. These query-view based strategies considerably outperform their counterparts for both disjunctive and ...
Cache-Based Query Processing for Search Engines
In practice, a search engine may fail to serve a query due to various reasons such as hardware/network failures, excessive query load, lack of matching documents, or service contract limitations (e.g., the query rate limits for third-party users of a ...
On Using Query Logs for Static Index Pruning
WI-IAT '10: Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01Static index pruning techniques aim at removing from the posting lists of an inverted file the references to documents which are likely to be not relevant for answering user queries. The reduction in the size of the index results in a better ...
Comments