ABSTRACT
Indexes for large collections are often divided into shards that are distributed across multiple computers and searched in parallel to provide rapid interactive search. Typically, all index shards are searched for each query. For organizations with modest computational resources the high query processing cost incurred in this exhaustive search setup can be a deterrent to working with large collections. This paper investigates document allocation policies that permit searching only a few shards for each query (selective search) without sacrificing search accuracy. Random, source-based and topic-based document-to-shard allocation policies are studied in the context of selective search.
A thorough study of the tradeoff between search cost and search accuracy in a sharded index environment is performed using three large TREC collections. The experimental results demonstrate that selective search using topic-based shards cuts the search cost to less than 1/5th of that of the exhaustive search without reducing search accuracy across all the three datasets. Stability analysis shows that 90% of the queries do as well or improve with selective search. An overlap-based evaluation with an additional 1000 queries for each dataset tests and confirms the conclusions drawn using the smaller TREC query sets.
- J. Arguello, J. Callan, and F. Diaz. Classification-based resource selection. In Proceeding of the 18th ACM conference on Information and knowledge management, pages 1277--1286, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
- J. Arguello, F. Diaz, J. Callan, and J.-F. Crespo. Sources of evidence for vertical selection. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 315--322, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
- R. Baeza-Yates, V. Murdock, and C. Hauff. Efficiency trade-offs in two-tier web search systems. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 163--170, Boston, MA, USA, 2009. ACM. Google ScholarDigital Library
- L. A. Barroso, J. Dean, and U. Hölzle. Web search for a planet: The google cluster architecture. IEEE Micro, 23(2):22--28, 2003. Google ScholarDigital Library
- D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993--1022, January 2003. Google ScholarDigital Library
- J. Callan. Distributed information retrieval. In Advances in Information Retrieval, pages 127--150. Kluwer Academic Publishers, 2000.Google Scholar
- J. P. Callan, Z. Lu, and W. B. Croft. Searching distributed collections with inference networks. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 21--28, New York, NY, USA, 1995. ACM. Google ScholarDigital Library
- C. Clarke, N. Craswell, and I. Soboroff. Overview of the TREC 2004 Terabyte track. In Proceedings of the 2004 Text Retrieval Conference, 2004.Google Scholar
- J. Heaps. Information Retrieval - Computational and Theoretical Aspects. Academic Press Inc., New York, NY, 1978. Google ScholarDigital Library
- L. S. Larkey, M. E. Connell, and J. Callan. Collection selection and results merging with topically organized U.S. patents and TREC data. In Proceedings of the conference on Information and knowledge management, pages 282--289, New York, NY, USA, 2000. ACM. Google ScholarDigital Library
- J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281--297. University of California Press, 1967.Google Scholar
- D. Metzler and W. B. Croft. Combining the language model and inference network approaches to retrieval. Information Processing and Management, 40(5):735--750, 2004. Google ScholarDigital Library
- D. Metzler and W. B. Croft. A markov random field model for term dependencies. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 472--479, New York, NY, USA, 2005. ACM. Google ScholarDigital Library
- P. Ogilvie and J. Callan. Experiments using the lemur toolkit. In Proceedings of the 2001 Text Retrieval Conference, pages 103--108, 2001.Google Scholar
- D. Puppin, F. Silvestri, and D. Laforenza. Query-driven document partitioning and collection selection. In Proceedings of the 1st international conference on Scalable information systems, page 34, New York, NY, USA, 2006. ACM. Google ScholarDigital Library
- D. Puppin, F. Silvestri, R. Perego, and R. Baeza-Yates. Load-balancing and caching for collection selection architectures. In Proceedings of the 2nd international conference on Scalable information systems, pages 1--10, Suzhou, China, 2007. ICST. Google ScholarDigital Library
- K. M. Risvik, Y. Aasheim, and M. Lidal. Multi-tier architecture for web search engines. Web Congress, Latin American, 0:132, 2003. Google ScholarDigital Library
- L. Si and J. Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 298--305, New York, NY, USA, 2003. ACM. Google ScholarDigital Library
- P. Thomas and M. Shokouhi. Sushi: Scoring scaled samples for server selection. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 419--426, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
- C. J. van Rijsbergen. Information Retrieval. Butterworths, 1979. Google ScholarDigital Library
- J. Xu and W. B. Croft. Cluster-based language models for distributed retrieval. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 254--261, New York, NY, USA, 1999. ACM. Google ScholarDigital Library
- C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2):179--214, 2004. Google ScholarDigital Library
Index Terms
- Document allocation policies for selective searching of distributed indexes
Recommendations
Load-Balancing in Distributed Selective Search
SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information RetrievalSimulation and analysis have shown that selective search can reduce the cost of large-scale distributed information retrieval. By partitioning the collection into small topical shards, and then using a resource ranking algorithm to choose a subset of ...
Dynamic Shard Cutoff Prediction for Selective Search
SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information RetrievalSelective search architectures use resource selection algorithms such as Rank-S or Taily to rank index shards and determine how many to search for a given query. Most prior research evaluated solutions by their ability to improve efficiency without ...
Graph Searching and Interval Completion
In the early studies on graph searching a graph was considered as a system of tunnels in which a fast and clever fugitive is hidden. The "classical" search problem is to find a search plan using the minimal number of searchers. In this paper, we ...
Comments