ABSTRACT
Selective search is a cluster-based distributed retrieval architecture that reduces computational costs by partitioning a corpus into topical shards, and selectively searching them. Prior research formed topical shards by clustering the corpus based on the documents' contents. This content-based partitioning strategy reveals common topics in a corpus. However, the topic distribution produced by clustering may not match the distribution of topics in search traffic, which may reduce the effectiveness of selective search.
This paper presents a query-biased partitioning strategy that aligns document partitions with topics from query logs. It focuses on two parts of the partitioning process: clustering initialization and document similarity calculation. A query-driven clustering initialization algorithm uses topics from query logs to form cluster seeds. A query-biased similarity metric favors terms that are important in query logs. Both methods boost retrieval effectiveness, reduce variance, and produce a more balanced distribution of shard sizes.
- I. S. Altingövde, E. Demir, F. Can, and Ö. Ulusoy. Incremental cluster-based retrieval using compressed cluster-skipping inverted files. ACM Transactions on Information Systems, 26(3), 2008. Google ScholarDigital Library
- R. Aly, T. Demeester, and D. Hiemstra. Taily: Shard selection using the tail of score distributions. In Proceedings of SIGIR, 2013. Google ScholarDigital Library
- Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137--1155, 2003. Google ScholarDigital Library
- M. E. Celebi, H. A. Kingravi, and P. A. Vela. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Systems with Applications, 40(1):200--210, 2013. Google ScholarDigital Library
- C. L. Clarke, N. Craswell, and I. Soboroff. Overview of the trec 2004 terabyte track. In TREC, volume 4, page 74, 2004.Google Scholar
- Z. Dai, Y. Kim, and J. Callan. How random decisions affect selective distributed search. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 771--774. ACM, 2015. Google ScholarDigital Library
- J. C. French, A. L. Powell, F. C. Gey, and N. Perelman. Exploiting a controlled vocabulary to improve collection selection and retrieval effectiveness. In Proceedings of the 2001 ACM CIKM International Conference on Information and Knowledge Management, pages 199--206, 2001. Google ScholarDigital Library
- Y. Kim, J. Callan, S. Culpepper, and A. Moffat. Load-balancing in distributed selective search. In Proceedings of the 39th International ACM SIGIR conference on on Research and Development in Information Retrieval, pages ??--?? ACM, 2016. Google ScholarDigital Library
- A. Kulkarni. Efficient and Effective Large-scale Search. PhD thesis, Carnegie Mellon University, 2013.Google Scholar
- A. Kulkarni and J. Callan. Document allocation policies for selective searching of distributed indexes. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM '10), pages 449--458. ACM, 2010. Google ScholarDigital Library
- A. Kulkarni and J. Callan. Selective search: Efficient and effective search of large textual collections. ACM Transactions on Information Systems (TOIS), 33(4):17, 2015. Google ScholarDigital Library
- A. Kulkarni, A. Tigelaar, J. Callan, and D. Hiemstra. Shard ranking and cutoff estimation for topically partitioned collections. In Proceedings of the 21st ACM Conference on Information and Knowledge Management (CIKM'12), 2012. Google ScholarDigital Library
- D. Metzler and W. B. Croft. A markov random field model for term dependencies. In Proceedings of the 28th International ACM SIGIR conference on Research and development in information retrieval, pages 472--479. ACM, 2005. Google ScholarDigital Library
- T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.Google Scholar
- T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111--3119, 2013. Google ScholarDigital Library
- B. Poblete and R. Baeza-Yates. Query-sets: using implicit feedback and query patterns to organize web documents. In Proceedings of the 17th International Conference on World Wide Web, pages 41--50. ACM, 2008. Google ScholarDigital Library
- D. Puppin, F. Silvestri, and D. Laforenza. Query-driven document partitioning and collection selection. In Proceedings of the 1st international conference on Scalable information systems, page 34. ACM, 2006. Google ScholarDigital Library
- L. Si and J. Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 298--305. ACM, 2003. Google ScholarDigital Library
- P. Thomas and M. Shokouhi. Sushi: scoring scaled samples for server selection. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 419--426. ACM, 2009. Google ScholarDigital Library
- K. Wagstaff, C. Cardie, S. Rogers, S. Schrödl, et al. Constrained k-means clustering with background knowledge. In Proceedings of the 18th International Conference on Machine Learning (ICML'01), volume 1, pages 577--584, 2001. Google ScholarDigital Library
- E. P. Xing, M. I. Jordan, S. Russell, and A. Y. Ng. Distance metric learning with application to clustering with side-information. In Advances in neural information processing systems, pages 505--512, 2002. Google ScholarDigital Library
- C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS), 22(2):179--214, 2004. Google ScholarDigital Library
Index Terms
- Query-Biased Partitioning for Selective Search
Recommendations
Load-Balancing in Distributed Selective Search
SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information RetrievalSimulation and analysis have shown that selective search can reduce the cost of large-scale distributed information retrieval. By partitioning the collection into small topical shards, and then using a resource ranking algorithm to choose a subset of ...
Dynamic Shard Cutoff Prediction for Selective Search
SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information RetrievalSelective search architectures use resource selection algorithms such as Rank-S or Taily to rank index shards and determine how many to search for a given query. Most prior research evaluated solutions by their ability to improve efficiency without ...
Selective Search: Efficient and Effective Search of Large Textual Collections
The traditional search solution for large collections divides the collection into subsets (shards), and processes the query against all shards in parallel (exhaustive search). The search cost and the computational requirements of this approach are often ...
Comments