ABSTRACT
Search engines can improve their efficiency by selecting only few promising shards for each query. State-of-the-art shard selection algorithms first query a central index of sampled documents, and their effectiveness is similar to searching all shards. However, the search in the central index also hurts efficiency. Additionally, we show that the effectiveness of these approaches varies substantially with the sampled documents. This paper proposes Taily, a novel shard selection algorithm that models a query's score distribution in each shard as a Gamma distribution and selects shards with highly scored documents in the tail of the distribution. Taily estimates the parameters of score distributions based on the mean and variance of the score function's features in the collections and shards. Because Taily operates on term statistics instead of document samples, it is efficient and has deterministic effectiveness. Experiments on large web collections (Gov2, CluewebA and CluewebB) show that Taily achieves similar effectiveness to sample-based approaches, and improves upon their efficiency by roughly 20% in terms of used resources and response time.
- A. Arampatzis and S. E. Robertson. Modeling score distributions in information retrieval. Information Retrieval, 14: 1--21, 2010. 10.1007/s10791-010-9145-5. Google ScholarDigital Library
- A. Arampatzis, N. Nussbaum, and J. Kamps. Where to stop reading a ranked list? In TREC'08, 2008.Google Scholar
- J. Arguello, J. Callan, and F. Diaz. Classification-based resource selection. In CIKM'09, pages 1277--1286. ACM, 2009. 10.1145/1645953.1646115. Google ScholarDigital Library
- R. Baeza-Yates, C. Castillo, F. Junqueira, V. Plachouras, and F. Silvestri. Challenges on distributed web retrieval. In ICDE'07, pages 6--20, 2007. 10.1109/ICDE.2007.367846.Google ScholarCross Ref
- R. Baeza-Yates, V. Murdock, and C. Hauff. Efficiency trade-offs in two-tier web search systems. In SIGIR'09, pages 163--170. ACM, 2009. 10.1145/1571941.1571971. Google ScholarDigital Library
- J. P. Callan, Z. Lu, and W. B. Croft. Searching distributed collections with inference networks. In SIGIR '95, pages 21--28. ACM, 1995. 10.1145/215206.215328. Google ScholarDigital Library
- W. S. Cooper. Some inconsistencies and misidentified modeling assumptions in probabilistic information retrieval. ACM Trans. Inf. Syst., 13 (1): 100--111, 1995. 10.1145/195705.195735. Google ScholarDigital Library
- G. V. Cormack, M. D. Smucker, and C. L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. CoRR, abs/1004.5168, 2010.Google Scholar
- S. Ganguly, W. Hasan, and R. Krishnamurthy. Query optimization for parallel execution. In SIGMOD'92, pages 9--18, NY, USA, 1992. ACM. 10.1145/130283.130291. Google ScholarDigital Library
- L. Gravano and H. Garcia-Molina. Generalizing gloss to vector-space databases and broker hierarchies. In VLDB'95, pages 78--89. Morgan Kaufmann Publishers Inc., 1995. ISBN 1-55860-379-4. Google ScholarDigital Library
- D. Hiemstra and C. Hauff. Mapreduce for information retrieval evaluation: "let's quickly test this on 12 tb of data". In Multilingual and Multimodal Information Access Evaluation. Springer Verlag, 2010. Google ScholarDigital Library
- E. Kanoulas, K. Dai, V. Pavlu, and J. A. Aslam. Score distribution models: assumptions, intuition, and robustness to score manipulation. In SIGIR'10, pages 242--249. ACM, 2010. 10.1145/1835449.1835491. Google ScholarDigital Library
- A. Kulkarni and J. Callan. Document allocation policies for selective searching of distributed indexes. In CIKM '10, pages 449--458. ACM, 2010. 10.1145/1871437.1871497. Google ScholarDigital Library
- Kulkarni2012A. Kulkarni, A. S. Tigelaar, D. Hiemstra, and J. Callan. Shard ranking and cutoff estimation for topically partitioned collections. In CIKM'12. ACM, 2012. 10.1007/s10791-006-9014-4. Google ScholarDigital Library
- I. Markov. Modeling document scores for distributed information retrieval. In SIGIR'11, pages 1321--1322. ACM, 2011. 10.1145/2009916.2010180. Google ScholarDigital Library
- D. Metzler and W. B. Croft. A markov random field model for term dependencies. In SIGIR'05, pages 472--479. ACM, 2005. 10.1145/1076034.1076115. Google ScholarDigital Library
- D. Metzler, T. Strohman, H. Turtle, and W. Croft. Indri at trec 2004: Terabyte track. In TREC 2004, 2004.Google Scholar
- A. Moffat, W. Webber, J. Zobel, and R. Baeza-Yates. A pipelined architecture for distributed text query evaluation. In Information Retrieval, volume 10, pages 205--231. Kluwer Academic Publishers, 2007. 10.1007/s10791-006-9014-4. Google ScholarDigital Library
- D. Nguyen and J. Callan. Combination of evidence for effective web search. In TREC 2010, 2010.Google Scholar
- M. Shokouhi and L. Si. Federated search, volume 5. Now Publishers Inc, 2011. 10.1561/1500000010. Google ScholarDigital Library
- L. Si and J. Callan. Relevant document distribution estimation method for resource selection. In SIGIR'03, pages 298--305. ACM, 2003. 10.1145/860435.860490. Google ScholarDigital Library
- P. Thomas and M. Shokouhi. Sushi: scoring scaled samples for server selection. In SIGIR'09, pages 419--426. ACM, 2009. 10.1145/1571941.1572014. Google ScholarDigital Library
- J. Xu and W. Croft. Cluster-based language models for distributed retrieval. In SIGIR'99, pages 254--261. ACM, 1999. Google ScholarDigital Library
Index Terms
- Taily: shard selection using the tail of score distributions
Recommendations
Where to stop reading a ranked list?: threshold optimization using truncated score distributions
SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrievalRanked retrieval has a particular disadvantage in comparison with traditional Boolean retrieval: there is no clear cut-off point where to stop consulting results. This is a serious problem in some setups. We investigate and further develop methods to ...
Query-Biased Partitioning for Selective Search
CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge ManagementSelective search is a cluster-based distributed retrieval architecture that reduces computational costs by partitioning a corpus into topical shards, and selectively searching them. Prior research formed topical shards by clustering the corpus based on ...
A signal-to-noise approach to score normalization
CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge managementScore normalization is indispensable in distributed retrieval and fusion or meta-search where merging of result-lists is required. Distributional approaches to score normalization with reference to relevance, such as binary mixture models like the ...
Comments