research-article

Taily: shard selection using the tail of score distributions

Authors:
Robin Aly

University Twente, Enschede, Netherlands

University Twente, Enschede, Netherlands
View Profile

,
Djoerd Hiemstra

University Twente, Enschede, Netherlands

University Twente, Enschede, Netherlands
View Profile

,
Thomas Demeester

Ghent University, Ghent, Belgium

Ghent University, Ghent, Belgium
View Profile

SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrievalJuly 2013Pages 673–682https://doi.org/10.1145/2484028.2484033

Published:28 July 2013Publication History

SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

Pages 673–682

ABSTRACT

Search engines can improve their efficiency by selecting only few promising shards for each query. State-of-the-art shard selection algorithms first query a central index of sampled documents, and their effectiveness is similar to searching all shards. However, the search in the central index also hurts efficiency. Additionally, we show that the effectiveness of these approaches varies substantially with the sampled documents. This paper proposes Taily, a novel shard selection algorithm that models a query's score distribution in each shard as a Gamma distribution and selects shards with highly scored documents in the tail of the distribution. Taily estimates the parameters of score distributions based on the mean and variance of the score function's features in the collections and shards. Because Taily operates on term statistics instead of document samples, it is efficient and has deterministic effectiveness. Experiments on large web collections (Gov2, CluewebA and CluewebB) show that Taily achieves similar effectiveness to sample-based approaches, and improves upon their efficiency by roughly 20% in terms of used resources and response time.

References

A. Arampatzis and S. E. Robertson. Modeling score distributions in information retrieval. Information Retrieval, 14: 1--21, 2010. 10.1007/s10791-010-9145-5. Google ScholarDigital Library
A. Arampatzis, N. Nussbaum, and J. Kamps. Where to stop reading a ranked list? In TREC'08, 2008.Google Scholar
J. Arguello, J. Callan, and F. Diaz. Classification-based resource selection. In CIKM'09, pages 1277--1286. ACM, 2009. 10.1145/1645953.1646115. Google ScholarDigital Library
R. Baeza-Yates, C. Castillo, F. Junqueira, V. Plachouras, and F. Silvestri. Challenges on distributed web retrieval. In ICDE'07, pages 6--20, 2007. 10.1109/ICDE.2007.367846.Google ScholarCross Ref
R. Baeza-Yates, V. Murdock, and C. Hauff. Efficiency trade-offs in two-tier web search systems. In SIGIR'09, pages 163--170. ACM, 2009. 10.1145/1571941.1571971. Google ScholarDigital Library
J. P. Callan, Z. Lu, and W. B. Croft. Searching distributed collections with inference networks. In SIGIR '95, pages 21--28. ACM, 1995. 10.1145/215206.215328. Google ScholarDigital Library
W. S. Cooper. Some inconsistencies and misidentified modeling assumptions in probabilistic information retrieval. ACM Trans. Inf. Syst., 13 (1): 100--111, 1995. 10.1145/195705.195735. Google ScholarDigital Library
G. V. Cormack, M. D. Smucker, and C. L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. CoRR, abs/1004.5168, 2010.Google Scholar
S. Ganguly, W. Hasan, and R. Krishnamurthy. Query optimization for parallel execution. In SIGMOD'92, pages 9--18, NY, USA, 1992. ACM. 10.1145/130283.130291. Google ScholarDigital Library
L. Gravano and H. Garcia-Molina. Generalizing gloss to vector-space databases and broker hierarchies. In VLDB'95, pages 78--89. Morgan Kaufmann Publishers Inc., 1995. ISBN 1-55860-379-4. Google ScholarDigital Library
D. Hiemstra and C. Hauff. Mapreduce for information retrieval evaluation: "let's quickly test this on 12 tb of data". In Multilingual and Multimodal Information Access Evaluation. Springer Verlag, 2010. Google ScholarDigital Library
E. Kanoulas, K. Dai, V. Pavlu, and J. A. Aslam. Score distribution models: assumptions, intuition, and robustness to score manipulation. In SIGIR'10, pages 242--249. ACM, 2010. 10.1145/1835449.1835491. Google ScholarDigital Library
A. Kulkarni and J. Callan. Document allocation policies for selective searching of distributed indexes. In CIKM '10, pages 449--458. ACM, 2010. 10.1145/1871437.1871497. Google ScholarDigital Library
Kulkarni2012A. Kulkarni, A. S. Tigelaar, D. Hiemstra, and J. Callan. Shard ranking and cutoff estimation for topically partitioned collections. In CIKM'12. ACM, 2012. 10.1007/s10791-006-9014-4. Google ScholarDigital Library
I. Markov. Modeling document scores for distributed information retrieval. In SIGIR'11, pages 1321--1322. ACM, 2011. 10.1145/2009916.2010180. Google ScholarDigital Library
D. Metzler and W. B. Croft. A markov random field model for term dependencies. In SIGIR'05, pages 472--479. ACM, 2005. 10.1145/1076034.1076115. Google ScholarDigital Library
D. Metzler, T. Strohman, H. Turtle, and W. Croft. Indri at trec 2004: Terabyte track. In TREC 2004, 2004.Google Scholar
A. Moffat, W. Webber, J. Zobel, and R. Baeza-Yates. A pipelined architecture for distributed text query evaluation. In Information Retrieval, volume 10, pages 205--231. Kluwer Academic Publishers, 2007. 10.1007/s10791-006-9014-4. Google ScholarDigital Library
D. Nguyen and J. Callan. Combination of evidence for effective web search. In TREC 2010, 2010.Google Scholar
M. Shokouhi and L. Si. Federated search, volume 5. Now Publishers Inc, 2011. 10.1561/1500000010. Google ScholarDigital Library
L. Si and J. Callan. Relevant document distribution estimation method for resource selection. In SIGIR'03, pages 298--305. ACM, 2003. 10.1145/860435.860490. Google ScholarDigital Library
P. Thomas and M. Shokouhi. Sushi: scoring scaled samples for server selection. In SIGIR'09, pages 419--426. ACM, 2009. 10.1145/1571941.1572014. Google ScholarDigital Library
J. Xu and W. Croft. Cluster-based language models for distributed retrieval. In SIGIR'99, pages 254--261. ACM, 1999. Google ScholarDigital Library

Index Terms

Taily: shard selection using the tail of score distributions
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing

Recommendations

Where to stop reading a ranked list?: threshold optimization using truncated score distributions
SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

Ranked retrieval has a particular disadvantage in comparison with traditional Boolean retrieval: there is no clear cut-off point where to stop consulting results. This is a serious problem in some setups. We investigate and further develop methods to ...
Read More
Query-Biased Partitioning for Selective Search
CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management

Selective search is a cluster-based distributed retrieval architecture that reduces computational costs by partitioning a corpus into topical shards, and selectively searching them. Prior research formed topical shards by clustering the corpus based on ...
Read More
A signal-to-noise approach to score normalization
CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management

Score normalization is indispensable in distributed retrieval and fusion or meta-search where merging of result-lists is required. Distributional approaches to score normalization with reference to relevance, such as binary mixture models like the ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
July 2013
1188 pages
ISBN:9781450320344
DOI:10.1145/2484028
General Chairs:
Gareth J.F. Jones
Dublin City University, Ireland
,
Páraic Sheridan
Dublin City University, Ireland
,
Program Chairs:
Diane Kelly
University of North Carolina, Chapel Hill, USA
,
Maarten de Rijke
University of Amsterdam, The Netherlands
,
Tetsuya Sakai
Microsoft Research Asia, China
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 28 July 2013
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
database selection
distributed retrieval
Qualifiers
- research-article
Conference

Acceptance Rates
SIGIR '13 Paper Acceptance Rate73of366submissions,20%Overall Acceptance Rate792of3,983submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 35
  Total Citations
  View Citations
- 339
  Total Downloads
- Downloads (Last 12 months)10
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Taily: shard selection using the tail of score distributions

SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

ABSTRACT

References

Cited By

Index Terms

Recommendations

Where to stop reading a ranked list?: threshold optimization using truncated score distributions

Query-Biased Partitioning for Selective Search

A signal-to-noise approach to score normalization