skip to main content
10.1145/2484028.2484033acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
research-article

Taily: shard selection using the tail of score distributions

Published:28 July 2013Publication History

ABSTRACT

Search engines can improve their efficiency by selecting only few promising shards for each query. State-of-the-art shard selection algorithms first query a central index of sampled documents, and their effectiveness is similar to searching all shards. However, the search in the central index also hurts efficiency. Additionally, we show that the effectiveness of these approaches varies substantially with the sampled documents. This paper proposes Taily, a novel shard selection algorithm that models a query's score distribution in each shard as a Gamma distribution and selects shards with highly scored documents in the tail of the distribution. Taily estimates the parameters of score distributions based on the mean and variance of the score function's features in the collections and shards. Because Taily operates on term statistics instead of document samples, it is efficient and has deterministic effectiveness. Experiments on large web collections (Gov2, CluewebA and CluewebB) show that Taily achieves similar effectiveness to sample-based approaches, and improves upon their efficiency by roughly 20% in terms of used resources and response time.

References

  1. A. Arampatzis and S. E. Robertson. Modeling score distributions in information retrieval. Information Retrieval, 14: 1--21, 2010. 10.1007/s10791-010-9145-5. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. A. Arampatzis, N. Nussbaum, and J. Kamps. Where to stop reading a ranked list? In TREC'08, 2008.Google ScholarGoogle Scholar
  3. J. Arguello, J. Callan, and F. Diaz. Classification-based resource selection. In CIKM'09, pages 1277--1286. ACM, 2009. 10.1145/1645953.1646115. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. R. Baeza-Yates, C. Castillo, F. Junqueira, V. Plachouras, and F. Silvestri. Challenges on distributed web retrieval. In ICDE'07, pages 6--20, 2007. 10.1109/ICDE.2007.367846.Google ScholarGoogle ScholarCross RefCross Ref
  5. R. Baeza-Yates, V. Murdock, and C. Hauff. Efficiency trade-offs in two-tier web search systems. In SIGIR'09, pages 163--170. ACM, 2009. 10.1145/1571941.1571971. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. J. P. Callan, Z. Lu, and W. B. Croft. Searching distributed collections with inference networks. In SIGIR '95, pages 21--28. ACM, 1995. 10.1145/215206.215328. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. W. S. Cooper. Some inconsistencies and misidentified modeling assumptions in probabilistic information retrieval. ACM Trans. Inf. Syst., 13 (1): 100--111, 1995. 10.1145/195705.195735. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. G. V. Cormack, M. D. Smucker, and C. L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. CoRR, abs/1004.5168, 2010.Google ScholarGoogle Scholar
  9. S. Ganguly, W. Hasan, and R. Krishnamurthy. Query optimization for parallel execution. In SIGMOD'92, pages 9--18, NY, USA, 1992. ACM. 10.1145/130283.130291. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. L. Gravano and H. Garcia-Molina. Generalizing gloss to vector-space databases and broker hierarchies. In VLDB'95, pages 78--89. Morgan Kaufmann Publishers Inc., 1995. ISBN 1-55860-379-4. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. D. Hiemstra and C. Hauff. Mapreduce for information retrieval evaluation: "let's quickly test this on 12 tb of data". In Multilingual and Multimodal Information Access Evaluation. Springer Verlag, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. E. Kanoulas, K. Dai, V. Pavlu, and J. A. Aslam. Score distribution models: assumptions, intuition, and robustness to score manipulation. In SIGIR'10, pages 242--249. ACM, 2010. 10.1145/1835449.1835491. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. A. Kulkarni and J. Callan. Document allocation policies for selective searching of distributed indexes. In CIKM '10, pages 449--458. ACM, 2010. 10.1145/1871437.1871497. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Kulkarni2012A. Kulkarni, A. S. Tigelaar, D. Hiemstra, and J. Callan. Shard ranking and cutoff estimation for topically partitioned collections. In CIKM'12. ACM, 2012. 10.1007/s10791-006-9014-4. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. I. Markov. Modeling document scores for distributed information retrieval. In SIGIR'11, pages 1321--1322. ACM, 2011. 10.1145/2009916.2010180. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. D. Metzler and W. B. Croft. A markov random field model for term dependencies. In SIGIR'05, pages 472--479. ACM, 2005. 10.1145/1076034.1076115. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. D. Metzler, T. Strohman, H. Turtle, and W. Croft. Indri at trec 2004: Terabyte track. In TREC 2004, 2004.Google ScholarGoogle Scholar
  18. A. Moffat, W. Webber, J. Zobel, and R. Baeza-Yates. A pipelined architecture for distributed text query evaluation. In Information Retrieval, volume 10, pages 205--231. Kluwer Academic Publishers, 2007. 10.1007/s10791-006-9014-4. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. D. Nguyen and J. Callan. Combination of evidence for effective web search. In TREC 2010, 2010.Google ScholarGoogle Scholar
  20. M. Shokouhi and L. Si. Federated search, volume 5. Now Publishers Inc, 2011. 10.1561/1500000010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. L. Si and J. Callan. Relevant document distribution estimation method for resource selection. In SIGIR'03, pages 298--305. ACM, 2003. 10.1145/860435.860490. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. P. Thomas and M. Shokouhi. Sushi: scoring scaled samples for server selection. In SIGIR'09, pages 419--426. ACM, 2009. 10.1145/1571941.1572014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. J. Xu and W. Croft. Cluster-based language models for distributed retrieval. In SIGIR'99, pages 254--261. ACM, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Taily: shard selection using the tail of score distributions

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
      July 2013
      1188 pages
      ISBN:9781450320344
      DOI:10.1145/2484028

      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 28 July 2013

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      SIGIR '13 Paper Acceptance Rate73of366submissions,20%Overall Acceptance Rate792of3,983submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader