skip to main content
10.1145/1871437.1871497acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

Document allocation policies for selective searching of distributed indexes

Published:26 October 2010Publication History

ABSTRACT

Indexes for large collections are often divided into shards that are distributed across multiple computers and searched in parallel to provide rapid interactive search. Typically, all index shards are searched for each query. For organizations with modest computational resources the high query processing cost incurred in this exhaustive search setup can be a deterrent to working with large collections. This paper investigates document allocation policies that permit searching only a few shards for each query (selective search) without sacrificing search accuracy. Random, source-based and topic-based document-to-shard allocation policies are studied in the context of selective search.

A thorough study of the tradeoff between search cost and search accuracy in a sharded index environment is performed using three large TREC collections. The experimental results demonstrate that selective search using topic-based shards cuts the search cost to less than 1/5th of that of the exhaustive search without reducing search accuracy across all the three datasets. Stability analysis shows that 90% of the queries do as well or improve with selective search. An overlap-based evaluation with an additional 1000 queries for each dataset tests and confirms the conclusions drawn using the smaller TREC query sets.

References

  1. J. Arguello, J. Callan, and F. Diaz. Classification-based resource selection. In Proceeding of the 18th ACM conference on Information and knowledge management, pages 1277--1286, New York, NY, USA, 2009. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. J. Arguello, F. Diaz, J. Callan, and J.-F. Crespo. Sources of evidence for vertical selection. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 315--322, New York, NY, USA, 2009. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R. Baeza-Yates, V. Murdock, and C. Hauff. Efficiency trade-offs in two-tier web search systems. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 163--170, Boston, MA, USA, 2009. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. L. A. Barroso, J. Dean, and U. Hölzle. Web search for a planet: The google cluster architecture. IEEE Micro, 23(2):22--28, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993--1022, January 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. J. Callan. Distributed information retrieval. In Advances in Information Retrieval, pages 127--150. Kluwer Academic Publishers, 2000.Google ScholarGoogle Scholar
  7. J. P. Callan, Z. Lu, and W. B. Croft. Searching distributed collections with inference networks. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 21--28, New York, NY, USA, 1995. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. C. Clarke, N. Craswell, and I. Soboroff. Overview of the TREC 2004 Terabyte track. In Proceedings of the 2004 Text Retrieval Conference, 2004.Google ScholarGoogle Scholar
  9. J. Heaps. Information Retrieval - Computational and Theoretical Aspects. Academic Press Inc., New York, NY, 1978. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. L. S. Larkey, M. E. Connell, and J. Callan. Collection selection and results merging with topically organized U.S. patents and TREC data. In Proceedings of the conference on Information and knowledge management, pages 282--289, New York, NY, USA, 2000. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281--297. University of California Press, 1967.Google ScholarGoogle Scholar
  12. D. Metzler and W. B. Croft. Combining the language model and inference network approaches to retrieval. Information Processing and Management, 40(5):735--750, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. D. Metzler and W. B. Croft. A markov random field model for term dependencies. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 472--479, New York, NY, USA, 2005. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. P. Ogilvie and J. Callan. Experiments using the lemur toolkit. In Proceedings of the 2001 Text Retrieval Conference, pages 103--108, 2001.Google ScholarGoogle Scholar
  15. D. Puppin, F. Silvestri, and D. Laforenza. Query-driven document partitioning and collection selection. In Proceedings of the 1st international conference on Scalable information systems, page 34, New York, NY, USA, 2006. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. D. Puppin, F. Silvestri, R. Perego, and R. Baeza-Yates. Load-balancing and caching for collection selection architectures. In Proceedings of the 2nd international conference on Scalable information systems, pages 1--10, Suzhou, China, 2007. ICST. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. K. M. Risvik, Y. Aasheim, and M. Lidal. Multi-tier architecture for web search engines. Web Congress, Latin American, 0:132, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. L. Si and J. Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 298--305, New York, NY, USA, 2003. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. P. Thomas and M. Shokouhi. Sushi: Scoring scaled samples for server selection. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 419--426, New York, NY, USA, 2009. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. C. J. van Rijsbergen. Information Retrieval. Butterworths, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J. Xu and W. B. Croft. Cluster-based language models for distributed retrieval. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 254--261, New York, NY, USA, 1999. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2):179--214, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Document allocation policies for selective searching of distributed indexes

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
      October 2010
      2036 pages
      ISBN:9781450300995
      DOI:10.1145/1871437

      Copyright © 2010 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 26 October 2010

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate1,861of8,427submissions,22%

      Upcoming Conference

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader