research-article

Document allocation policies for selective searching of distributed indexes

Authors:
Anagha Kulkarni

Carnegie Mellon University, Pittsburgh, PA, USA

Carnegie Mellon University, Pittsburgh, PA, USA
View Profile

,
Jamie Callan

Carnegie Mellon University, Pittsburgh, PA, USA

Carnegie Mellon University, Pittsburgh, PA, USA
View Profile

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge managementOctober 2010Pages 449–458https://doi.org/10.1145/1871437.1871497

Published:26 October 2010Publication History

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management

Pages 449–458

ABSTRACT

Indexes for large collections are often divided into shards that are distributed across multiple computers and searched in parallel to provide rapid interactive search. Typically, all index shards are searched for each query. For organizations with modest computational resources the high query processing cost incurred in this exhaustive search setup can be a deterrent to working with large collections. This paper investigates document allocation policies that permit searching only a few shards for each query (selective search) without sacrificing search accuracy. Random, source-based and topic-based document-to-shard allocation policies are studied in the context of selective search.

A thorough study of the tradeoff between search cost and search accuracy in a sharded index environment is performed using three large TREC collections. The experimental results demonstrate that selective search using topic-based shards cuts the search cost to less than 1/5th of that of the exhaustive search without reducing search accuracy across all the three datasets. Stability analysis shows that 90% of the queries do as well or improve with selective search. An overlap-based evaluation with an additional 1000 queries for each dataset tests and confirms the conclusions drawn using the smaller TREC query sets.

References

J. Arguello, J. Callan, and F. Diaz. Classification-based resource selection. In Proceeding of the 18th ACM conference on Information and knowledge management, pages 1277--1286, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
J. Arguello, F. Diaz, J. Callan, and J.-F. Crespo. Sources of evidence for vertical selection. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 315--322, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
R. Baeza-Yates, V. Murdock, and C. Hauff. Efficiency trade-offs in two-tier web search systems. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 163--170, Boston, MA, USA, 2009. ACM. Google ScholarDigital Library
L. A. Barroso, J. Dean, and U. Hölzle. Web search for a planet: The google cluster architecture. IEEE Micro, 23(2):22--28, 2003. Google ScholarDigital Library
D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993--1022, January 2003. Google ScholarDigital Library
J. Callan. Distributed information retrieval. In Advances in Information Retrieval, pages 127--150. Kluwer Academic Publishers, 2000.Google Scholar
J. P. Callan, Z. Lu, and W. B. Croft. Searching distributed collections with inference networks. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 21--28, New York, NY, USA, 1995. ACM. Google ScholarDigital Library
C. Clarke, N. Craswell, and I. Soboroff. Overview of the TREC 2004 Terabyte track. In Proceedings of the 2004 Text Retrieval Conference, 2004.Google Scholar
J. Heaps. Information Retrieval - Computational and Theoretical Aspects. Academic Press Inc., New York, NY, 1978. Google ScholarDigital Library
L. S. Larkey, M. E. Connell, and J. Callan. Collection selection and results merging with topically organized U.S. patents and TREC data. In Proceedings of the conference on Information and knowledge management, pages 282--289, New York, NY, USA, 2000. ACM. Google ScholarDigital Library
J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281--297. University of California Press, 1967.Google Scholar
D. Metzler and W. B. Croft. Combining the language model and inference network approaches to retrieval. Information Processing and Management, 40(5):735--750, 2004. Google ScholarDigital Library
D. Metzler and W. B. Croft. A markov random field model for term dependencies. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 472--479, New York, NY, USA, 2005. ACM. Google ScholarDigital Library
P. Ogilvie and J. Callan. Experiments using the lemur toolkit. In Proceedings of the 2001 Text Retrieval Conference, pages 103--108, 2001.Google Scholar
D. Puppin, F. Silvestri, and D. Laforenza. Query-driven document partitioning and collection selection. In Proceedings of the 1st international conference on Scalable information systems, page 34, New York, NY, USA, 2006. ACM. Google ScholarDigital Library
D. Puppin, F. Silvestri, R. Perego, and R. Baeza-Yates. Load-balancing and caching for collection selection architectures. In Proceedings of the 2nd international conference on Scalable information systems, pages 1--10, Suzhou, China, 2007. ICST. Google ScholarDigital Library
K. M. Risvik, Y. Aasheim, and M. Lidal. Multi-tier architecture for web search engines. Web Congress, Latin American, 0:132, 2003. Google ScholarDigital Library
L. Si and J. Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 298--305, New York, NY, USA, 2003. ACM. Google ScholarDigital Library
P. Thomas and M. Shokouhi. Sushi: Scoring scaled samples for server selection. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 419--426, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
C. J. van Rijsbergen. Information Retrieval. Butterworths, 1979. Google ScholarDigital Library
J. Xu and W. B. Croft. Cluster-based language models for distributed retrieval. In Proceedings of the ACM SIGIR conference on Research and development in information retrieval, pages 254--261, New York, NY, USA, 1999. ACM. Google ScholarDigital Library
C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2):179--214, 2004. Google ScholarDigital Library

Index Terms

Document allocation policies for selective searching of distributed indexes
1. Information systems
  1. Information retrieval

Recommendations

Load-Balancing in Distributed Selective Search
SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval

Simulation and analysis have shown that selective search can reduce the cost of large-scale distributed information retrieval. By partitioning the collection into small topical shards, and then using a resource ranking algorithm to choose a subset of ...
Read More
Dynamic Shard Cutoff Prediction for Selective Search
SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval

Selective search architectures use resource selection algorithms such as Rank-S or Taily to rank index shards and determine how many to search for a given query. Most prior research evaluated solutions by their ability to improve efficiency without ...
Read More
Graph Searching and Interval Completion

In the early studies on graph searching a graph was considered as a system of tunnels in which a fast and clever fugitive is hidden. The "classical" search problem is to find a search plan using the minimal number of searchers. In this paper, we ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
October 2010
2036 pages
ISBN:9781450300995
DOI:10.1145/1871437
General Chair:
Jimmy Huang
York University, Canada
,
Program Chairs:
Nick Koudas
University of Toronto, Canada
,
Gareth Jones
Dublin City University, Ireland
,
Xindong Wu
University of Vermont, USA
,
Kevyn Collins-Thompson
Microsoft Research, USA
,
Aijun An
York University, Canada
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 26 October 2010
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
collection selection
distributed search
document allocation policies
search cost
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate1,861of8,427submissions,22%
Upcoming Conference
CIKM '24

Sponsor:

sigir

sigir

The 33rd ACM International Conference on Information and Knowledge Management

October 21 - 25, 2024

Boise , ID , USA
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 43
  Total Citations
  View Citations
- 305
  Total Downloads
- Downloads (Last 12 months)9
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Document allocation policies for selective searching of distributed indexes

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management

ABSTRACT

References

Cited By

Index Terms

Recommendations

Load-Balancing in Distributed Selective Search

Dynamic Shard Cutoff Prediction for Selective Search

Graph Searching and Interval Completion