Query-Biased Partitioning for Selective Search

Authors:
Zhuyun Dai

Carnegie Mellon University, Pittsburgh, PA, USA

Carnegie Mellon University, Pittsburgh, PA, USA
View Profile

,
Chenyan Xiong

Carnegie Mellon University, Pittsburgh, PA, USA

Carnegie Mellon University, Pittsburgh, PA, USA
View Profile

,
Jamie Callan

Carnegie Mellon University, Pittsburgh, PA, USA

Carnegie Mellon University, Pittsburgh, PA, USA
View Profile

CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge ManagementOctober 2016Pages 1119–1128https://doi.org/10.1145/2983323.2983706

Published:24 October 2016Publication History

CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management

Pages 1119–1128

ABSTRACT

Selective search is a cluster-based distributed retrieval architecture that reduces computational costs by partitioning a corpus into topical shards, and selectively searching them. Prior research formed topical shards by clustering the corpus based on the documents' contents. This content-based partitioning strategy reveals common topics in a corpus. However, the topic distribution produced by clustering may not match the distribution of topics in search traffic, which may reduce the effectiveness of selective search.

This paper presents a query-biased partitioning strategy that aligns document partitions with topics from query logs. It focuses on two parts of the partitioning process: clustering initialization and document similarity calculation. A query-driven clustering initialization algorithm uses topics from query logs to form cluster seeds. A query-biased similarity metric favors terms that are important in query logs. Both methods boost retrieval effectiveness, reduce variance, and produce a more balanced distribution of shard sizes.

References

I. S. Altingövde, E. Demir, F. Can, and Ö. Ulusoy. Incremental cluster-based retrieval using compressed cluster-skipping inverted files. ACM Transactions on Information Systems, 26(3), 2008. Google ScholarDigital Library
R. Aly, T. Demeester, and D. Hiemstra. Taily: Shard selection using the tail of score distributions. In Proceedings of SIGIR, 2013. Google ScholarDigital Library
Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137--1155, 2003. Google ScholarDigital Library
M. E. Celebi, H. A. Kingravi, and P. A. Vela. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Systems with Applications, 40(1):200--210, 2013. Google ScholarDigital Library
C. L. Clarke, N. Craswell, and I. Soboroff. Overview of the trec 2004 terabyte track. In TREC, volume 4, page 74, 2004.Google Scholar
Z. Dai, Y. Kim, and J. Callan. How random decisions affect selective distributed search. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 771--774. ACM, 2015. Google ScholarDigital Library
J. C. French, A. L. Powell, F. C. Gey, and N. Perelman. Exploiting a controlled vocabulary to improve collection selection and retrieval effectiveness. In Proceedings of the 2001 ACM CIKM International Conference on Information and Knowledge Management, pages 199--206, 2001. Google ScholarDigital Library
Y. Kim, J. Callan, S. Culpepper, and A. Moffat. Load-balancing in distributed selective search. In Proceedings of the 39th International ACM SIGIR conference on on Research and Development in Information Retrieval, pages ??--?? ACM, 2016. Google ScholarDigital Library
A. Kulkarni. Efficient and Effective Large-scale Search. PhD thesis, Carnegie Mellon University, 2013.Google Scholar
A. Kulkarni and J. Callan. Document allocation policies for selective searching of distributed indexes. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM '10), pages 449--458. ACM, 2010. Google ScholarDigital Library
A. Kulkarni and J. Callan. Selective search: Efficient and effective search of large textual collections. ACM Transactions on Information Systems (TOIS), 33(4):17, 2015. Google ScholarDigital Library
A. Kulkarni, A. Tigelaar, J. Callan, and D. Hiemstra. Shard ranking and cutoff estimation for topically partitioned collections. In Proceedings of the 21st ACM Conference on Information and Knowledge Management (CIKM'12), 2012. Google ScholarDigital Library
D. Metzler and W. B. Croft. A markov random field model for term dependencies. In Proceedings of the 28th International ACM SIGIR conference on Research and development in information retrieval, pages 472--479. ACM, 2005. Google ScholarDigital Library
T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.Google Scholar
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111--3119, 2013. Google ScholarDigital Library
B. Poblete and R. Baeza-Yates. Query-sets: using implicit feedback and query patterns to organize web documents. In Proceedings of the 17th International Conference on World Wide Web, pages 41--50. ACM, 2008. Google ScholarDigital Library
D. Puppin, F. Silvestri, and D. Laforenza. Query-driven document partitioning and collection selection. In Proceedings of the 1st international conference on Scalable information systems, page 34. ACM, 2006. Google ScholarDigital Library
L. Si and J. Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 298--305. ACM, 2003. Google ScholarDigital Library
P. Thomas and M. Shokouhi. Sushi: scoring scaled samples for server selection. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 419--426. ACM, 2009. Google ScholarDigital Library
K. Wagstaff, C. Cardie, S. Rogers, S. Schrödl, et al. Constrained k-means clustering with background knowledge. In Proceedings of the 18th International Conference on Machine Learning (ICML'01), volume 1, pages 577--584, 2001. Google ScholarDigital Library
E. P. Xing, M. I. Jordan, S. Russell, and A. Y. Ng. Distance metric learning with application to clustering with side-information. In Advances in neural information processing systems, pages 505--512, 2002. Google ScholarDigital Library
C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS), 22(2):179--214, 2004. Google ScholarDigital Library

Index Terms

Query-Biased Partitioning for Selective Search
1. Information systems
  1. Information retrieval
    1. Search engine architectures and scalability
      1. Distributed retrieval

Recommendations

Load-Balancing in Distributed Selective Search
SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval

Simulation and analysis have shown that selective search can reduce the cost of large-scale distributed information retrieval. By partitioning the collection into small topical shards, and then using a resource ranking algorithm to choose a subset of ...
Read More
Dynamic Shard Cutoff Prediction for Selective Search
SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval

Selective search architectures use resource selection algorithms such as Rank-S or Taily to rank index shards and determine how many to search for a given query. Most prior research evaluated solutions by their ability to improve efficiency without ...
Read More
Selective Search: Efficient and Effective Search of Large Textual Collections

The traditional search solution for large collections divides the collection into subsets (shards), and processes the query against all shards in parallel (exhaustive search). The search cost and the computational requirements of this approach are often ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management
October 2016
2566 pages
ISBN:9781450340731
DOI:10.1145/2983323
General Chairs:
Snehasis Mukhopadhyay
Indiana University Purdue University Indianapolis, USA
,
ChengXiang Zhai
University of Illinois at Urbana-Champaign, USA
,
Program Chairs:
Elisa Bertino
Purdue University
,
Fabio Crestani
University of Lugano
,
Javed Mostafa
University of North Carolina
,
Jie Tang
Tsinghua University
,
Luo Si
Alibaba Group Inc & Purdue University
,
Xiaofang Zhou
University of Queensland
,
Yi Chang
Yahoo Research
,
Yunyao Li
IBM Research - Almaden
,
Parikshit Sondhi
WalmartLabs
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 October 2016
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
distributed retrieval
selective search
shard partitioning
Qualifiers
- research-article
Conference

Acceptance Rates
CIKM '16 Paper Acceptance Rate160of701submissions,23%Overall Acceptance Rate1,861of8,427submissions,22%
More
Upcoming Conference
CIKM '24

Sponsor:

sigir

sigir

The 33rd ACM International Conference on Information and Knowledge Management

October 21 - 25, 2024

Boise , ID , USA
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 21
  Total Citations
  View Citations
- 397
  Total Downloads
- Downloads (Last 12 months)49
- Downloads (Last 6 weeks)6
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Query-Biased Partitioning for Selective Search

CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management

ABSTRACT

References

Cited By

Index Terms

Recommendations

Load-Balancing in Distributed Selective Search

Dynamic Shard Cutoff Prediction for Selective Search

Selective Search: Efficient and Effective Search of Large Textual Collections