ABSTRACT
In some retrieval situations, a system must search across multiple collections. This task, referred to as federated search, occurs, for example, when searching a distributed index or aggregating content for web search. Resource selection is the subtask of deciding, given a query, which collections to search. Most existing resource selection methods rely on evidence found in collection content. We present an approach to resource selection that combines multiple sources of evidence to inform the selection decision. We derive evidence from three sources: collection documents, the topic of the query, and query click-through data. We combine this evidence by treating resource selection as a multiclass machine learning problem. Although machine-learned approaches often require large amounts of manually generated training data, we present a method for using automatically generated training data. We build on and compare against prior resource selection work, and we evaluate across three experimental testbeds.
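To make the idea of combining evidence through classification concrete, the following is a minimal sketch and not the authors' implementation. It assumes each (query, collection) pair is described by three hypothetical feature scores (content, topic, click-through), trains one small logistic classifier per collection (a one-vs-rest simplification of the multiclass setting), and selects the collections with the highest predicted scores. All names, feature values, and training labels below are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, x):
    """Predicted probability that a collection is relevant; weights[0] is the bias."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


def train_logistic(examples, labels, epochs=500, lr=0.5):
    """Fit a logistic model by batch gradient descent on the log loss."""
    weights = [0.0] * (len(examples[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(examples, labels):
            err = score(weights, x) - y
            grad[0] += err
            for i, v in enumerate(x):
                grad[i + 1] += err * v
        weights = [w - lr * g / len(examples) for w, g in zip(weights, grad)]
    return weights


def select_resources(models, features_by_collection, k=1):
    """Rank collections by classifier score and keep the top k."""
    ranked = sorted(
        features_by_collection,
        key=lambda c: score(models[c], features_by_collection[c]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical training data: [content_score, topic_score, click_score] per query,
# with label 1 if the collection should be selected for that query.
training = {
    "news": ([[0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [0.2, 0.1, 0.1], [0.1, 0.2, 0.0]],
             [1, 1, 0, 0]),
    "sports": ([[0.7, 0.9, 0.8], [0.9, 0.7, 0.9], [0.1, 0.1, 0.2], [0.2, 0.0, 0.1]],
               [1, 1, 0, 0]),
}
models = {name: train_logistic(X, y) for name, (X, y) in training.items()}

# Hypothetical evidence for one new query, computed separately for each collection.
query_features = {"news": [0.85, 0.9, 0.8], "sports": [0.1, 0.2, 0.1]}
selected = select_resources(models, query_features, k=1)
print(selected)
```

In this sketch the classifier only learns how to weight the three evidence scores; how those scores are computed, and how training labels could be generated automatically, is left open here.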
Index Terms
- Classification-based resource selection