ABSTRACT
Prior research under a variety of conditions has shown the CORI algorithm to be one of the most effective resource selection algorithms, but the range of database sizes studied was not large. This paper shows that the CORI algorithm does not do well in environments with a mix of "small" and "very large" databases. A new resource selection algorithm is proposed that uses information about database sizes as well as database contents. We also show how to acquire database size estimates in uncooperative environments as an extension of the query-based sampling used to acquire resource descriptions. Experiments demonstrate that the database size estimates are more accurate for large databases than estimates produced by a competing method; the new resource ranking algorithm is always at least as effective as the CORI algorithm; and the new algorithm results in better document rankings than the CORI algorithm.
Index Terms
- Relevant document distribution estimation method for resource selection