ABSTRACT
This paper presents a unified utility framework for resource selection of distributed text information retrieval. This new framework shows an efficient and effective way to infer the probabilities of relevance of all the documents across the text databases. With the estimated relevance information, resource selection can be made by explicitly optimizing the goals of different applications. Specifically, when used for database recommendation, the selection is optimized for the goal of high-recall (include as many relevant documents as possible in the selected databases); when used for distributed document retrieval, the selection targets the high-precision goal (high precision in the final merged list of documents). This new model provides a more solid framework for distributed information retrieval. Empirical studies show that it is at least as effective as other state-of-the-art algorithms.
- J. Callan. (2000). Distributed information retrieval. In W.B. Croft, editor, Advances in Information Retrieval. Kluwer Academic Publishers. (pp. 127--150).Google Scholar
- J. Callan, W.B. Croft, and J. Broglio. (1995). TREC and TIPSTER experiments with INQUERY. Information Processing and Management, 31(3). (pp. 327--343). Google ScholarDigital Library
- J. G. Conrad, X. S. Guo, P. Jackson and M. Meziou. (2002). Database selection using actual physical and acquired logical collection resources in a massive domain-specific operational environment. Distributed search over the hidden web: Hierarchical database sampling and selection. In Proceedings of the 28th International Conference on Very Large Databases (VLDB). Google ScholarDigital Library
- N. Craswell. (2000). Methods for distributed information retrieval. Ph. D. thesis, The Australian Nation University.Google Scholar
- N. Craswell, D. Hawking, and P. Thistlewaite. (1999). Merging results from isolated search engines. In Proceedings of 10th Australasian Database Conference.Google Scholar
- D. D'Souza, J. Thom, and J. Zobel. (2000). A comparison of techniques for selecting text collections. In Proceedings of the 11th Australasian Database Conference. Google ScholarDigital Library
- N. Fuhr. (1999). A Decision-Theoretic approach to database selection in networked IR. ACM Transactions on Information Systems, 17(3). (pp. 229--249). Google ScholarDigital Library
- L. Gravano, C. Chang, H. Garcia-Molina, and A. Paepcke. (1997). STARTS: Stanford proposal for internet meta-searching. In Proceedings of the 20th ACM-SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- L. Gravano, P. Ipeirotis and M. Sahami. (2003). QProber: A System for Automatic Classification of Hidden-Web Databases. ACM Transactions on Information Systems, 21(1). Google ScholarDigital Library
- P. Ipeirotis and L. Gravano. (2002). Distributed search over the hidden web: Hierarchical database sampling and selection. In Proceedings of the 28th International Conference on Very Large Databases (VLDB). Google ScholarDigital Library
- InvisibleWeb.com. http://www.invisibleweb.comGoogle Scholar
- The lemur toolkit. http://www.cs.cmu.edu/ lemurGoogle Scholar
- J. Lu and J. Callan. (2003). Content-based information retrieval in peer-to-peer networks. In Proceedings of the 12th International Conference on Information and Knowledge Management. Google ScholarDigital Library
- W. Meng, C.T. Yu and K.L. Liu. (2002) Building efficient and effective metasearch engines. ACM Comput. Surv. 34(1). Google ScholarDigital Library
- H. Nottelmann and N. Fuhr. (2003). Evaluating different method of estimating retrieval quality for resource selection. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Google ScholarDigital Library
- H., Nottelmann and N., Fuhr. (2003). The MIND architecture for heterogeneous multimedia federated digital libraries. ACM SIGIR 2003 Workshop on Distributed Information Retrieval.Google Scholar
- A.L. Powell, J.C. French, J. Callan, M. Connell, and C.L. Viles. (2000). The impact of database selection on distributed searching. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Google ScholarDigital Library
- A.L. Powell and J.C. French. (2003). Comparing the performance of database selection algorithms. ACM Transactions on Information Systems, 21(4). (pp. 412--456). Google ScholarDigital Library
- C. Sherman (2001). Search for the invisible web. Guardian Unlimited.Google Scholar
- L. Si and J. Callan. (2002). Using sampled data and regression to merge search engine results. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Google ScholarDigital Library
- L. Si and J. Callan. (2003). Relevant document distribution estimation method for resource selection. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Google ScholarDigital Library
- L. Si and J. Callan. (2003). A Semi-Supervised learning method to merge search engine results. ACM Transactions on Information Systems, 21(4). (pp. 457--491). Google ScholarDigital Library
Index Terms
- Unified utility maximization framework for resource selection
Recommendations
A Set-Covering-Based Approach for Overlapping Resource Selection in Distributed Information Retrieval
CSIE '09: Proceedings of the 2009 WRI World Congress on Computer Science and Information Engineering - Volume 04Resource selection, also called server selection, collection selection or database selection, is a foundational problem in distributed information retrieval (DIR). This paper introduces a set-covering-based algorithm for resource selection in DIR, with ...
Evaluating Document Retrieval Methods for Resource Selection in Clustered P2P IR
CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge ManagementResource Selection (or Query Routing) is an important step in P2P IR. Though analogous to document retrieval in the sense of choosing a relevant subset of resources, resource selection methods have evolved independently from those for document ...
A semisupervised learning method to merge search engine results
The proliferation of searchable text databases on local area networks and the Internet causes the problem of finding information that may be distributed among many disjoint text databases (distributed information retrieval). How to merge the results ...
Comments