DOI: 10.1145/564376.564382

Using sampled data and regression to merge search engine results

Published: 11 August 2002

ABSTRACT

This paper addresses the problem of merging results obtained from different databases and search engines in a distributed information retrieval environment. Prior research on this problem either assumed the exchange of the statistics needed to normalize scores (cooperative solutions) or relied on heuristics. Both approaches have disadvantages. We show that the problem in uncooperative environments is simpler when viewed as a component of a distributed IR system that uses query-based sampling to create resource descriptions. The documents sampled to create resource descriptions can also be used to build a sampled centralized index, and this index is a source of training data for adaptive results-merging algorithms. A variety of experiments demonstrate that this new approach is more effective than a well-known alternative, and that it allows query-by-query tuning of the results-merging function.
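The merging approach the abstract describes can be sketched as follows. This is a minimal illustration under my own assumptions (the function names, the fallback behavior, and the simple ordinary-least-squares fit are not taken from the paper): documents that appear both in an engine's result list and in the centralized index built from sampled documents yield (engine score, centralized score) training pairs, and a per-engine linear regression learned from those pairs maps all of that engine's scores onto the common centralized scale before a single merged sort.

```python
# Sketch of regression-based results merging. Overlap documents (those
# present in both an engine's result list and the sampled centralized
# index) supply training pairs; a per-engine linear fit then normalizes
# every score from that engine onto the centralized scale.

def fit_linear(pairs):
    """Ordinary least-squares fit y = a*x + b from (x, y) pairs."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def merge(engine_results, centralized_scores):
    """engine_results: {engine: [(doc_id, score), ...]}
    centralized_scores: {doc_id: score} from the sampled centralized index.
    Returns one merged list of (doc_id, normalized_score), best first."""
    merged = []
    for engine, results in engine_results.items():
        # Overlap documents are the training data for this engine.
        pairs = [(s, centralized_scores[d]) for d, s in results
                 if d in centralized_scores]
        if len(pairs) >= 2:
            a, b = fit_linear(pairs)
        else:
            a, b = 1.0, 0.0  # too little overlap: fall back to raw scores
        merged.extend((d, a * s + b) for d, s in results)
    merged.sort(key=lambda t: t[1], reverse=True)
    return merged
```

Because the fit is learned from the overlap documents of the current query's result lists, the mapping adapts per engine and per query, which is what makes query-by-query tuning of the merging function possible.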


Published in

SIGIR '02: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
August 2002, 478 pages
ISBN: 1581135610
DOI: 10.1145/564376
Copyright © 2002 ACM


Publisher: Association for Computing Machinery, New York, NY, United States



Acceptance Rates

SIGIR '02 paper acceptance rate: 44 of 219 submissions (20%). Overall acceptance rate: 792 of 3,983 submissions (20%).
