Article

Modeling score distributions for combining the outputs of search engines

Authors:
R. Manmatha, Univ. of Massachusetts, Amherst
T. Rath, Univ. of Massachusetts, Amherst
F. Feng, Univ. of Massachusetts, Amherst

SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, September 2001, Pages 267–275, https://doi.org/10.1145/383952.384005

Published: 01 September 2001

ABSTRACT

In this paper the score distributions of a number of text search engines are modeled. It is shown empirically that the score distributions on a per query basis may be fitted using an exponential distribution for the set of non-relevant documents and a normal distribution for the set of relevant documents. Experiments show that this model fits TREC-3 and TREC-4 data for not only probabilistic search engines like INQUERY but also vector space search engines like SMART for English. We have also used this model to fit the output of other search engines like LSI search engines and search engines indexing other languages like Chinese.
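As a rough illustration of the per-query fit described above, the sketch below estimates a normal distribution from the scores of relevant documents and an exponential distribution from the scores of non-relevant documents. The use of maximum-likelihood estimates, the function names, and the sample scores are assumptions made for illustration; the abstract does not state the estimation procedure, and the scores are assumed to be non-negative.

```python
import math
from statistics import mean, pstdev

def fit_score_model(relevant, non_relevant):
    """Fit a normal to relevant scores and an exponential to non-relevant scores.

    Illustrative maximum-likelihood estimates; assumes non-negative scores.
    """
    mu = mean(relevant)
    sigma = pstdev(relevant)          # population std is the ML estimate
    lam = 1.0 / mean(non_relevant)    # ML rate of an exponential distribution
    return mu, sigma, lam

def normal_pdf(s, mu, sigma):
    """Density of the normal distribution at score s."""
    z = (s - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def exponential_pdf(s, lam):
    """Density of the exponential distribution at score s (zero for s < 0)."""
    return lam * math.exp(-lam * s) if s >= 0 else 0.0

# Hypothetical scores for a single query (not taken from the paper).
relevant = [0.62, 0.55, 0.71, 0.48, 0.66]
non_relevant = [0.05, 0.12, 0.02, 0.30, 0.08, 0.18, 0.01, 0.04]
mu, sigma, lam = fit_score_model(relevant, non_relevant)
```

With these made-up scores the fitted normal is centred on the mean relevant score and the exponential rate is the reciprocal of the mean non-relevant score.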

It is then shown that given a query for which relevance information is not available, a mixture model consisting of an exponential and a normal distribution can be fitted to the score distribution. These distributions can be used to map the scores of a search engine to probabilities. We also discuss how the shape of the score distributions arises given certain assumptions about word distributions in documents. We hypothesize that all 'good' text search engines operating on any language have similar characteristics.
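The mixture fit and the score-to-probability mapping can be sketched as follows. This is a minimal illustration, not the authors' implementation: expectation-maximization (EM) is a standard way to fit mixtures and is assumed here because the abstract does not name the fitting procedure. The initialisation heuristic, the function names, and the synthetic scores are likewise hypothetical.

```python
import math
import random

def normal_pdf(s, mu, sigma):
    z = (s - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def exponential_pdf(s, lam):
    return lam * math.exp(-lam * s) if s >= 0 else 0.0

def fit_mixture(scores, iterations=200):
    """EM for p(s) = w * Normal(mu, sigma) + (1 - w) * Exponential(lam)."""
    n = len(scores)
    # Crude initialisation (an assumption): treat the top fifth as relevant.
    top = sorted(scores)[-max(2, n // 5):]
    w = len(top) / n
    mu = sum(top) / len(top)
    sigma = max(1e-3, math.sqrt(sum((s - mu) ** 2 for s in top) / len(top)))
    lam = 1.0 / max(1e-9, sum(scores) / n)
    for _ in range(iterations):
        # E-step: responsibility of the normal (relevant) component.
        resp = []
        for s in scores:
            rel = w * normal_pdf(s, mu, sigma)
            non = (1.0 - w) * exponential_pdf(s, lam)
            resp.append(rel / (rel + non) if rel + non > 0 else 0.5)
        # M-step: weighted maximum-likelihood updates.
        r_sum = max(sum(resp), 1e-12)
        w = r_sum / n
        mu = sum(r * s for r, s in zip(resp, scores)) / r_sum
        var = sum(r * (s - mu) ** 2 for r, s in zip(resp, scores)) / r_sum
        sigma = max(1e-3, math.sqrt(var))
        non_weighted = sum((1.0 - r) * s for r, s in zip(resp, scores))
        lam = (n - r_sum) / max(non_weighted, 1e-12)
    return w, mu, sigma, lam

def prob_relevant(s, w, mu, sigma, lam):
    """Posterior P(relevant | score) by Bayes' rule under the fitted mixture."""
    rel = w * normal_pdf(s, mu, sigma)
    non = (1.0 - w) * exponential_pdf(s, lam)
    return rel / (rel + non)

# Synthetic scores for one query; no relevance labels are used in the fit.
random.seed(0)
scores = [random.expovariate(10.0) for _ in range(900)]
scores += [random.gauss(0.6, 0.1) for _ in range(100)]
w, mu, sigma, lam = fit_mixture(scores)
```

Calling `prob_relevant(score, w, mu, sigma, lam)` then maps any raw score to a probability of relevance; high scores fall under the normal component and low scores under the exponential one.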

This model has many possible applications. For example, the outputs of different search engines can be combined by averaging the probabilities (optimal if the search engines are independent) or by using the probabilities to select the best engine for each query. Results show that the technique performs as well as the best current combination techniques.
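Combination by averaging can be illustrated with a short sketch. It assumes each engine's scores have already been mapped to relevance probabilities; the function name, the document identifiers, the probability values, and the choice to treat a document missing from an engine's results as probability 0.0 are all assumptions made for this example.

```python
def combine_by_averaging(engine_probs):
    """Average each document's relevance probability across engines.

    engine_probs: list of dicts mapping document id -> P(relevant | score).
    Documents an engine did not return are given 0.0 (an assumption).
    """
    doc_ids = set().union(*engine_probs)
    combined = {
        doc: sum(p.get(doc, 0.0) for p in engine_probs) / len(engine_probs)
        for doc in doc_ids
    }
    # Rank documents by combined probability, highest first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Hypothetical per-engine probabilities for one query.
engine_a = {"d1": 0.90, "d2": 0.40, "d3": 0.10}
engine_b = {"d1": 0.50, "d2": 0.80, "d4": 0.30}
ranking = combine_by_averaging([engine_a, engine_b])
```

Because the inputs are probabilities rather than raw scores, engines with different score ranges can be averaged directly without separate normalisation.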

Index Terms

Modeling score distributions for combining the outputs of search engines
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
  2. Information retrieval
    1. Evaluation of retrieval results
      1. Relevance assessment
    2. Information retrieval query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Published in
SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
September 2001
454 pages
ISBN:1581133316
DOI:10.1145/383952
Chairmen:
Donald H. Kraft, Louisiana State Univ.
W. Bruce Croft, University of Massachusetts (For the Americas)
David J. Harper, The Robert Gordon University (For Europe and Africa)
Justin Zobel, RMIT University (For Asia and Australasia)
Copyright © 2001 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
SIGIR '01 Paper Acceptance Rate: 47 of 201 submissions, 23%. Overall Acceptance Rate: 792 of 3,983 submissions, 20%.