Probabilistic models of information retrieval based on measuring the divergence from randomness

Authors:
Gianni Amati
University of Glasgow, Fondazione Ugo Bordoni, Roma, Italy

Cornelis Joost Van Rijsbergen
University of Glasgow, Glasgow, Scotland

ACM Transactions on Information Systems, Volume 20, Issue 4, pp. 357–389
https://doi.org/10.1145/582415.582416

Published: 01 October 2002

Abstract

We introduce and create a framework for deriving probabilistic models of Information Retrieval. The models are nonparametric models of IR obtained in the language model approach. We derive term-weighting models by measuring the divergence of the actual term distribution from that obtained under a random process. Among the random processes we study the binomial distribution and Bose–Einstein statistics. We define two types of term frequency normalization for tuning term weights in the document–query matching process. The first normalization assumes that documents have the same length and measures the information gain with the observed term once it has been accepted as a good descriptor of the observed document. The second normalization is related to the document length and to other statistics. These two normalization methods are applied to the basic models in succession to obtain weighting formulae. Results show that our framework produces different nonparametric models forming baseline alternatives to the standard tf-idf model.
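As a rough illustration of how the pieces named in the abstract fit together, the sketch below combines one randomness model with the two normalizations. The abstract itself gives no formulae, so everything concrete here is an assumption: the geometric approximation of Bose–Einstein statistics, the Laplace form of the first normalization, the logarithmic length normalization with a free parameter `c`, and all function and parameter names are chosen for illustration and should be checked against the full paper.

```python
import math


def normalized_tf(tf, doc_len, avg_doc_len, c=1.0):
    """Second normalization (assumed form): rescale the raw term frequency
    to a standard document length; c is a hypothetical tuning parameter."""
    return tf * math.log2(1.0 + c * avg_doc_len / doc_len)


def inf_geometric(tfn, coll_freq, num_docs):
    """Informative content -log2 P(tfn) under a geometric approximation of
    Bose-Einstein statistics with mean lam = coll_freq / num_docs (assumed)."""
    lam = coll_freq / num_docs
    return -math.log2(1.0 / (1.0 + lam)) - tfn * math.log2(lam / (1.0 + lam))


def laplace_gain(tfn):
    """First normalization (assumed Laplace law of succession):
    information gain 1 - tfn / (tfn + 1) = 1 / (tfn + 1)."""
    return 1.0 / (tfn + 1.0)


def dfr_weight(tf, doc_len, avg_doc_len, coll_freq, num_docs, c=1.0):
    """Weight of one term in one document: gain times divergence from
    randomness, both computed on the length-normalized frequency."""
    if tf <= 0:
        return 0.0
    tfn = normalized_tf(tf, doc_len, avg_doc_len, c)
    return laplace_gain(tfn) * inf_geometric(tfn, coll_freq, num_docs)


if __name__ == "__main__":
    # Same within-document frequency, but the second term is rarer in the
    # collection, so it diverges more from randomness and scores higher.
    print(dfr_weight(tf=2, doc_len=100, avg_doc_len=100, coll_freq=100, num_docs=100))
    print(dfr_weight(tf=2, doc_len=100, avg_doc_len=100, coll_freq=10, num_docs=100))
```

In this sketch a document's score for a query would be the sum of `dfr_weight` over the query terms it contains. Swapping `inf_geometric` for a binomial or Poisson model changes the randomness assumption without touching the two normalizations, which is the modularity the abstract describes.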

Index Terms

1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
2. Mathematics of computing
  1. Probability and statistics

Published in

ACM Transactions on Information Systems Volume 20, Issue 4
October 2002
90 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/582415

Copyright © 2002 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Aftereffect model
BM25
Bose–Einstein statistics
Laplace
Poisson
binomial law
document length normalization
eliteness
idf
information retrieval
probabilistic models
randomness
succession law
term frequency normalization
term weighting