Abstract
We introduce a framework for deriving probabilistic models of Information Retrieval (IR). The models are nonparametric and are obtained within the language model approach. We derive term-weighting models by measuring the divergence of the actual term distribution from the distribution obtained under a random process. Among the random processes, we study the binomial distribution and Bose-Einstein statistics. We define two types of term frequency normalization for tuning term weights in the document-query matching process. The first normalization assumes that documents have the same length and measures the information gain from the observed term once it has been accepted as a good descriptor of the observed document. The second normalization is related to the document length and to other statistics. These two normalization methods are applied to the basic models in succession to obtain weighting formulae. Results show that our framework produces different nonparametric models that form baseline alternatives to the standard tf-idf model.
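The pipeline described above (a basic randomness model followed by two normalizations applied in succession) can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything concrete here is an assumption: a geometric approximation stands in for the Bose-Einstein model, a Laplace-style factor `1 / (tfn + 1)` stands in for the information-gain normalization, and a logarithmic rescaling stands in for the document-length normalization. All function and parameter names are hypothetical.

```python
import math


def normalized_tf(tf, doc_len, avg_doc_len):
    """Length normalization (assumed form): rescale the raw term
    frequency as if the document had the average length."""
    return tf * math.log2(1.0 + avg_doc_len / doc_len)


def inf_geometric(tfn, coll_freq, num_docs):
    """Informative content -log2 P(tfn) under a geometric approximation
    of Bose-Einstein statistics (assumed basic model).
    lam is the mean number of occurrences of the term per document."""
    lam = coll_freq / num_docs
    return -math.log2(1.0 / (1.0 + lam)) - tfn * math.log2(lam / (1.0 + lam))


def information_gain(tfn):
    """Information-gain normalization (assumed Laplace-style form)."""
    return 1.0 / (tfn + 1.0)


def dfr_weight(tf, doc_len, avg_doc_len, coll_freq, num_docs):
    """Combine the pieces: length-normalize tf, then scale the
    divergence from randomness by the information gain."""
    tfn = normalized_tf(tf, doc_len, avg_doc_len)
    return information_gain(tfn) * inf_geometric(tfn, coll_freq, num_docs)


if __name__ == "__main__":
    # Hypothetical collection: 1000 documents, average length 100 tokens.
    rare = dfr_weight(tf=3, doc_len=100, avg_doc_len=100,
                      coll_freq=10, num_docs=1000)
    common = dfr_weight(tf=3, doc_len=100, avg_doc_len=100,
                        coll_freq=5000, num_docs=1000)
    print(f"rare term weight:   {rare:.3f}")
    print(f"common term weight: {common:.3f}")
```

Under these assumptions, a term that is rare in the collection diverges more from the random model and so receives a larger weight than a common term with the same within-document frequency, while the same frequency in a longer-than-average document is discounted.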
Index Terms
- Probabilistic models of information retrieval based on measuring the divergence from randomness