Probabilistic models of information retrieval based on measuring the divergence from randomness

Published: 01 October 2002

Abstract

We introduce a framework for deriving probabilistic models of Information Retrieval (IR). The models are nonparametric models of IR obtained in the language model approach. We derive term-weighting models by measuring the divergence of the actual term distribution from that obtained under a random process. Among the random processes, we study the binomial distribution and Bose-Einstein statistics. We define two types of term frequency normalization for tuning term weights in the document-query matching process. The first normalization assumes that documents have the same length and measures the information gain with the observed term once it has been accepted as a good descriptor of the observed document. The second normalization is related to the document length and to other statistics. These two normalization methods are applied to the basic models in succession to obtain weighting formulae. Results show that our framework produces different nonparametric models forming baseline alternatives to the standard tf-idf model.
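The abstract gives no formulas, so the following Python sketch is only one plausible reading of the pipeline it describes, not the authors' definitive method. It scores a term by the information content (negative log-probability) of its observed frequency under a binomial random process, then applies the two normalizations in succession. The specific choices here are assumptions made for illustration: a Laplace-style gain `1 / (tfn + 1)` for the first normalization, a logarithmic rescaling by average document length for the second, and all function and parameter names. The Bose-Einstein variant is not covered.

```python
import math


def binomial_information(tf: float, total_freq: int, num_docs: int) -> float:
    """Bits of information in seeing `tf` occurrences of a term in one document.

    Random model (assumption): the term's `total_freq` occurrences in the
    collection are thrown independently and uniformly into `num_docs`
    documents, so the count in a given document is binomial with p = 1/num_docs.
    Requires num_docs > 1 and 0 <= tf <= total_freq.
    """
    p = 1.0 / num_docs
    # lgamma lets the binomial coefficient accept a non-integer, normalized tf.
    log_coeff = (
        math.lgamma(total_freq + 1)
        - math.lgamma(tf + 1)
        - math.lgamma(total_freq - tf + 1)
    )
    log_prob = log_coeff + tf * math.log(p) + (total_freq - tf) * math.log(1.0 - p)
    return -log_prob / math.log(2)


def length_normalized_tf(tf: float, doc_len: float, avg_doc_len: float) -> float:
    """Second normalization (assumed form): rescale tf by relative document length."""
    return tf * math.log2(1.0 + avg_doc_len / doc_len)


def laplace_gain(tfn: float) -> float:
    """First normalization (assumed form): information gain once the term is
    accepted as a descriptor, shrinking as the normalized frequency grows."""
    return 1.0 / (tfn + 1.0)


def dfr_weight(
    tf: float, doc_len: float, avg_doc_len: float, total_freq: int, num_docs: int
) -> float:
    """Term weight = gain * divergence from randomness, on the normalized tf."""
    tfn = length_normalized_tf(tf, doc_len, avg_doc_len)
    return laplace_gain(tfn) * binomial_information(tfn, total_freq, num_docs)


if __name__ == "__main__":
    # Hypothetical collection: 1000 documents, term occurs 100 times overall.
    for tf in (1, 5):
        w = dfr_weight(tf, doc_len=100, avg_doc_len=100, total_freq=100, num_docs=1000)
        print(f"tf={tf}: weight={w:.3f}")
```

In this sketch a term that occurs far more often in a document than the random process predicts receives a larger weight, while the gain factor damps the growth so that repeated occurrences contribute progressively less.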
