research-article

Interpreting TF-IDF term weights as making relevance decisions

Published: 20 June 2008

Abstract

A novel probabilistic retrieval model is presented. It provides a basis for interpreting TF-IDF term weights as making relevance decisions. The model simulates local relevance decision-making at every location of a document and combines all of these “local” relevance decisions into a “document-wide” relevance decision for the document. The significance of interpreting TF-IDF in this way is the potential to: (1) establish a unifying perspective on information retrieval as relevance decision-making; and (2) develop advanced TF-IDF-related term weights for future, more elaborate retrieval models. Our novel retrieval model is simplified to a basic ranking formula that directly corresponds to the TF-IDF term weights. In general, we show that the term-frequency factor of the ranking formula can be rendered into the different term-frequency factors of existing retrieval systems. In the basic ranking formula, the remaining quantity is −log p(r̄ | t ∈ d), where p(r̄ | t ∈ d) is interpreted as the probability of randomly picking a nonrelevant usage (denoted by r̄) of term t. Mathematically, we show that this quantity can be approximated by the inverse document frequency (IDF). Empirically, we show that this quantity is related to IDF, using four reference TREC ad hoc retrieval data collections.
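
As a rough illustration of the correspondence described in the abstract, the sketch below scores a document by summing one fixed weight per occurrence of each query term, so the sum reduces to term frequency times that weight. It is not the paper's derivation: the per-occurrence weight is assumed here to be the plain IDF, log(N/df), standing in for −log p(r̄ | t ∈ d), and the function names `idf` and `tfidf_score` and the toy collection `docs` are hypothetical.

```python
import math
from collections import Counter


def idf(term, docs):
    """Plain inverse document frequency, log(N / df).

    Assumption: this stands in for -log p(r-bar | t in d), i.e. the
    probability is approximated by df / N. Returns 0.0 for unseen terms.
    """
    n_docs = len(docs)
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0
    return math.log(n_docs / df)


def tfidf_score(query, doc, docs):
    """Sum one IDF weight per occurrence of each query term in doc.

    Adding the same weight at every location where a term occurs is
    equivalent to multiplying the term frequency by that weight.
    """
    tf = Counter(doc)
    return sum(tf[t] * idf(t, docs) for t in query)


# Toy collection (hypothetical): each document is a list of tokens.
docs = [
    ["tf", "idf", "term", "weights"],
    ["term", "frequency", "term", "ranking"],
    ["relevance", "decision", "ranking"],
]

query = ["term", "ranking"]
scores = [tfidf_score(query, d, docs) for d in docs]
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, scores)
```

With this toy collection the second document ranks highest, because it contains "term" twice and "ranking" once. Real systems typically dampen or normalize the term-frequency factor, which this sketch deliberately leaves out.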


    • Published in

      ACM Transactions on Information Systems, Volume 26, Issue 3
      June 2008
      236 pages
      ISSN: 1046-8188
      EISSN: 1558-2868
      DOI: 10.1145/1361684

      Copyright © 2008 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 20 June 2008
      • Accepted: 1 September 2007
      • Revised: 1 July 2007
      • Received: 1 August 2004


      Qualifiers

      • research-article
      • Research
      • Refereed
