research-article

Interpreting TF-IDF term weights as making relevance decisions

Authors:
Ho Chung Wu

The Hong Kong Polytechnic University, Kowloon, Hong Kong

The Hong Kong Polytechnic University, Kowloon, Hong Kong
View Profile

,
Robert Wing Pong Luk

The Hong Kong Polytechnic University, Kowloon, Hong Kong

The Hong Kong Polytechnic University, Kowloon, Hong Kong
View Profile

,
Kam Fai Wong

The Chinese University of Hong Kong, Hong Kong SAR, The People's Republic of China

The Chinese University of Hong Kong, Hong Kong SAR, The People's Republic of China
View Profile

,
Kui Lam Kwok

Queens College, City University of New York, Flushing, NY

Queens College, City University of New York, Flushing, NY
View Profile

Authors Info & Claims

ACM Transactions on Information Systems Volume 26 Issue 3Article No.: 13pp 1–37https://doi.org/10.1145/1361684.1361686

Published:20 June 2008Publication History

ACM Transactions on Information Systems

Abstract

A novel probabilistic retrieval model is presented. It forms a basis to interpret the TF-IDF term weights as making relevance decisions. It simulates the local relevance decision-making for every location of a document, and combines all of these “local” relevance decisions as the “document-wide” relevance decision for the document. The significance of interpreting TF-IDF in this way is the potential to: (1) establish a unifying perspective about information retrieval as relevance decision-making; and (2) develop advanced TF-IDF-related term weights for future elaborate retrieval models. Our novel retrieval model is simplified to a basic ranking formula that directly corresponds to the TF-IDF term weights. In general, we show that the term-frequency factor of the ranking formula can be rendered into different term-frequency factors of existing retrieval systems. In the basic ranking formula, the remaining quantity - log p(r¯|t ∈ d) is interpreted as the probability of randomly picking a nonrelevant usage (denoted by r¯) of term t. Mathematically, we show that this quantity can be approximated by the inverse document-frequency (IDF). Empirically, we show that this quantity is related to IDF, using four reference TREC ad hoc retrieval data collections.

References

Aizawa, A. 2003. An information-theoretic perspective of TF-IDF measures. Inf. Process. Manage. 39, 1, 45--65.]] Google ScholarDigital Library
Amati, G. and van Rijsbergen, C. J. 1998. Semantic information retrieval. In Information Retrieval: Uncertainty and Logics, C. J. Van Rijsbergen et al., Eds. Kluwer Academic, 189--220.]]Google Scholar
Amati, G. and van Rijsbergen, C. J. 2002. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20, 4, 357--389.]] Google ScholarDigital Library
Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison-Wesley, New York.]] Google ScholarDigital Library
Bodoff, D. and Robertson, S. E. 2004. A new unified probabilistic model. J. Amer. Soc. Inf. Sci. Technol. 55, 6, 471--487.]] Google ScholarDigital Library
Bookstein, A. and Swanson, D. 1974. Probabilistic models for automatic indexing. J. Amer. Soc. Inf. Sci. 25, 312--318.]]Google ScholarCross Ref
Calado, P., Ribeiro-Neto, B., Ziviani, N., Moura, E., and Silva, I. 2003. Local versus global link information in the Web. ACM Trans. Inf. Syst. 21, 1, 42--63.]] Google ScholarDigital Library
Clarke, C. L. A. and Scholer, F. 2005. The 2005 terabyte track. In Proceedings of the 14th Text Retrieval Conference, Gaithersburg, MD, E. M. Voorhees and L. P. Buckland, Eds. National Institute of Standards and Technology.]]Google Scholar
Clough, P., Sanderson, M., and Müller, H. 2004. The CLEF cross language image retrieval track (ImageCLEF) 2004. In Proceedings of 3rd International Conference on Image and Video Conference, Dublin, Ireland, P. Enser et al., Eds. Lecture Notes in Computer Science, vol. 3115. Springer, 243--251.]]Google Scholar
Cooper, W. S., Chen, A., and Gey, F. C. 1993. Full text retrieval based on probabilistic equations with coefficients fitted by logistic regression. In Proceedings of the 2nd Text Retrieval Conference, Gaithersburg, MD, D. K. Harman, Ed. National Institute of Standards and Technology, 57--66.]]Google Scholar
Cooper, W. S., Gey, F. C., and Dabney, D. P. 1992. Probabilistic retrieval based on staged logistic regression. In Proceedings of the 15th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Denmark, E. Fox et al., Eds. ACM, New York, 198--210.]] Google ScholarDigital Library
Cooper, W. S. and Maron, M. E. 1978. Foundations of probabilistic and utility-theoretic indexing. J. ACM 25, 1, 67--80.]] Google ScholarDigital Library
Crestani, F., Lalmas, M., van Rijsbergen, C. J., and Campbell, I. 1998. “Is this document relevant&quest;… probably”: A survey of probabilistic models in information retrieval. ACM Comput. Surv. 30, 4, 528--552.]] Google ScholarDigital Library
Croft, W. B. and Harper, D. J. 1979. Using probabilistic models of document retrieval without relevance information. J. Document. 35, 285--295.]]Google ScholarCross Ref
Damerau, F. 1965. An experiment in automatic indexing. Amer. Document. 16, 283--289.]]Google ScholarCross Ref
de Vries, A. P. and Roelleke, T. 2005. Relevance information: A loss of entropy but a gain for IDF&quest; In Proceedings of the 28th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, G. Marchionini et al., Eds. ACM, New York, 282--289.]] Google ScholarDigital Library
Dombi, J. 1982. A general class of fuzzy operators, the DeMorgan class of fuzzy operators and fuzziness measures induced by fuzzy operators. Fuzzy Sets Syst. 8, 149--163.]]Google ScholarCross Ref
Dyckhoff, H. and Pedrycz, W. 1984. Generalized means as a model of compensative connectives. Fuzzy Sets Syst. 14, 143--154.]]Google ScholarCross Ref
Feller, W. 1968. An Introduction to Probability Theory and Its Applications, Vol. 1, 3rd ed. Wiley, New York.]]Google Scholar
French, S. 1986. Decision Theory:An Introduction to the Mathematics of Rationality. Ellis Horwood, Chichester, UK.]] Google ScholarDigital Library
Fuhr, N. 1989. Models for retrieval with probabilistic indexing. Inf. Process. Manage. 25, 1, 55--72.]] Google ScholarDigital Library
Gale, W. A., Church, K. W., and Yarowsky, D. 1992. Work on statistical methods for word sense disambiguation. In Working Notes of the AAAI Fall Symposium Series, Probabilistic Approaches to Natural Language, Cambridge, MA, 54--60.]]Google Scholar
Harman, D. 2004. Personal communication at NTCIR-4.]]Google Scholar
Harter, S. P. 1974. A probabilistic approach to automatic keyword indexing. Ph.D. thesis, Graduate Library, The University of Chicago, Thesis no. T25146.]]Google Scholar
Harter, S. P. 1975a. A probabilistic approach to automatic keyword indexing. Part I: On the distribution of specialty words in a technical literature. J. Amer. Soc. Inf. Sci. 26, 4, 197--206.]]Google ScholarCross Ref
Harter, S. P. 1975b. A probabilistic approach to automatic keyword indexing. Part II: An algorithm for probabilistic indexing. J. Amer. Soc. Inf. Sci. 26, 4, 280--289.]]Google ScholarCross Ref
Hiemstra, D. 1998. A linguistically motivated probabilistic model of information retrieval. In Proceedings of the 2nd European Conference on Research and Advanced Technology for Digital Libraries, Heraklion, Crete, Greece, C. Nikolaou and C. Stephanidis, Eds. Springer, London, 569--584.]] Google ScholarDigital Library
Huang, X., Peng, F., Shuurmans, D., Cercone, N., and Robertson, S. E. 2003. Applying machine learning to text segmentation for information retrieval. Inf. Retr. 6, 3--4, 333--362.]] Google ScholarDigital Library
Hung, K. Y., Luk, R. W. P., Yeung, D. S., Chung, K. F. L., and Shu, W. H. 2001. Determination of context window size. Int. J. Comput. Process. Orient. Lang. 14, 1, 71--80.]]Google ScholarCross Ref
Joachims, T. 1997. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, D. H. Fisher, Ed. Morgan Kaufmann, San Francisco, CA, 143--151.]] Google ScholarDigital Library
Kaszkiel, M., Zobel, J., and Sacks-Davis, R. 1999. Efficient passage ranking for document databases. ACM Trans. Inf. Syst. 17, 4, 406--439.]] Google ScholarDigital Library
Klir, G. J. and Folger, T. A. 1988. Fuzzy Sets, Uncertainty, and Information. Prentice-Hall, NJ.]] Google ScholarDigital Library
Kwok, K. L. 1995. A network approach to probabilistic information retrieval. ACM Trans. Inf. Syst. 13, 3, 324--353.]] Google ScholarDigital Library
Lau, K. Y. K. and Luk, R. W. P. 1999. Word-Sense classification by hierarchical clustering. J. Chinese Lang. Comput. 9, 1, 101--121.]]Google Scholar
Lavrenko, V. and Croft, W. B. 2001. Relevance-Based language model. In Proceedings of the 24th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, LA, D. H. Kraft et al., Eds. ACM, New York, 120--127.]] Google ScholarDigital Library
Lavrenko, V. and Croft, W. B. 2003. Relevance models in information retrieval. In Language Modeling for Information Retrieval, W. B. Croft and J. Lafferty, Eds. Kluwer Academic, Chapter 2.]]Google Scholar
Lee, L. 2007. IDF revisited: A simple new derivation within the Robertson-Spärck Jones probabilistic model. In Proceedings of the 30th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, C. L. A. Clarke et al., Eds. ACM, New York, 751--752.]] Google ScholarDigital Library
Liu, X. and Croft, W. B. 2002. Passage retrieval based on language models. In Proceedings of the 11th ACM Conference on Information and Knowledge Management, Mclean, VA, C. Nicholas et al., Eds. ACM, New York, 375--382.]] Google ScholarDigital Library
Lucassen, J. M. and Mercer, R. L. 1984. An information-theoretic approach to automatic determination of phonemic baseforms. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, San Diego, CA. IEEE, 304--307.]]Google Scholar
Luhn, H. 1958. The automatic creation of literature abstracts. IBM J. Res. Devel. 2, 2, 159--165.]]Google ScholarDigital Library
Margulis, E. L. 1992. N-Poisson document modelling. In Proceedings of the 15th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Denmark, E. Fox et al. Eds. ACM, New York, 177--189.]] Google ScholarDigital Library
Maron, M. E. and Kuhns, J. L. 1960. On relevance, probabilistic indexing and information retrieval. J. ACM 25, 3, 216--244.]] Google ScholarDigital Library
Paice, C. D. 1984. Soft evaluation of Boolean search queries in information retrieval systems. Inf. Technol. Res. Devel. Appl. 3, 1, 33--41.]] Google ScholarDigital Library
Papineni, K. 2001. Why inverse document frequency&quest; In Proceedings of the 2nd Meeting of the North American Chapter of The Association for Computational Linguistics, Pittsburgh, PA, L. Levin et al., Eds. Association for Computational Linguistics, Morristown, NJ, 25--32.]] Google ScholarDigital Library
Pickens, J. and Macfarlane, A. 2006. Term context models for information retrieval. In Proceedings of the 15th ACM Conference on Information and Knowledge Management, Arlington, VA, E. Fox et al., Eds. ACM, New York, 559--566.]] Google ScholarDigital Library
Ponte, J. M. and Croft, W. B. 1998. A language modeling approach in information retrieval. In Proceedings of the 21st Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, W. B. Croft et al., Eds. ACM, New York, 275--281.]] Google ScholarDigital Library
Porter, M. 1980. An algorithm for suffix stripping. Program 14, 3, 130--137.]]Google ScholarCross Ref
Robertson, S. E. 1977. The probability ranking principle in IR. J. Document. 33, 4, 294--304.]]Google ScholarCross Ref
Robertson, S. E. 1997. Overview of the Okapi projects. J. Document. 53, 1, 3--7.]]Google ScholarCross Ref
Robertson, S. E. 2004. Understanding inverse document frequency: On theoretical arguments for IDF. J. Document. 60, 5, 503--520.]]Google ScholarCross Ref
Robertson, S. E. 2005. On event spaces and probabilistic models in information retrieval. Inf. Retr. 8, 2, 319--329.]] Google ScholarDigital Library
Robertson, S. E. and Spärck Jones, K. 1976. Relevance weighting of search terms. J. Amer. Soc. Inf. Sci. 27, 3, 129--146.]]Google ScholarCross Ref
Robertson, S. E., van Rijsbergen, C. J., and Porter, M. F. 1981. Probabilistic models of indexing and searching. In Information Retrieval Research, R. N. Oddy et al., Eds. Butterworths, 35--56.]] Google ScholarDigital Library
Robertson, S. E. and Walker, S. 1994. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, W. B. Croft and C. J. van Rijsbergen, Eds. ACM, New York, 232--241.]] Google ScholarDigital Library
Robertson, S. E. and Walker, S. 1997. On relevance weights with little relevance information. In Proceedings of the 20th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Philadelphia, PA, F. Can et al., Eds. ACM, New York, 16--24.]] Google ScholarDigital Library
Robertson, S. E., Walker, S., and Hancock-Beaulieu, M. M. 1995. Large test collection experiments on an operational, interactive system: Okapi at TREC. Inf. Process. Manage. 31, 3, 345--360.]] Google ScholarDigital Library
Roelleke, T. 2003. A frequency-based and a Poisson-based definition of probability of being informative. In Proceedings of the 26th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, C. Clarke et al., Eds. ACM, New York, 227--234.]] Google ScholarDigital Library
Roelleke, T. and Wang, J. 2006. A parallel derivation of probabilistic information retrieval models. In Proceedings of the 29th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, WA, S. Dumais, et al., Eds. ACM, New York, 107--114.]] Google ScholarDigital Library
Salton, G. and Buckley, C. 1988. Term weighting approaches in automatic text retrieval. Inf. Process. Manage. 24, 5, 513--523.]] Google ScholarDigital Library
Salton, G., Fox, E. A., and Wu, H. 1983. Extended Boolean information retrieval. Commun. ACM 26, 11, 1022--1036.]] Google ScholarDigital Library
Salton, G., Wong, A., and Yang, C. S. 1975. A vector space model for information retrieval. J. Amer. Soc. Inf. Sci. 18, 11, 613--620.]] Google ScholarDigital Library
Spärck Jones, K. 1972. Exhaustivity and specificity. J. Document. 28, 11--21.]]Google ScholarCross Ref
Spärck Jones, K. 2004. IDF term weighting and IR research lessons. J. Document. 60, 521--523.]]Google ScholarCross Ref
Spärck Jones, K., Walker, S., and Robertson, S. E. 2000. A probabilistic model of information retrieval: Development and comparative experiments: Part 2. Inf. Process. Manage. 36, 6, 809--840.]] Google ScholarDigital Library
Trotman, A. and Geva, S. 2006. Passage retrieval and other XML-retrieval tasks. In Proceedings of the ACM SIGIR Workshop on XML Element Retrieval Methodology, Seattle, WA, A. Trotman and S. Geva, Eds., 43--50.]]Google Scholar
van Rijsbergen, C. J. 1975. Information Retrieval. Butterworths, London.]] Google ScholarDigital Library
Vechtomova, O., Karamuftuoglu, M., and Robertson, S. E. 2006. On document relevance and lexical cohesion between query terms. Inf. Process. Manage. 24, 5, 1230--1247.]] Google ScholarDigital Library
Waller, W. G. and Kraft, D. H. 1979. A mathematical model of a weighted Boolean retrieval system. Inf. Process. Manage. 15, 5, 235--245.]]Google ScholarCross Ref
Wong, A. K. C. and Ghahraman, D. 1975. A statistical analysis of interdependence in character sequences. Inf. Sci. 8, 2, 173--188.]]Google ScholarCross Ref
Wong, S. K. M., Ziarko, W., Raghavan, V. V., and Wong, P. C. N. 1986. On extending the vector space model for Boolean query processing. In Proceedings of the 9th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Palazzo dei Congressi, Pisa, Italy, F. Rabitti, Ed. ACM, New York, 175--185.]] Google ScholarDigital Library
Wu, H. C., Luk, R. W. P., Wong, K. F., and Kwok, K. L. 2005. A retrospective study of probabilistic context-based retrieval. In Proceedings of the 28th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, G. Marchionini et al., Eds. ACM, New York, 663--664.]] Google ScholarDigital Library
Wu, H. C., Luk, R. W. P., Wong, K. F., and Kwok, K. L. 2006. Probabilistic document-context based relevance feedback with limited relevance judgment. In Proceedings of the 15th ACM Conference on Information and Knowledge Management, Arlington, VA, P. S. Yu et al., Eds. ACM, New York, 854--855.]] Google ScholarDigital Library
Wu, H. C., Luk, R. W. P., Wong, K. F., and Kwok, K. L. 2007. A retrospective study of a hybrid document-context based retrieval model. Inf. Process. Manage. 43, 5, 1308--1331.]] Google ScholarDigital Library
Xu, J. and Croft, W. B. 2000. Improving the effectiveness of information retrieval using local context analysis. ACM Trans. Inf. Syst, 18, 1, 79--112.]] Google ScholarDigital Library
Yao, Y. Y. and Wong, S. K. M. 1991. Preference structure, inference and set-oriented retrieval. In Proceedings of the 14th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Chicago, IL, E. Fox, Ed. ACM, New York, 211--218.]] Google ScholarDigital Library
Yu, C. T. and Salton, G. 1976. Precision weighting—An effective automatic indexing method. J. ACM 23, 1, 76--88.]] Google ScholarDigital Library
Zahn, C. T. 1971. Graph-Theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput. 20, 1, 68--86.]] Google ScholarDigital Library
Zhai, C. X. and Lafferty, J. 2003. A risk minimization framework for information retrieval. In Proceedings of the ACM SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval.]]Google Scholar
Zhai, C. X. and Lafferty, J. 2004. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst. 22, 2, 179--214.]] Google ScholarDigital Library

Index Terms

Interpreting TF-IDF term weights as making relevance decisions
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Recommendations

Document-based and term-based linear methods for pseudo-relevance feedback

Query expansion is a successful approach for improving Information Retrieval effectiveness. This work focuses on pseudo-relevance feedback (PRF) which provides an automatic method for expanding queries without explicit user feedback. These techniques ...
Read More
TAG term weight-based N gram Thesaurus generation for query expansion in information retrieval application

Query expansion is an important task in information retrieval applications that improves the user query and helps in retrieving the relevant documents. In this paper, N gram Thesaurus is constructed from the documents for query expansion. The HTML TAGs ...
Read More
Term Relevance Weights in On-Line Information Retrieval
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Information Systems Volume 26, Issue 3
June 2008
236 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/1361684
Issue’s Table of Contents

Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 20 June 2008
- Accepted: 1 September 2007
- Revised: 1 July 2007
- Received: 1 August 2004
Published in tois Volume 26, Issue 3

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Information retrieval
relevance decision
term weight
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 529
  Total Citations
  View Citations
- 5,325
  Total Downloads
- Downloads (Last 12 months)314
- Downloads (Last 6 weeks)66
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Interpreting TF-IDF term weights as making relevance decisions

ACM Transactions on Information Systems

Abstract

References

Cited By

Index Terms

Recommendations

Document-based and term-based linear methods for pseudo-relevance feedback

TAG term weight-based N gram Thesaurus generation for query expansion in information retrieval application

Term Relevance Weights in On-Line Information Retrieval

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Interpreting TF-IDF term weights as making relevance decisions

ACM Transactions on Information Systems

Abstract

References

Cited By

Index Terms

Recommendations

Document-based and term-based linear methods for pseudo-relevance feedback

TAG term weight-based N gram Thesaurus generation for query expansion in information retrieval application

Term Relevance Weights in On-Line Information Retrieval

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media