Abstract
A novel probabilistic retrieval model is presented. It forms a basis to interpret the TF-IDF term weights as making relevance decisions. It simulates the local relevance decision-making for every location of a document, and combines all of these “local” relevance decisions as the “document-wide” relevance decision for the document. The significance of interpreting TF-IDF in this way is the potential to: (1) establish a unifying perspective about information retrieval as relevance decision-making; and (2) develop advanced TF-IDF-related term weights for future elaborate retrieval models. Our novel retrieval model is simplified to a basic ranking formula that directly corresponds to the TF-IDF term weights. In general, we show that the term-frequency factor of the ranking formula can be rendered into different term-frequency factors of existing retrieval systems. In the basic ranking formula, the remaining quantity - log p(r¯|t ∈ d) is interpreted as the probability of randomly picking a nonrelevant usage (denoted by r¯) of term t. Mathematically, we show that this quantity can be approximated by the inverse document-frequency (IDF). Empirically, we show that this quantity is related to IDF, using four reference TREC ad hoc retrieval data collections.
- Aizawa, A. 2003. An information-theoretic perspective of TF-IDF measures. Inf. Process. Manage. 39, 1, 45--65.]] Google ScholarDigital Library
- Amati, G. and van Rijsbergen, C. J. 1998. Semantic information retrieval. In Information Retrieval: Uncertainty and Logics, C. J. Van Rijsbergen et al., Eds. Kluwer Academic, 189--220.]]Google Scholar
- Amati, G. and van Rijsbergen, C. J. 2002. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20, 4, 357--389.]] Google ScholarDigital Library
- Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison-Wesley, New York.]] Google ScholarDigital Library
- Bodoff, D. and Robertson, S. E. 2004. A new unified probabilistic model. J. Amer. Soc. Inf. Sci. Technol. 55, 6, 471--487.]] Google ScholarDigital Library
- Bookstein, A. and Swanson, D. 1974. Probabilistic models for automatic indexing. J. Amer. Soc. Inf. Sci. 25, 312--318.]]Google ScholarCross Ref
- Calado, P., Ribeiro-Neto, B., Ziviani, N., Moura, E., and Silva, I. 2003. Local versus global link information in the Web. ACM Trans. Inf. Syst. 21, 1, 42--63.]] Google ScholarDigital Library
- Clarke, C. L. A. and Scholer, F. 2005. The 2005 terabyte track. In Proceedings of the 14th Text Retrieval Conference, Gaithersburg, MD, E. M. Voorhees and L. P. Buckland, Eds. National Institute of Standards and Technology.]]Google Scholar
- Clough, P., Sanderson, M., and Müller, H. 2004. The CLEF cross language image retrieval track (ImageCLEF) 2004. In Proceedings of 3rd International Conference on Image and Video Conference, Dublin, Ireland, P. Enser et al., Eds. Lecture Notes in Computer Science, vol. 3115. Springer, 243--251.]]Google Scholar
- Cooper, W. S., Chen, A., and Gey, F. C. 1993. Full text retrieval based on probabilistic equations with coefficients fitted by logistic regression. In Proceedings of the 2nd Text Retrieval Conference, Gaithersburg, MD, D. K. Harman, Ed. National Institute of Standards and Technology, 57--66.]]Google Scholar
- Cooper, W. S., Gey, F. C., and Dabney, D. P. 1992. Probabilistic retrieval based on staged logistic regression. In Proceedings of the 15th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Denmark, E. Fox et al., Eds. ACM, New York, 198--210.]] Google ScholarDigital Library
- Cooper, W. S. and Maron, M. E. 1978. Foundations of probabilistic and utility-theoretic indexing. J. ACM 25, 1, 67--80.]] Google ScholarDigital Library
- Crestani, F., Lalmas, M., van Rijsbergen, C. J., and Campbell, I. 1998. “Is this document relevant?… probably”: A survey of probabilistic models in information retrieval. ACM Comput. Surv. 30, 4, 528--552.]] Google ScholarDigital Library
- Croft, W. B. and Harper, D. J. 1979. Using probabilistic models of document retrieval without relevance information. J. Document. 35, 285--295.]]Google ScholarCross Ref
- Damerau, F. 1965. An experiment in automatic indexing. Amer. Document. 16, 283--289.]]Google ScholarCross Ref
- de Vries, A. P. and Roelleke, T. 2005. Relevance information: A loss of entropy but a gain for IDF? In Proceedings of the 28th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, G. Marchionini et al., Eds. ACM, New York, 282--289.]] Google ScholarDigital Library
- Dombi, J. 1982. A general class of fuzzy operators, the DeMorgan class of fuzzy operators and fuzziness measures induced by fuzzy operators. Fuzzy Sets Syst. 8, 149--163.]]Google ScholarCross Ref
- Dyckhoff, H. and Pedrycz, W. 1984. Generalized means as a model of compensative connectives. Fuzzy Sets Syst. 14, 143--154.]]Google ScholarCross Ref
- Feller, W. 1968. An Introduction to Probability Theory and Its Applications, Vol. 1, 3rd ed. Wiley, New York.]]Google Scholar
- French, S. 1986. Decision Theory:An Introduction to the Mathematics of Rationality. Ellis Horwood, Chichester, UK.]] Google ScholarDigital Library
- Fuhr, N. 1989. Models for retrieval with probabilistic indexing. Inf. Process. Manage. 25, 1, 55--72.]] Google ScholarDigital Library
- Gale, W. A., Church, K. W., and Yarowsky, D. 1992. Work on statistical methods for word sense disambiguation. In Working Notes of the AAAI Fall Symposium Series, Probabilistic Approaches to Natural Language, Cambridge, MA, 54--60.]]Google Scholar
- Harman, D. 2004. Personal communication at NTCIR-4.]]Google Scholar
- Harter, S. P. 1974. A probabilistic approach to automatic keyword indexing. Ph.D. thesis, Graduate Library, The University of Chicago, Thesis no. T25146.]]Google Scholar
- Harter, S. P. 1975a. A probabilistic approach to automatic keyword indexing. Part I: On the distribution of specialty words in a technical literature. J. Amer. Soc. Inf. Sci. 26, 4, 197--206.]]Google ScholarCross Ref
- Harter, S. P. 1975b. A probabilistic approach to automatic keyword indexing. Part II: An algorithm for probabilistic indexing. J. Amer. Soc. Inf. Sci. 26, 4, 280--289.]]Google ScholarCross Ref
- Hiemstra, D. 1998. A linguistically motivated probabilistic model of information retrieval. In Proceedings of the 2nd European Conference on Research and Advanced Technology for Digital Libraries, Heraklion, Crete, Greece, C. Nikolaou and C. Stephanidis, Eds. Springer, London, 569--584.]] Google ScholarDigital Library
- Huang, X., Peng, F., Shuurmans, D., Cercone, N., and Robertson, S. E. 2003. Applying machine learning to text segmentation for information retrieval. Inf. Retr. 6, 3--4, 333--362.]] Google ScholarDigital Library
- Hung, K. Y., Luk, R. W. P., Yeung, D. S., Chung, K. F. L., and Shu, W. H. 2001. Determination of context window size. Int. J. Comput. Process. Orient. Lang. 14, 1, 71--80.]]Google ScholarCross Ref
- Joachims, T. 1997. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, D. H. Fisher, Ed. Morgan Kaufmann, San Francisco, CA, 143--151.]] Google ScholarDigital Library
- Kaszkiel, M., Zobel, J., and Sacks-Davis, R. 1999. Efficient passage ranking for document databases. ACM Trans. Inf. Syst. 17, 4, 406--439.]] Google ScholarDigital Library
- Klir, G. J. and Folger, T. A. 1988. Fuzzy Sets, Uncertainty, and Information. Prentice-Hall, NJ.]] Google ScholarDigital Library
- Kwok, K. L. 1995. A network approach to probabilistic information retrieval. ACM Trans. Inf. Syst. 13, 3, 324--353.]] Google ScholarDigital Library
- Lau, K. Y. K. and Luk, R. W. P. 1999. Word-Sense classification by hierarchical clustering. J. Chinese Lang. Comput. 9, 1, 101--121.]]Google Scholar
- Lavrenko, V. and Croft, W. B. 2001. Relevance-Based language model. In Proceedings of the 24th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, LA, D. H. Kraft et al., Eds. ACM, New York, 120--127.]] Google ScholarDigital Library
- Lavrenko, V. and Croft, W. B. 2003. Relevance models in information retrieval. In Language Modeling for Information Retrieval, W. B. Croft and J. Lafferty, Eds. Kluwer Academic, Chapter 2.]]Google Scholar
- Lee, L. 2007. IDF revisited: A simple new derivation within the Robertson-Spärck Jones probabilistic model. In Proceedings of the 30th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, C. L. A. Clarke et al., Eds. ACM, New York, 751--752.]] Google ScholarDigital Library
- Liu, X. and Croft, W. B. 2002. Passage retrieval based on language models. In Proceedings of the 11th ACM Conference on Information and Knowledge Management, Mclean, VA, C. Nicholas et al., Eds. ACM, New York, 375--382.]] Google ScholarDigital Library
- Lucassen, J. M. and Mercer, R. L. 1984. An information-theoretic approach to automatic determination of phonemic baseforms. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, San Diego, CA. IEEE, 304--307.]]Google Scholar
- Luhn, H. 1958. The automatic creation of literature abstracts. IBM J. Res. Devel. 2, 2, 159--165.]]Google ScholarDigital Library
- Margulis, E. L. 1992. N-Poisson document modelling. In Proceedings of the 15th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Denmark, E. Fox et al. Eds. ACM, New York, 177--189.]] Google ScholarDigital Library
- Maron, M. E. and Kuhns, J. L. 1960. On relevance, probabilistic indexing and information retrieval. J. ACM 25, 3, 216--244.]] Google ScholarDigital Library
- Paice, C. D. 1984. Soft evaluation of Boolean search queries in information retrieval systems. Inf. Technol. Res. Devel. Appl. 3, 1, 33--41.]] Google ScholarDigital Library
- Papineni, K. 2001. Why inverse document frequency? In Proceedings of the 2nd Meeting of the North American Chapter of The Association for Computational Linguistics, Pittsburgh, PA, L. Levin et al., Eds. Association for Computational Linguistics, Morristown, NJ, 25--32.]] Google ScholarDigital Library
- Pickens, J. and Macfarlane, A. 2006. Term context models for information retrieval. In Proceedings of the 15th ACM Conference on Information and Knowledge Management, Arlington, VA, E. Fox et al., Eds. ACM, New York, 559--566.]] Google ScholarDigital Library
- Ponte, J. M. and Croft, W. B. 1998. A language modeling approach in information retrieval. In Proceedings of the 21st Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, W. B. Croft et al., Eds. ACM, New York, 275--281.]] Google ScholarDigital Library
- Porter, M. 1980. An algorithm for suffix stripping. Program 14, 3, 130--137.]]Google ScholarCross Ref
- Robertson, S. E. 1977. The probability ranking principle in IR. J. Document. 33, 4, 294--304.]]Google ScholarCross Ref
- Robertson, S. E. 1997. Overview of the Okapi projects. J. Document. 53, 1, 3--7.]]Google ScholarCross Ref
- Robertson, S. E. 2004. Understanding inverse document frequency: On theoretical arguments for IDF. J. Document. 60, 5, 503--520.]]Google ScholarCross Ref
- Robertson, S. E. 2005. On event spaces and probabilistic models in information retrieval. Inf. Retr. 8, 2, 319--329.]] Google ScholarDigital Library
- Robertson, S. E. and Spärck Jones, K. 1976. Relevance weighting of search terms. J. Amer. Soc. Inf. Sci. 27, 3, 129--146.]]Google ScholarCross Ref
- Robertson, S. E., van Rijsbergen, C. J., and Porter, M. F. 1981. Probabilistic models of indexing and searching. In Information Retrieval Research, R. N. Oddy et al., Eds. Butterworths, 35--56.]] Google ScholarDigital Library
- Robertson, S. E. and Walker, S. 1994. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, W. B. Croft and C. J. van Rijsbergen, Eds. ACM, New York, 232--241.]] Google ScholarDigital Library
- Robertson, S. E. and Walker, S. 1997. On relevance weights with little relevance information. In Proceedings of the 20th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Philadelphia, PA, F. Can et al., Eds. ACM, New York, 16--24.]] Google ScholarDigital Library
- Robertson, S. E., Walker, S., and Hancock-Beaulieu, M. M. 1995. Large test collection experiments on an operational, interactive system: Okapi at TREC. Inf. Process. Manage. 31, 3, 345--360.]] Google ScholarDigital Library
- Roelleke, T. 2003. A frequency-based and a Poisson-based definition of probability of being informative. In Proceedings of the 26th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, C. Clarke et al., Eds. ACM, New York, 227--234.]] Google ScholarDigital Library
- Roelleke, T. and Wang, J. 2006. A parallel derivation of probabilistic information retrieval models. In Proceedings of the 29th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, WA, S. Dumais, et al., Eds. ACM, New York, 107--114.]] Google ScholarDigital Library
- Salton, G. and Buckley, C. 1988. Term weighting approaches in automatic text retrieval. Inf. Process. Manage. 24, 5, 513--523.]] Google ScholarDigital Library
- Salton, G., Fox, E. A., and Wu, H. 1983. Extended Boolean information retrieval. Commun. ACM 26, 11, 1022--1036.]] Google ScholarDigital Library
- Salton, G., Wong, A., and Yang, C. S. 1975. A vector space model for information retrieval. J. Amer. Soc. Inf. Sci. 18, 11, 613--620.]] Google ScholarDigital Library
- Spärck Jones, K. 1972. Exhaustivity and specificity. J. Document. 28, 11--21.]]Google ScholarCross Ref
- Spärck Jones, K. 2004. IDF term weighting and IR research lessons. J. Document. 60, 521--523.]]Google ScholarCross Ref
- Spärck Jones, K., Walker, S., and Robertson, S. E. 2000. A probabilistic model of information retrieval: Development and comparative experiments: Part 2. Inf. Process. Manage. 36, 6, 809--840.]] Google ScholarDigital Library
- Trotman, A. and Geva, S. 2006. Passage retrieval and other XML-retrieval tasks. In Proceedings of the ACM SIGIR Workshop on XML Element Retrieval Methodology, Seattle, WA, A. Trotman and S. Geva, Eds., 43--50.]]Google Scholar
- van Rijsbergen, C. J. 1975. Information Retrieval. Butterworths, London.]] Google ScholarDigital Library
- Vechtomova, O., Karamuftuoglu, M., and Robertson, S. E. 2006. On document relevance and lexical cohesion between query terms. Inf. Process. Manage. 24, 5, 1230--1247.]] Google ScholarDigital Library
- Waller, W. G. and Kraft, D. H. 1979. A mathematical model of a weighted Boolean retrieval system. Inf. Process. Manage. 15, 5, 235--245.]]Google ScholarCross Ref
- Wong, A. K. C. and Ghahraman, D. 1975. A statistical analysis of interdependence in character sequences. Inf. Sci. 8, 2, 173--188.]]Google ScholarCross Ref
- Wong, S. K. M., Ziarko, W., Raghavan, V. V., and Wong, P. C. N. 1986. On extending the vector space model for Boolean query processing. In Proceedings of the 9th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Palazzo dei Congressi, Pisa, Italy, F. Rabitti, Ed. ACM, New York, 175--185.]] Google ScholarDigital Library
- Wu, H. C., Luk, R. W. P., Wong, K. F., and Kwok, K. L. 2005. A retrospective study of probabilistic context-based retrieval. In Proceedings of the 28th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, G. Marchionini et al., Eds. ACM, New York, 663--664.]] Google ScholarDigital Library
- Wu, H. C., Luk, R. W. P., Wong, K. F., and Kwok, K. L. 2006. Probabilistic document-context based relevance feedback with limited relevance judgment. In Proceedings of the 15th ACM Conference on Information and Knowledge Management, Arlington, VA, P. S. Yu et al., Eds. ACM, New York, 854--855.]] Google ScholarDigital Library
- Wu, H. C., Luk, R. W. P., Wong, K. F., and Kwok, K. L. 2007. A retrospective study of a hybrid document-context based retrieval model. Inf. Process. Manage. 43, 5, 1308--1331.]] Google ScholarDigital Library
- Xu, J. and Croft, W. B. 2000. Improving the effectiveness of information retrieval using local context analysis. ACM Trans. Inf. Syst, 18, 1, 79--112.]] Google ScholarDigital Library
- Yao, Y. Y. and Wong, S. K. M. 1991. Preference structure, inference and set-oriented retrieval. In Proceedings of the 14th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Chicago, IL, E. Fox, Ed. ACM, New York, 211--218.]] Google ScholarDigital Library
- Yu, C. T. and Salton, G. 1976. Precision weighting—An effective automatic indexing method. J. ACM 23, 1, 76--88.]] Google ScholarDigital Library
- Zahn, C. T. 1971. Graph-Theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput. 20, 1, 68--86.]] Google ScholarDigital Library
- Zhai, C. X. and Lafferty, J. 2003. A risk minimization framework for information retrieval. In Proceedings of the ACM SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval.]]Google Scholar
- Zhai, C. X. and Lafferty, J. 2004. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst. 22, 2, 179--214.]] Google ScholarDigital Library
Index Terms
- Interpreting TF-IDF term weights as making relevance decisions
Recommendations
Document-based and term-based linear methods for pseudo-relevance feedback
Query expansion is a successful approach for improving Information Retrieval effectiveness. This work focuses on pseudo-relevance feedback (PRF) which provides an automatic method for expanding queries without explicit user feedback. These techniques ...
TAG term weight-based N gram Thesaurus generation for query expansion in information retrieval application
Query expansion is an important task in information retrieval applications that improves the user query and helps in retrieving the relevant documents. In this paper, N gram Thesaurus is constructed from the documents for query expansion. The HTML TAGs ...
Comments