ABSTRACT
In this paper, we introduce the family of information-based models for ad hoc information retrieval. These models draw their inspiration from a long-standing hypothesis in IR: the difference between a word's behavior at the document level and at the collection level carries information about the word's significance for the document. This hypothesis has been exploited in the 2-Poisson mixture models, in the notion of eliteness in BM25, and more recently in DFR (divergence from randomness) models. We show here that, combined with notions related to burstiness, it can lead to simpler and better models.
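The hypothesis can be made concrete with a small sketch. The abstract does not specify a formula, so everything below is an assumption chosen for illustration: a word's document-level behavior is summarized by a length-normalized term frequency `t`, its collection-level behavior by `lam` (the fraction of documents containing the word), and the "information" is the surprise `-log P(X >= t | lam)` under a log-logistic-style tail `lam / (t + lam)`. The function names, the parameter `c`, and the choice of distribution are hypothetical.

```python
import math


def normalized_tf(tf, doc_len, avg_doc_len, c=1.0):
    """Length-normalized term frequency (assumed form; c is a free parameter)."""
    return tf * math.log(1.0 + c * avg_doc_len / doc_len)


def information_score(query_terms, doc_tf, doc_len, avg_doc_len,
                      doc_freq, num_docs, c=1.0):
    """Sum, over query terms, of the surprise of the document-level frequency
    given the collection-level behavior of each term.

    Assumption: P(X >= t | lam) = lam / (t + lam), so the information
    -log P(X >= t | lam) equals log((t + lam) / lam).
    """
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        df = doc_freq.get(term, 0)
        if tf == 0 or df == 0:
            continue  # a term absent from the document adds no information
        lam = df / num_docs  # collection-level behavior of the term
        t = normalized_tf(tf, doc_len, avg_doc_len, c)  # document-level behavior
        score += math.log((t + lam) / lam)
    return score
```

Under these assumptions, a word that is rare in the collection but frequent in a document contributes a large score, while a word that is common everywhere contributes little, which is the intuition the hypothesis expresses.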