Published in: Discover Computing 1/2011

01.02.2011 | The Second International Conference on the Theory of Information Retrieval (ICTIR2009)

Retrieval constraints and word frequency distributions: a log-logistic model for IR

Authors: Stéphane Clinchant, Eric Gaussier

Abstract

We first present in this paper an analytical view of heuristic retrieval constraints which yields simple tests to determine whether a retrieval function satisfies the constraints or not. We then review empirical findings on word frequency distributions and the central role played by burstiness in this context. This leads us to propose a formal definition of burstiness which can be used to characterize probability distributions with respect to this phenomenon. We then introduce the family of information-based IR models which naturally captures heuristic retrieval constraints when the underlying probability distribution is bursty and propose a new IR model within this family, based on the log-logistic distribution. The experiments we conduct on several collections illustrate the good behavior of the log-logistic IR model: It significantly outperforms the Jelinek-Mercer and Dirichlet prior language models on most collections we have used, with both short and long queries and for both the MAP and the precision at 10 documents. It also compares favorably to BM25 and has similar performance to classical DFR models such as InL2 and PL2.
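The scoring formula itself is not given on this page, so the following is only a minimal sketch of what an information-based score with a log-logistic distribution could look like. It assumes a survival function of the form P(T >= t) = lam / (t + lam), a collection parameter lam set to the fraction of documents containing the term, and a DFR-style length normalization with a hypothetical smoothing parameter `c`. All function and parameter names are illustrative and are not taken from the article.

```python
import math


def log_logistic_survival(t, lam):
    """P(T >= t) for a log-logistic distribution with scale lam and shape 1 (assumed form)."""
    return lam / (t + lam)


def normalized_tf(tf, doc_len, avg_doc_len, c=1.0):
    """Assumed DFR-style length normalization: t = tf * log(1 + c * avg_len / doc_len)."""
    return tf * math.log(1.0 + c * avg_doc_len / doc_len)


def lgd_score(query_tf, doc_tf, doc_len, avg_doc_len, doc_freq, n_docs, c=1.0):
    """Sum, over query terms present in the document, of -q_w * log P(T >= t_w | lam_w)."""
    score = 0.0
    for term, q_count in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in doc_freq:
            continue  # terms absent from the document contribute nothing
        lam = doc_freq[term] / n_docs  # assumed estimate: share of documents containing the term
        t = normalized_tf(tf, doc_len, avg_doc_len, c)
        score += -q_count * math.log(log_logistic_survival(t, lam))
    return score


def conditional_survival(x, eps, lam):
    """P(T >= x + eps | T >= x); under this assumed form it grows with x, i.e. the distribution is bursty."""
    return log_logistic_survival(x + eps, lam) / log_logistic_survival(x, lam)
```

Under these assumptions the score grows with term frequency but with diminishing returns, rewards rarer terms, and penalizes longer documents, which is the kind of behavior the heuristic retrieval constraints ask for.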

Footnotes
1
A function of class \(C^2\) is a function whose second derivatives exist and are continuous.
2
Furthermore, as these variables are positive, the support of the distributions to be considered should be (or be included in) \([0, \infty)\).
3
The same applies to the binomial model, for which \(\frac{\partial^2 h(x,y,z,\omega)}{\partial x^2} > 0\). For the sake of clarity, we do not present this purely technical derivation here.
4
Due to relation 7, the chi-square statistic is the same for the BNB and the log-logistic distributions on the given intervals.
Metadata
Title: Retrieval constraints and word frequency distributions: a log-logistic model for IR
Authors: Stéphane Clinchant, Eric Gaussier
Publication date: 01.02.2011
Publisher: Springer Netherlands
Published in: Discover Computing, Issue 1/2011
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI: https://doi.org/10.1007/s10791-010-9143-7
