research-article

Risky business: modeling and exploiting uncertainty in information retrieval

Authors:
Jianhan Zhu, University College London, London, United Kingdom
Jun Wang, University College London, London, United Kingdom
Ingemar J. Cox, University College London, London, United Kingdom
Michael J. Taylor, Microsoft Research, Cambridge, United Kingdom

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, July 2009, Pages 99–106, https://doi.org/10.1145/1571941.1571961

Published: 19 July 2009


ABSTRACT

Most retrieval models estimate the relevance of each document to a query and rank the documents accordingly. However, such an approach ignores the uncertainty associated with the estimates of relevancy. If a high estimate of relevancy also has a high uncertainty, then the document may be very relevant or not relevant at all. Another document may have a slightly lower estimate of relevancy but the corresponding uncertainty may be much less. In such a circumstance, should the retrieval engine risk ranking the first document highest, or should it choose a more conservative (safer) strategy that gives preference to the second document? There is no definitive answer to this question, as it depends on the risk preferences of the user and the information retrieval system. In this paper we present a general framework for modeling uncertainty and introduce an asymmetric loss function with a single parameter that can model the level of risk the system is willing to accept. By adjusting the risk preference parameter, our approach can effectively adapt to users' different retrieval strategies.
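The two-document scenario above can be made concrete with a small sketch. The abstract does not state the form of the loss function, so the scoring rule used here (estimated relevance minus `b` times its variance) is only an illustrative stand-in, and the names `ScoredDoc`, `risk_adjusted_score`, `rank`, and the numeric values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScoredDoc:
    doc_id: str
    mean: float      # estimated relevance of the document
    variance: float  # uncertainty attached to that estimate


def risk_adjusted_score(doc: ScoredDoc, b: float) -> float:
    """Assumed mean-variance adjustment, not the paper's formula.

    b > 0 penalizes uncertainty (risk-averse), b < 0 rewards it
    (risk-seeking), and b = 0 ranks by the estimate alone.
    """
    return doc.mean - b * doc.variance


def rank(docs, b):
    """Return document ids ordered by decreasing risk-adjusted score."""
    ordered = sorted(docs, key=lambda d: risk_adjusted_score(d, b), reverse=True)
    return [d.doc_id for d in ordered]


# Document A: higher estimate, high uncertainty.
# Document B: slightly lower estimate, low uncertainty.
docs = [ScoredDoc("A", 0.80, 0.09), ScoredDoc("B", 0.75, 0.01)]

print(rank(docs, b=0.0))  # ranking by the estimate alone
print(rank(docs, b=1.0))  # ranking with a penalty on uncertainty
```

With `b = 0` document A stays on top; with `b = 1` the penalty on A's variance is large enough for B to overtake it, which is the conservative choice the abstract describes.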

We apply this asymmetric loss function to a language modeling framework and obtain a practical risk-aware document scoring function. Our experiments on several TREC collections show that our "risk-averse" approach significantly improves the Jelinek-Mercer smoothing language model, and that a combination of our "risk-averse" approach and the Jelinek-Mercer smoothing method generally outperforms the Dirichlet smoothing method. Experimental results also show that the "risk-averse" approach, even without smoothing from the collection statistics, performs as well as three commonly adopted retrieval models, namely the Jelinek-Mercer and Dirichlet smoothing methods and the BM25 model.
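For readers unfamiliar with the two smoothing baselines named above, the following is a minimal sketch of their standard textbook forms. It is not the paper's risk-aware scoring function. The function names, the toy document, the collection probabilities, and the values chosen for `lam` and `mu` are assumptions made for illustration.

```python
from collections import Counter


def jelinek_mercer(term, doc_tf, doc_len, p_coll, lam):
    """Linear interpolation of the document's maximum-likelihood
    estimate with the collection model, weighted by lam."""
    p_ml = doc_tf.get(term, 0) / doc_len
    return (1.0 - lam) * p_ml + lam * p_coll[term]


def dirichlet(term, doc_tf, doc_len, p_coll, mu):
    """Dirichlet-prior smoothing: the collection model contributes
    mu pseudo-counts, so longer documents are smoothed less."""
    return (doc_tf.get(term, 0) + mu * p_coll[term]) / (doc_len + mu)


# Toy document and collection probabilities (hypothetical values).
doc = "risk aware retrieval ranks documents under risk".split()
doc_tf = Counter(doc)
p_coll = {"risk": 0.01, "uncertainty": 0.002}

for term in ("risk", "uncertainty"):
    jm = jelinek_mercer(term, doc_tf, len(doc), p_coll, lam=0.5)
    dp = dirichlet(term, doc_tf, len(doc), p_coll, mu=10.0)
    print(term, round(jm, 6), round(dp, 6))
```

Both estimators assign a non-zero probability to "uncertainty" even though it never occurs in the toy document, which is the role that smoothing from collection statistics plays in these baselines.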


Index Terms

1. Information systems
  1. Information retrieval
    1. Document representation

Published in
SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
July 2009
896 pages
ISBN: 9781605584836
DOI: 10.1145/1571941
General Chairs: James Allan (University of Massachusetts Amherst, USA), Javed Aslam (Northeastern University, USA)
Program Chairs: Mark Sanderson (University of Sheffield, UK), ChengXiang Zhai (University of Illinois at Urbana-Champaign, USA), Justin Zobel (University of Melbourne, Australia)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
language models
loss functions
probabilistic retrieval models
probability ranking principle
risk adjustment
Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%