Discover Computing

Publication date: 01-02-2014

Latent word context model for information retrieval

Authors: Bernard Brosseau-Villeneuve, Jian-Yun Nie, Noriko Kando

Published in: Discover Computing | Issue 1/2014

Abstract

The application of word sense disambiguation (WSD) techniques to information retrieval (IR) has yet to provide convincing retrieval results. Major obstacles to effective WSD in IR include the coverage and granularity problems of word sense inventories, the sparsity of document context, and the limited information provided by short queries. To alleviate these issues, we propose constructing latent context models for terms using latent Dirichlet allocation. We build one latent context model per word, using a well-principled representation of local context based on word features. In particular, context words are weighted by a decaying function of their distance to the target word, and this function is learnt from data in an unsupervised manner. The resulting latent features are used to discriminate word contexts, so as to constrain the query’s semantic scope. Consistent and substantial improvements, including on difficult queries, are observed on TREC test collections, and the technique combines well with blind relevance feedback. Compared to traditional topic modeling, WSD, and positional indexing techniques, the proposed retrieval model is more effective and scales well to large collections.
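To illustrate the distance-based weighting of context words described in the abstract, here is a minimal Python sketch. It is not the authors' method: the paper learns the decay function from data, whereas this example assumes a fixed exponential decay. The function name `weighted_context` and the parameters `decay_rate` and `window` are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def weighted_context(tokens, target_index, decay_rate=0.5, window=5):
    """Return {context_word: weight} for the token at target_index.

    ASSUMPTION: the weight decays exponentially with distance,
    exp(-decay_rate * (distance - 1)). The paper learns its decay
    function from data; this fixed shape is only a stand-in.
    """
    weights = defaultdict(float)
    start = max(0, target_index - window)
    end = min(len(tokens), target_index + window + 1)
    for i in range(start, end):
        if i == target_index:
            continue  # the target word is not part of its own context
        distance = abs(i - target_index)
        # Repeated context words accumulate their weights.
        weights[tokens[i]] += math.exp(-decay_rate * (distance - 1))
    return dict(weights)


tokens = "the bank raised interest rates on savings accounts".split()
ctx = weighted_context(tokens, tokens.index("bank"))
for word, weight in sorted(ctx.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{weight:.3f}")
```

Words adjacent to the target receive the largest weight and more distant words contribute progressively less. Weighted context vectors of this kind could then serve as the pseudo-documents on which a per-word topic model is trained.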

http://www.cs.princeton.edu/~blei/lda-c.

http://www.lemurproject.org/indri.

http://sourceforge.net/projects/latentcontext.

http://ciir.cs.umass.edu/~metzler/dm.pl.

3 keywords × 800 topics × (1 addition + 1 multiplication per topic) = 4,800 operations.

A query Q = {q_1, q_2, q_3} made of three content words yields one target word feature and two context word features per keyword, plus four “no stop word” stop word features (at q_{1,right}, q_{2,left}, q_{2,right}, q_{3,left}).

13 features × 10 topics × (1 addition + 1 multiplication per topic) = 260 operations.
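The arithmetic in these footnotes can be checked with a short Python sketch. The function names are hypothetical, and the feature breakdown simply restates the three-keyword example above rather than a general formula from the paper.

```python
def scoring_operations(num_features, num_topics, ops_per_topic=2):
    """Count operations as features x topics x (1 addition + 1 multiplication)."""
    return num_features * num_topics * ops_per_topic


def features_for_three_word_query():
    """Feature count for the three-keyword query itemised in the footnote."""
    keywords = 3
    target_features = keywords * 1   # one target word feature per keyword
    context_features = keywords * 2  # two context word features per keyword
    stop_word_features = 4           # the four "no stop word" features listed
    return target_features + context_features + stop_word_features


num_features = features_for_three_word_query()
print(num_features)                          # 13
print(scoring_operations(3, 800))            # 4800
print(scoring_operations(num_features, 10))  # 260
```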

Title: Latent word context model for information retrieval
Authors: Bernard Brosseau-Villeneuve, Jian-Yun Nie, Noriko Kando
Publication date: 01-02-2014
Publisher: Springer Netherlands
Published in: Discover Computing / Issue 1/2014
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI: https://doi.org/10.1007/s10791-013-9220-9
