ABSTRACT
A fundamental task in Information Retrieval (IR) is term weighting. Early IR theory considered both the presence and the absence of every term in the lexicon for ranking, and therefore needed to weight them all. Yet, as lexicons grew and models became too complex, common weighting models came to aggregate only the weights of the query terms that are matched in candidate documents. Thus, the contribution of unmatched terms in these models is considered only indirectly, such as in probability smoothing with the corpus distribution or in weight normalization by document length. In this work we propose a novel term weighting model that directly assesses the weights of unmatched terms, and show its benefits. Specifically, we propose a Learning To Rank framework in which features corresponding to matched terms are "mirrored" by similar features that account only for unmatched terms. The relative importance of each feature is learned from a clickthrough query log. As a test case, we consider vertical search in Community-based Question Answering (CQA) sites from Web queries. Queries that result in viewing CQA content often contain fine-grained information needs and benefit more from unmatched term weighting. We assess our model both via manual evaluation and via automatic evaluation over a clickthrough log. Our results show consistent improvement in retrieval when unmatched information is taken into account. This holds both when only identical terms are considered matched and when related terms are matched via distributional similarity.
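The "mirroring" idea can be illustrated with a minimal Python sketch. It is not the paper's feature set: the abstract does not list the actual features, so the IDF variant, the feature names, and the helper `mirrored_features` are all assumptions made for illustration. Each feature computed over matched terms gets a counterpart computed over unmatched terms, and a learning-to-rank model would then learn a weight for every feature.

```python
import math


def idf(term, doc_freq, num_docs):
    """Smoothed inverse document frequency (one common variant, assumed here)."""
    return math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1))


def mirrored_features(query_terms, doc_terms, doc_freq, num_docs):
    """Hypothetical feature extractor: every matched-term feature is
    mirrored by the same statistic over unmatched terms."""
    query, doc = set(query_terms), set(doc_terms)
    matched = query & doc            # terms shared by query and document
    unmatched_query = query - doc    # query terms the document lacks
    unmatched_doc = doc - query      # document terms the query never asked for

    def idf_sum(terms):
        return sum(idf(t, doc_freq, num_docs) for t in terms)

    return {
        "matched_count": len(matched),
        "matched_idf_sum": idf_sum(matched),
        # Mirrored features over unmatched terms (illustrative assumption).
        "unmatched_query_count": len(unmatched_query),
        "unmatched_query_idf_sum": idf_sum(unmatched_query),
        "unmatched_doc_count": len(unmatched_doc),
        "unmatched_doc_idf_sum": idf_sum(unmatched_doc),
    }
```

In this sketch exact string equality decides what counts as matched. The variant in which related terms also match via distributional similarity would replace the set intersection with a similarity test, which is not shown here.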
Index Terms
- That's Not My Question: Learning to Weight Unmatched Terms in CQA Vertical Search