ABSTRACT
Current search engines do not, in general, perform well with longer, more verbose queries. One of the main issues in processing these queries is identifying the key concepts that will have the most impact on effectiveness. In this paper, we develop and evaluate a technique that uses query-dependent, corpus-dependent, and corpus-independent features for automatic extraction of key concepts from verbose queries. We show that our method achieves higher accuracy in the identification of key concepts than standard weighting methods such as inverse document frequency. Finally, we propose a probabilistic model for integrating the weighted key concepts identified by our method into a query, and demonstrate that this integration significantly improves retrieval effectiveness for a large set of natural language description queries derived from TREC topics on several newswire and web collections.
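As an illustration of the inverse document frequency baseline that the abstract compares against, the following is a minimal sketch that ranks the terms of a verbose query by IDF. It is not the paper's method: the toy corpus, the whitespace tokenization, the choice of idf = log(N / df), and the function names `idf_scores` and `rank_by_idf` are all assumptions made for this example.

```python
import math
from collections import Counter


def idf_scores(query_terms, documents):
    """Return {term: idf} with idf = log(N / df); terms unseen in the corpus are skipped.

    Hypothetical helper for illustration only, not taken from the paper.
    """
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        # Count each term once per document (document frequency).
        for term in set(doc.lower().split()):
            df[term] += 1
    return {t: math.log(n_docs / df[t]) for t in query_terms if df[t] > 0}


def rank_by_idf(query, documents):
    """Order the distinct query terms from highest to lowest IDF (ties broken alphabetically)."""
    terms = list(dict.fromkeys(query.lower().split()))
    scores = idf_scores(terms, documents)
    return sorted(scores, key=lambda t: (-scores[t], t))


# Toy corpus (an assumption, not data from the paper).
docs = [
    "the spotted owl is a threatened species",
    "the timber industry opposes the owl episode rules",
    "the weather is mild",
    "the new rules affect the industry",
]
ranked = rank_by_idf("the spotted owl episode", docs)
print(ranked)
```

In this toy setting the term "the" occurs in every document and therefore receives an IDF of zero, so it ranks last. A frequency-only weighting of this kind has no access to the query-dependent features the abstract describes, which is the gap the proposed technique aims to address.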
Index Terms
- Discovering key concepts in verbose queries