DOI: 10.5555/1613715.1613848

The linguistic structure of English web-search queries

Published: 25 October 2008

ABSTRACT

Web-search queries are known to be short, but little else is known about their structure. In this paper we investigate the applicability of part-of-speech tagging to typical English-language web search-engine queries and the potential value of these tags for improving search results. We begin by identifying a set of part-of-speech tags suitable for search queries and quantifying their occurrence. We find that proper nouns constitute 40% of query terms, and that proper nouns and nouns together constitute over 70% of query terms. We also show that the majority of queries are noun phrases, not unstructured collections of terms. We then use a set of queries manually labeled with these tags to train a Brill tagger and evaluate its performance. In addition, we investigate classification of search queries into grammatical classes based on the syntax of part-of-speech tag sequences. We also conduct preliminary experiments on the practical use of query-trained part-of-speech taggers in information-retrieval tasks. In particular, we show that part-of-speech information can be a significant feature in machine-learned search-result relevance. These experiments also cover the potential use of the tagger in selecting words for omission or substitution in query reformulation, actions which can improve recall. We conclude that a part-of-speech tagger trained on labeled corpora of queries significantly outperforms taggers based on traditional corpora, and that leveraging the unique linguistic structure of web-search queries can improve the search experience.
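As a rough illustration of classifying queries by the syntax of their part-of-speech tag sequences, the sketch below maps each tag to a one-letter code and tests the resulting string against a pattern. The tag names, the letter codes, the `NOUN_PHRASE` pattern, and the `classify` function are all hypothetical and chosen only for this example; they are not the tag set or grammar used in the paper.

```python
import re

# Hypothetical one-letter codes for an assumed tag set (not the paper's).
TAG_CODES = {
    "proper-noun": "P",
    "noun": "N",
    "adjective": "A",
    "determiner": "D",
    "number": "M",
    "verb": "V",
    "preposition": "I",
}

# Assumed toy pattern: optional determiner, any number of modifiers,
# then one or more nouns or proper nouns.
NOUN_PHRASE = re.compile(r"D?[AM]*[NP]+")


def classify(tags):
    """Label a tag sequence 'noun-phrase' if it matches the toy pattern, else 'other'."""
    # Unknown tags map to '?', which the pattern can never match.
    codes = "".join(TAG_CODES.get(tag, "?") for tag in tags)
    return "noun-phrase" if NOUN_PHRASE.fullmatch(codes) else "other"
```

Under these assumptions, a tagged query such as `["adjective", "noun"]` would be labeled a noun phrase, while `["verb", "noun"]` would fall into the catch-all class. A fuller grammar would need more classes and a tag set derived from real labeled queries.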

  • Published in

    EMNLP '08: Proceedings of the Conference on Empirical Methods in Natural Language Processing
    October 2008
    1129 pages

    Publisher

    Association for Computational Linguistics

    United States

    Qualifiers

    • research-article

    Acceptance Rates

    Overall Acceptance Rate: 73 of 234 submissions, 31%
