Q2C@UST: our winning solution to query classification in KDDCUP 2005

Published: 01 December 2005

Abstract

In this paper, we describe our ensemble-search-based approach, Q2C@UST (http://webprojectl.cs.ust.hk/q2c/), to the query classification task of KDDCUP 2005. The problem poses two key difficulties. First, the meaning of the queries and the semantics of the predefined categories are hard to determine. Second, there are no training data for this classification problem. We apply a two-phase framework to tackle these difficulties. Phase I corresponds to the training phase in machine learning research, and phase II corresponds to the testing phase. In phase I, two kinds of base classifiers are developed: one is synonym-based and the other is statistics-based. Phase II consists of two stages. In the first stage, the queries are enriched: for each query, its related Web pages, together with their category information, are collected through search engines. In the second stage, the enriched queries are classified by the base classifiers trained in phase I. Based on the classification results of the base classifiers, two ensemble classifiers built on two different strategies are proposed. The experimental results on the validation dataset help confirm our conjectures about the performance of the Q2C@UST system. In addition, the evaluation results given by the KDDCUP 2005 organizer confirm the effectiveness of our proposed approaches. The best F1 value of our two solutions is 9.6% higher than the best of all other participants' solutions. The average F1 value of our two submitted solutions is 94.4% higher than the average F1 value of all other submitted solutions.
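The abstract does not spell out the two ensemble strategies, so the following is only a minimal sketch of one generic way to combine base-classifier outputs, not the authors' method. It assumes each base classifier returns a score per category for an enriched query, and it uses a weighted sum of normalized scores. The function names, the category labels, the weights, and the `top_k` cutoff are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def normalize(scores):
    """Scale a {category: score} dict so that its values sum to 1."""
    total = sum(scores.values())
    if total == 0:
        return {}
    return {cat: s / total for cat, s in scores.items()}


def combine(base_outputs, weights, top_k=5):
    """Weighted sum of normalized base-classifier scores.

    base_outputs: {classifier_name: {category: score}} (assumed shape)
    weights:      {classifier_name: weight}            (hypothetical values)
    Returns the top_k categories, highest combined score first.
    """
    combined = defaultdict(float)
    for name, scores in base_outputs.items():
        for cat, s in normalize(scores).items():
            combined[cat] += weights.get(name, 0.0) * s
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
    return [cat for cat, _ in ranked[:top_k]]


# Hypothetical scores for one enriched query from the two kinds of base classifiers.
base_outputs = {
    "synonym": {"Computers/Hardware": 3.0, "Shopping/Stores": 1.0},
    "statistical": {"Computers/Hardware": 0.5, "Computers/Software": 0.5},
}
weights = {"synonym": 0.5, "statistical": 0.5}
result = combine(base_outputs, weights, top_k=2)
print(result)
```

In a sketch like this, the weights could be tuned on a validation set; equal weights are used here only to keep the example small.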
