ABSTRACT
Most existing approaches to text classification represent texts as vectors of words, the so-called "Bag-of-Words" model. This representation yields a very high-dimensional feature space and frequently suffers from surface mismatching. Short texts make these issues even more serious because they are brief and sparse. In this paper, we propose using "Bag-of-Concepts" for short text representation, aiming to avoid surface mismatching and to handle the synonymy and polysemy problems. Based on "Bag-of-Concepts," we propose a novel framework for lightweight short text classification applications. By leveraging a large taxonomy knowledge base, the framework learns a concept model for each category and conceptualizes a short text into a set of relevant concepts. A concept-based similarity mechanism then assigns the given short text to the most similar category. One advantage of this mechanism is that it facilitates short text ranking after classification, which is needed in many applications, such as query or ad recommendation. We demonstrate the use of the proposed framework through a real online application: Channel-based Query Recommendation. Experiments show that the framework can map queries to channels with a high degree of precision (avg. precision = 90.3%), which is critical for recommendation applications.
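The pipeline described in the abstract (learn a concept model per category, conceptualize a short text, classify by concept similarity) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the tiny `TAXONOMY` dictionary stands in for the large taxonomy knowledge base, the weights are invented, and cosine similarity is used only because the abstract does not name the similarity measure. It is not the authors' implementation.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy isA taxonomy: term -> {concept: weight}.
# The terms, concepts, and weights are invented for illustration.
TAXONOMY = {
    "python": {"programming language": 0.7, "snake": 0.3},
    "java": {"programming language": 0.8, "island": 0.2},
    "iphone": {"smartphone": 0.9, "product": 0.1},
    "galaxy": {"smartphone": 0.6, "astronomical object": 0.4},
    "cobra": {"snake": 1.0},
}


def conceptualize(text):
    """Map a short text to a bag of concepts by summing the weights of known terms."""
    bag = Counter()
    for term in text.lower().split():
        for concept, weight in TAXONOMY.get(term, {}).items():
            bag[concept] += weight
    return bag


def learn_concept_model(training_texts):
    """Build a category's concept model by aggregating its training texts' concepts."""
    model = Counter()
    for text in training_texts:
        model.update(conceptualize(text))  # Counter.update adds weights
    return model


def cosine(a, b):
    """Cosine similarity between two concept bags (assumed measure)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify(text, models):
    """Return (best_category, score) for a short text."""
    bag = conceptualize(text)
    scores = {category: cosine(bag, model) for category, model in models.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


def rank(texts, category, models):
    """Rank the texts assigned to a category by their similarity score, highest first."""
    scored = [(text, *classify(text, models)) for text in texts]
    in_category = [(text, score) for text, best, score in scored if best == category]
    return sorted(in_category, key=lambda pair: pair[1], reverse=True)
```

Because each text receives a similarity score rather than only a label, the same score can order the texts within a category, which is the ranking property the abstract highlights for query or ad recommendation.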
Index Terms
- Concept-based Short Text Classification and Ranking