ABSTRACT
Social networks such as Facebook, LinkedIn, and Twitter have been a crucial source of information for a wide spectrum of users. In Twitter, popular information that is deemed important by the community propagates through the network. Studying the characteristics of content in the messages becomes important for a number of tasks, such as breaking news detection, personalized message recommendation, friends recommendation, sentiment analysis and others. While many researchers wish to use standard text mining tools to understand messages on Twitter, the restricted length of those messages prevents them from being employed to their full potential.
We address the problem of using standard topic models in micro-blogging environments by studying how the models can be trained on the dataset. We propose several schemes to train a standard topic model and compare their quality and effectiveness through a set of carefully designed experiments from both qualitative and quantitative perspectives. We show that by training a topic model on aggregated messages we can obtain a higher quality of learned model which results in significantly better performance in two real-world classification problems. We also discuss how the state-of-the-art Author-Topic model fails to model hierarchical relationships between entities in Social Media.
- D. Blei and J. Lafferty. Topic models. Text Mining: Theory and Applications, 2009.Google Scholar
- D. M. Blei and J. D. Mcauliffe. Supervised topic models. Advances in Neural Information Processing Systems 21, 2007.Google Scholar
- D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993--1022, 2003. Google ScholarDigital Library
- J. Chang, J. Boyd-Graber, and D. M. Blei. Connections between the lines: augmenting social networks with text. In KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 169--178, 2009. Google ScholarDigital Library
- T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101:5228--5235, 2004.Google ScholarCross Ref
- A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding microblogging usage and communities. In WebKDD/SNA-KDD '07: Proceedings of the 9th WebKDD and 1st SNA-KDD Workshop on Web mining and Social Network Analysis, pages 56--65, 2007. Google ScholarDigital Library
- B. Krishnamurthy, P. Gill, and M. Arlitt. A few chirps about twitter. In WOSP '08: Proceedings of the First Workshop on Online Social Networks, pages 19--24, 2008. Google ScholarDigital Library
- Y. Liu, A. Niculescu-Mizil, and W. Gryc. Topic-link lda: joint models of topic and author community. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 665--672. ACM, 2009. Google ScholarDigital Library
- C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. Google ScholarDigital Library
- A. McCallum, X. Wang, and N. Mohanty. Joint group and topic discovery from relations and text. In Statistical Network Analysis: Models, Issues and New Directions, volume 4503 of Lecture Notes in Computer Science, pages 28--44. Springer-Verlag, Berlin, Heidelberg, 2007. Google ScholarDigital Library
- R. M. Nallapati, A. Ahmed, E. P. Xing, and W. W. Cohen. Joint latent topic models for text and citations. In KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 542--550, 2008. Google ScholarDigital Library
- X.-H. Phan, L.-M. Nguyen, and S. Horiguchi. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In WWW '08: Proceedings of the 17th International Conference on World Wide Web, pages 91--100, 2008. Google ScholarDigital Library
- D. Ramage, S. Dumais, and D. Liebling. Characterizing microblogs with topic models. In International AAAI Conference on Weblogs and Social Media, 2010.Google Scholar
- D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled lda: a supervised topic model for credit attribution in multi-labeled corpora. In EMNLP '09: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 248--256. Association for Computational Linguistics, 2009. Google ScholarDigital Library
- M. Rosen-Zvi, C. Chemudugunta, T. Griffiths, P. Smyth, and M. Steyvers. Learning author-topic models from text corpora. ACM Transactions on Information Systems, 28(1):1--38, 2010. Google ScholarDigital Library
- M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In UAI '04: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 487--494, 2004. Google ScholarDigital Library
- M. Sahami and T. D. Heilman. A web-based kernel function for measuring the similarity of short text snippets. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 377--386, 2006. Google ScholarDigital Library
- H. M. Wallach, D. Mimno, and A. McCallum. Rethinking lda: Why priors matter. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, 2009.Google Scholar
- J. Weng, E.-P. Lim, J. Jiang, and Q. He. Twitterrank: finding topic-sensitive influential twitterers. In WSDM '10: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pages 261--270, 2010. Google ScholarDigital Library
- L. Yao, D. Mimno, and A. McCallum. Efficient methods for topic model inference on streaming document collections. In KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 937--946, 2009. Google ScholarDigital Library
- W.-T. Yih and C. Meek. Improving similarity measures for short segments of text. In AAAI'07: Proceedings of the 22nd National Conference on Artificial Intelligence, pages 1489--1494, 2007. Google ScholarDigital Library
- H. Zhang, C. L. Giles, H. C. Foley, and J. Yen. Probabilistic community discovery using hierarchical latent gaussian mixture model. In AAAI'07: Proceedings of the 22nd National Conference on Artificial Intelligence, pages 663--668, 2007. Google ScholarDigital Library
Index Terms
- Empirical study of topic modeling in Twitter
Recommendations
Pseudo-document simulation for comparing LDA, GSDMM and GPM topic models on short and sparse text using Twitter data
AbstractTopic models are a useful and popular method to find latent topics of documents. However, the short and sparse texts in social media micro-blogs such as Twitter are challenging for the most commonly used Latent Dirichlet Allocation (LDA) topic ...
Twitter Opinion Topic Model: Extracting Product Opinions from Tweets by Leveraging Hashtags and Sentiment Lexicon
CIKM '14: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge ManagementAspect-based opinion mining is widely applied to review data to aggregate or summarize opinions of a product, and the current state-of-the-art is achieved with Latent Dirichlet Allocation (LDA)-based model. Although social media data like tweets are ...
Using topic models for Twitter hashtag recommendation
WWW '13 Companion: Proceedings of the 22nd International Conference on World Wide WebSince the introduction of microblogging services, there has been a continuous growth of short-text social networking on the Internet. With the generation of large amounts of microposts, there is a need for effective categorization and search of the ...
Comments