Empirical study of topic modeling in Twitter

Authors: Liangjie Hong (Lehigh University, Bethlehem, PA), Brian D. Davison (Lehigh University, Bethlehem, PA)

SOMA '10: Proceedings of the First Workshop on Social Media Analytics, July 2010, Pages 80–88. https://doi.org/10.1145/1964858.1964870

Published: 25 July 2010

ABSTRACT

Social networks such as Facebook, LinkedIn, and Twitter have become a crucial source of information for a wide spectrum of users. In Twitter, popular information that the community deems important propagates through the network. Studying the characteristics of message content is therefore important for a number of tasks, such as breaking news detection, personalized message recommendation, friend recommendation, and sentiment analysis. While many researchers wish to use standard text mining tools to understand messages on Twitter, the restricted length of those messages prevents these tools from being employed to their full potential.

We address the problem of using standard topic models in micro-blogging environments by studying how the models can be trained on the dataset. We propose several schemes to train a standard topic model and compare their quality and effectiveness through a set of carefully designed experiments from both qualitative and quantitative perspectives. We show that training a topic model on aggregated messages yields a higher-quality learned model, which results in significantly better performance on two real-world classification problems. We also discuss how the state-of-the-art Author-Topic model fails to model hierarchical relationships between entities in social media.
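The idea of training on aggregated messages can be illustrated with a minimal sketch. The abstract does not say how messages are aggregated, so grouping by author is an assumption here, as are the function name `aggregate_by_author`, the `(author, text)` input shape, and the naive whitespace tokenizer. The resulting pseudo-documents would then be passed to whatever standard topic model implementation is in use.

```python
from collections import defaultdict


def aggregate_by_author(messages):
    """Merge each author's messages into one pseudo-document.

    `messages` is an iterable of (author, text) pairs. Both the input
    shape and this function name are assumptions for illustration.
    """
    grouped = defaultdict(list)
    for author, text in messages:
        # Lowercase and split on whitespace: a deliberately naive tokenizer.
        grouped[author].extend(text.lower().split())
    return dict(grouped)


# Toy data, invented for this example only.
messages = [
    ("alice", "Breaking news about the election"),
    ("bob", "New phone released today"),
    ("alice", "Election results are in"),
]

pseudo_docs = aggregate_by_author(messages)
for author, tokens in pseudo_docs.items():
    print(author, len(tokens), tokens)
```

Each pseudo-document is longer than any single message, which is the property the aggregation schemes rely on: a topic model sees more word co-occurrences per training document than it would with individual short messages.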

Index Terms

Empirical study of topic modeling in Twitter
1. Information systems
  1. Information retrieval

Published in
SOMA '10: Proceedings of the First Workshop on Social Media Analytics
July 2010
145 pages
ISBN:9781450302173
DOI:10.1145/1964858
Conference Chairs: Prem Melville (IBM Research), Jure Leskovec (Stanford University), Foster Provost (NYU Stern School of Business)
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Twitter
social media
topic models
Qualifiers
- research-article