DOI: 10.1145/1964858.1964870
research-article

Empirical study of topic modeling in Twitter

Published: 25 July 2010

ABSTRACT

Social networks such as Facebook, LinkedIn, and Twitter have become a crucial source of information for a wide spectrum of users. In Twitter, popular information that is deemed important by the community propagates through the network. Studying the characteristics of message content is therefore important for a number of tasks, such as breaking news detection, personalized message recommendation, friend recommendation, and sentiment analysis. While many researchers wish to use standard text mining tools to understand messages on Twitter, the restricted length of those messages prevents these tools from being employed to their full potential.

We address the problem of using standard topic models in micro-blogging environments by studying how the models can be trained on the dataset. We propose several schemes to train a standard topic model and compare their quality and effectiveness through a set of carefully designed experiments from both qualitative and quantitative perspectives. We show that training a topic model on aggregated messages yields a higher-quality learned model, which results in significantly better performance on two real-world classification problems. We also discuss how the state-of-the-art Author-Topic model fails to model hierarchical relationships between entities in social media.
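
The abstract does not spell out the aggregation schemes, so the following is only a minimal sketch of one plausible form of message aggregation: pooling all messages by the same author into a single pseudo-document before a standard topic model is trained on it. The function name `aggregate_by_author`, the `(author, text)` input format, and the sample messages are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def aggregate_by_author(messages):
    """Pool short messages into one pseudo-document per author.

    `messages` is an iterable of (author, text) pairs. This input shape is
    an assumption for illustration, not a format taken from the paper.
    """
    pooled = defaultdict(list)
    for author, text in messages:
        pooled[author].append(text)
    # Join each author's messages, in arrival order, into one longer document.
    return {author: " ".join(texts) for author, texts in pooled.items()}


# Hypothetical sample data: four short messages from two authors.
messages = [
    ("alice", "new phone battery lasts all day"),
    ("bob", "great match tonight"),
    ("alice", "camera on this phone is sharp"),
    ("bob", "our striker scored twice"),
]

docs = aggregate_by_author(messages)
for author, doc in docs.items():
    print(author, "->", doc)
```

Each pooled document contains more word co-occurrences than any single message, which is the property a standard topic model relies on. The resulting documents could then be passed to any off-the-shelf topic model implementation.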

    • Published in

      SOMA '10: Proceedings of the First Workshop on Social Media Analytics
      July 2010
      145 pages
      ISBN: 9781450302173
      DOI: 10.1145/1964858

      Copyright © 2010 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States