Published in: World Wide Web 2/2018

23.06.2017

A topic model for co-occurring normal documents and short texts

Authors: Yang Yang, Feifei Wang, Junni Zhang, Jin Xu, Philip S. Yu

Abstract

User comments, as a large group of online short texts, are becoming increasingly prevalent with the development of online communications. These short texts are characterized by their co-occurrence with normal documents, which are usually longer. For example, there could be multiple user comments following one news article, or multiple reader reviews following one blog post. The co-occurring structure inherent in such text corpora is important for efficient learning of topics, but is rarely captured by conventional topic models. To capture this structure, we propose a topic model for co-occurring documents, referred to as COTM. In COTM, we assume there are two sets of topics: formal topics and informal topics, where formal topics can appear in both normal documents and short texts, whereas informal topics can appear only in short texts. Each normal document has a probability distribution over the set of formal topics; each short text is composed of two topics, one from the set of formal topics, whose selection is governed by the topic probabilities of the corresponding normal document, and the other from the set of informal topics. We also develop an online algorithm for COTM to deal with large-scale corpora. Extensive experiments on real-world datasets demonstrate that COTM and its online algorithm outperform state-of-the-art methods by discovering more prominent, coherent and comprehensive topics.
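The generative story described in the abstract can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the authors' implementation: the abstract does not give the priors, how the informal topic of a short text is chosen, or how words within a short text are split between its two topics. The symmetric Dirichlet priors, the corpus-level distribution `informal_prior`, the fixed 50/50 word-level switch, and all function and parameter names here are hypothetical choices made for illustration.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index according to the categorical distribution `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(n_docs=3, comments_per_doc=2, doc_len=20, comment_len=6,
                    n_formal=4, n_informal=2, vocab_size=30, seed=0):
    rng = random.Random(seed)
    # Each topic is a distribution over word ids (assumed Dirichlet prior).
    formal_topics = [sample_dirichlet(0.5, vocab_size, rng) for _ in range(n_formal)]
    informal_topics = [sample_dirichlet(0.5, vocab_size, rng) for _ in range(n_informal)]
    # Assumption: informal topics are drawn from one corpus-level distribution.
    informal_prior = sample_dirichlet(1.0, n_informal, rng)

    corpus = []
    for _ in range(n_docs):
        # Each normal document has its own distribution over formal topics.
        theta = sample_dirichlet(0.5, n_formal, rng)
        doc_words = [sample_index(formal_topics[sample_index(theta, rng)], rng)
                     for _ in range(doc_len)]

        comments = []
        for _ in range(comments_per_doc):
            # One formal topic, governed by the parent document's theta.
            formal = sample_index(theta, rng)
            # One informal topic, which never appears in normal documents.
            informal = sample_index(informal_prior, rng)
            words = []
            for _ in range(comment_len):
                # Assumption: each word picks one of the two topics with probability 0.5.
                topic = formal_topics[formal] if rng.random() < 0.5 else informal_topics[informal]
                words.append(sample_index(topic, rng))
            comments.append({"formal": formal, "informal": informal, "words": words})

        corpus.append({"theta": theta, "words": doc_words, "comments": comments})
    return corpus
```

Calling `generate_corpus(seed=1)` returns a list of normal documents, each carrying its formal-topic distribution, its word ids, and its attached short texts. The point of the sketch is the dependency structure: the formal topic of a short text is sampled from the `theta` of the document it follows, which is how the co-occurrence between the two kinds of text enters the model.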

Metadata
Title
A topic model for co-occurring normal documents and short texts
Authors
Yang Yang
Feifei Wang
Junni Zhang
Jin Xu
Philip S. Yu
Publication date
23.06.2017
Publisher
Springer US
Published in
World Wide Web / Issue 2/2018
Print ISSN: 1386-145X
Electronic ISSN: 1573-1413
DOI
https://doi.org/10.1007/s11280-017-0467-8
