Published in: Knowledge and Information Systems 2/2016

01-08-2016 | Regular Paper

Word network topic model: a simple but general solution for short and imbalanced texts

Authors: Yuan Zuo, Jichang Zhao, Ke Xu

Abstract

Short text has become the prevalent format for information on the Internet, especially with the development of online social media. Although the sophisticated signals carried by short texts make them a promising source for topic modeling, their extreme sparsity and imbalance pose unprecedented challenges to conventional topic models such as LDA and its variants. Aiming at a simple but general solution for topic modeling in short texts, we present a word co-occurrence network-based model named WNTM that tackles sparsity and imbalance simultaneously. Unlike previous approaches, WNTM models the distribution over topics for each word instead of learning topics for each document, which enhances the semantic density of the data space without introducing much additional time or space complexity. Meanwhile, the rich contextual information preserved in the word–word space also ensures its sensitivity in identifying rare topics with convincing quality. Furthermore, because WNTM employs the same Gibbs sampling as LDA, it can easily be extended to various application scenarios. Extensive validation on both short and normal texts shows that WNTM outperforms the baseline methods. We also demonstrate its potential for precisely discovering newly emerging topics or unexpected events in Weibo at very early stages.
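
The abstract does not spell out how the word network is built, so the following is only a minimal sketch of the general idea, under stated assumptions: words that co-occur inside a sliding window are linked by a weighted edge, and each word is then represented by a "pseudo-document" made of its neighbours, to which an ordinary LDA Gibbs sampler could be applied. The function names `build_word_network` and `pseudo_documents`, the `window` parameter, and its default value are all hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_word_network(docs, window=3):
    """Count co-occurrences of distinct words inside a sliding window.

    `docs` is a list of token lists. Returns a Counter that maps an
    unordered word pair (stored as a sorted tuple) to its edge weight.
    The window size is an assumption made for this illustration.
    """
    edges = Counter()
    for tokens in docs:
        # A document shorter than the window is treated as one window.
        n_windows = max(1, len(tokens) - window + 1)
        for start in range(n_windows):
            span = sorted(set(tokens[start:start + window]))
            for u, v in combinations(span, 2):
                edges[(u, v)] += 1
    return edges


def pseudo_documents(edges):
    """Represent each word by its neighbours, repeated by edge weight."""
    pseudo = defaultdict(list)
    for (u, v), weight in edges.items():
        pseudo[u].extend([v] * weight)
        pseudo[v].extend([u] * weight)
    return dict(pseudo)


if __name__ == "__main__":
    toy_docs = [["apple", "banana", "cherry"], ["apple", "banana"]]
    network = build_word_network(toy_docs)
    for word, neighbours in pseudo_documents(network).items():
        print(word, neighbours)
```

Because every word pools context from all the documents it appears in, its pseudo-document is typically much longer than any single short text, which is one way to read the abstract's claim about increased semantic density.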

Metadata
Title
Word network topic model: a simple but general solution for short and imbalanced texts
Authors
Yuan Zuo
Jichang Zhao
Ke Xu
Publication date
01-08-2016
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 2/2016
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-015-0882-z
