Published in: Knowledge and Information Systems, Issue 2/2016

1 August 2016 | Regular Paper

Word network topic model: a simple but general solution for short and imbalanced texts

Authors: Yuan Zuo, Jichang Zhao, Ke Xu

Abstract

Short text has become the prevalent format for information on the Internet, especially with the development of online social media. Although the sophisticated signals carried by short texts make them a promising source for topic modeling, their extreme sparsity and imbalance pose unprecedented challenges to conventional topic models such as LDA and its variants. Aiming at a simple but general solution for topic modeling in short texts, we present a word co-occurrence network-based model named WNTM that tackles sparsity and imbalance simultaneously. Unlike previous approaches, WNTM models the distribution over topics for each word instead of learning topics for each document, which enhances the semantic density of the data space without introducing much additional time or space complexity. Meanwhile, the rich contextual information preserved in the word–word space also makes WNTM sensitive to rare topics, which it identifies with convincing quality. Furthermore, because WNTM employs the same Gibbs sampling as LDA, it is easy to extend to various application scenarios. Extensive validation on both short and normal texts shows that WNTM outperforms the baseline methods. We also demonstrate its potential for precisely discovering newly emerging topics or unexpected events in Weibo at very early stages.
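The abstract states only that WNTM is built on a word co-occurrence network and learns topics per word rather than per document. The following is a minimal sketch of one plausible reading of that idea, not the authors' implementation: co-occurrences are counted inside a sliding window, and each word's weighted neighbourhood is flattened into a "pseudo-document" that a standard LDA Gibbs sampler could then consume. The function names, the window size of 10, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_word_network(docs, window=10):
    """Count how often two distinct words co-occur within a sliding window.

    `window` is an assumed parameter; the abstract does not specify one.
    """
    edges = Counter()
    for tokens in docs:
        # A text shorter than the window is treated as a single window.
        n_windows = max(1, len(tokens) - window + 1)
        for start in range(n_windows):
            # Sort so that each undirected edge has one canonical key.
            span = sorted(set(tokens[start:start + window]))
            for u, v in combinations(span, 2):
                edges[(u, v)] += 1
    return edges


def pseudo_documents(edges):
    """Turn each word's weighted neighbourhood into one pseudo-document."""
    pseudo = defaultdict(list)
    for (u, v), weight in edges.items():
        # Repeat a neighbour once per co-occurrence to keep the edge weight.
        pseudo[u].extend([v] * weight)
        pseudo[v].extend([u] * weight)
    return dict(pseudo)


# Toy corpus (hypothetical): two texts on one theme, one on another.
docs = [
    ["apple", "iphone", "launch"],
    ["apple", "iphone", "price"],
    ["rain", "flood", "warning"],
]
edges = build_word_network(docs)
pseudo = pseudo_documents(edges)
```

In this sketch the pseudo-document for "apple" aggregates context from both short texts that mention it, which illustrates how a word-level view can be denser than any single short document. Fitting a topic model to `pseudo` would then yield a topic distribution per word; how document-level topics are recovered from those is not described in the abstract and is left out here.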

Metadata
Title
Word network topic model: a simple but general solution for short and imbalanced texts
Authors
Yuan Zuo
Jichang Zhao
Ke Xu
Publication date
1 August 2016
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 2/2016
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-015-0882-z
