Knowledge and Information Systems

01-08-2016 | Regular Paper

Word network topic model: a simple but general solution for short and imbalanced texts

Authors: Yuan Zuo, Jichang Zhao, Ke Xu

Published in: Knowledge and Information Systems | Issue 2/2016

Abstract

Short text has become the prevalent format for information on the Internet, especially with the development of online social media. Although the sophisticated signals carried by short texts make them a promising source for topic modeling, their extreme sparsity and imbalance pose unprecedented challenges to conventional topic models such as LDA and its variants. To provide a simple but general solution for topic modeling in short texts, we present a word co-occurrence network-based model named WNTM that tackles sparsity and imbalance simultaneously. Unlike previous approaches, WNTM models the distribution over topics for each word instead of learning topics for each document, which enhances the semantic density of the data space without introducing much additional time or space complexity. Meanwhile, the rich contextual information preserved in the word–word space also ensures its sensitivity in identifying rare topics with convincing quality. Furthermore, because it employs the same Gibbs sampling as LDA, WNTM can be easily extended to various application scenarios. Extensive validation on both short and normal texts shows that WNTM outperforms baseline methods. We also demonstrate its potential for precisely discovering newly emerging topics or unexpected events on Weibo at very early stages.
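To make the word-level idea concrete, the following is a minimal Python sketch, not the authors' implementation. It builds a word co-occurrence network and then forms one pseudo-document per word from that word's network neighbors. The abstract does not specify the construction details, so the sliding-window size, the edge weighting, and the function names `build_word_network` and `pseudo_documents` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_word_network(docs, window=10):
    """Count undirected word co-occurrences inside a sliding window.

    Assumption: a pair is counted once per window it appears in, so pairs
    in overlapping windows of a long text are counted more than once.
    """
    edges = Counter()
    for tokens in docs:
        # A text shorter than the window yields exactly one window.
        for start in range(max(1, len(tokens) - window + 1)):
            span = tokens[start:start + window]
            for a, b in combinations(span, 2):
                if a != b:
                    edges[tuple(sorted((a, b)))] += 1
    return edges


def pseudo_documents(edges):
    """Build one pseudo-document per word from its weighted neighbors.

    Assumption: each neighbor is repeated as many times as the edge weight.
    """
    pseudo = defaultdict(list)
    for (a, b), weight in edges.items():
        pseudo[a].extend([b] * weight)
        pseudo[b].extend([a] * weight)
    return dict(pseudo)


# Two toy short texts, already tokenized (hypothetical data).
docs = [["apple", "iphone", "price"], ["apple", "fruit"]]
edges = build_word_network(docs)
pseudo = pseudo_documents(edges)
```

Under these assumptions, the pseudo-documents could be passed to any standard LDA Gibbs sampler in place of the original short texts. The topic proportions inferred for each pseudo-document would then play the role of the per-word topic distribution described in the abstract.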


http://jgibblda.sourceforge.net/.

http://code.google.com/p/plda/.

http://code.google.com/p/btm/.

Publicly available at http://ipv6.nlsde.buaa.edu.cn/zhaojichang/paper/wntm.rar.

http://ictclas.nlpir.org/downloads.

http://www.sogou.com/labs/dl/ca.html.

http://www.csie.ntu.edu.tw/~cjlin/liblinear/.


Title: Word network topic model: a simple but general solution for short and imbalanced texts
Authors: Yuan Zuo, Jichang Zhao, Ke Xu
Publication date: 01-08-2016
Publisher: Springer London
Published in: Knowledge and Information Systems / Issue 2/2016
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI: https://doi.org/10.1007/s10115-015-0882-z
