Published in: Knowledge and Information Systems 10/2020

10 July 2020 | Regular Paper

Bag of biterms modeling for short texts

Authors: Anh Phan Tuan, Bach Tran, Thien Huu Nguyen, Linh Ngo Van, Khoat Than

Abstract

Analyzing texts from social media is challenging because they are short, massive in volume, and dynamic. Short texts do not provide enough context, which causes traditional statistical models to fail. Furthermore, many applications face massive and dynamic collections of short texts, which pose various computational challenges for current batch learning algorithms. This paper presents a novel framework, bag of biterms modeling (BBM), for modeling massive, dynamic, and short text collections. BBM comprises two main ingredients: (1) the concept of bag of biterms (BoB) for representing documents, and (2) a simple way to help statistical models include BoB. Our framework can be easily deployed for a large class of probabilistic models, and we demonstrate its usefulness with two well-known models: latent Dirichlet allocation (LDA) and the hierarchical Dirichlet process (HDP). By exploiting both terms (words) and biterms (pairs of words), BBM offers two major advantages: (1) it increases the length of the documents and makes the context more coherent by emphasizing word connotation and co-occurrence via the bag of biterms, and (2) it inherits inference and learning algorithms from the underlying (primitive) model, which makes it straightforward to design online and streaming algorithms for short texts. Extensive experiments suggest that BBM outperforms several state-of-the-art models. We also point out that the BoB representation performs better than traditional representations (e.g., bag of words, tf-idf) even for normal texts.
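
To make the idea of a bag of biterms concrete, the following is a minimal Python sketch of one plausible construction. It is not the procedure from the paper: the abstract only says that a biterm is a pair of words, so the function name `bag_of_biterms`, the choice to treat pairs as unordered, and the choice to pair every two token positions in a document are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def bag_of_biterms(tokens):
    """Count unordered word pairs (biterms) co-occurring in one short document.

    Assumption: every pair of token positions in the document forms a biterm.
    The paper's actual construction may differ (e.g., windowing or filtering).
    """
    # Sort each pair so that ("b", "a") and ("a", "b") count as the same biterm.
    pairs = (tuple(sorted(pair)) for pair in combinations(tokens, 2))
    return Counter(pairs)


# A three-word document yields three biterms, a four-word one yields six.
short_doc = ["apple", "iphone", "launch"]
bob = bag_of_biterms(short_doc)

longer_doc = ["apple", "iphone", "launch", "event"]
bob_longer = bag_of_biterms(longer_doc)

# Terms and biterms can be kept side by side, as the abstract suggests.
terms = Counter(short_doc)
```

Under these assumptions a document with n tokens produces n(n-1)/2 biterms, which illustrates how the representation lengthens short documents: the pair count grows faster than the word count.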

Metadata
Title
Bag of biterms modeling for short texts
Authors
Anh Phan Tuan
Bach Tran
Thien Huu Nguyen
Linh Ngo Van
Khoat Than
Publication date
10 July 2020
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 10/2020
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-020-01482-z