Published in: Knowledge and Information Systems 10/2020

10-07-2020 | Regular Paper

Bag of biterms modeling for short texts

Authors: Anh Phan Tuan, Bach Tran, Thien Huu Nguyen, Linh Ngo Van, Khoat Than

Abstract

Analyzing texts from social media is challenging because of their distinctive characteristics: they are short, massive in volume, and dynamic. Short texts do not provide enough context information, which causes traditional statistical models to fail. Furthermore, many applications must handle massive and dynamic collections of short texts, which poses various computational challenges for current batch learning algorithms. This paper presents a novel framework, namely bag of biterms modeling (BBM), for modeling massive, dynamic, and short text collections. BBM comprises two main ingredients: (1) the concept of bag of biterms (BoB) for representing documents, and (2) a simple way to help statistical models include BoB. Our framework can be easily deployed for a large class of probabilistic models, and we demonstrate its usefulness with two well-known models: latent Dirichlet allocation (LDA) and the hierarchical Dirichlet process (HDP). By exploiting both terms (words) and biterms (pairs of words), BBM has two major advantages: (1) it increases the effective length of the documents and makes the context more coherent by emphasizing word connotation and co-occurrence via the bag of biterms, and (2) it inherits the inference and learning algorithms of the original models, which makes it straightforward to design online and streaming algorithms for short texts. Extensive experiments suggest that BBM outperforms several state-of-the-art models. We also point out that the BoB representation performs better than traditional representations (e.g., bag of words, tf-idf) even for normal texts.
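To make the bag-of-biterms idea concrete, the following Python sketch counts word pairs in a single short document. It is an illustration only, not the authors' implementation: the abstract does not specify how biterms are extracted, so the sketch assumes a biterm is an unordered pair of words taken from two different positions in the same document. The function names `bag_of_words` and `bag_of_biterms` and the sample document are hypothetical.

```python
from collections import Counter
from itertools import combinations


def bag_of_words(tokens):
    """Count single terms (the traditional representation)."""
    return Counter(tokens)


def bag_of_biterms(tokens):
    """Count unordered word pairs drawn from two positions in one document.

    Assumption: every pair of positions in the document forms a biterm.
    The paper may restrict or weight pairs differently.
    """
    # Sort each pair so ("iphone", "apple") and ("apple", "iphone")
    # are counted as the same biterm.
    pairs = (tuple(sorted(pair)) for pair in combinations(tokens, 2))
    return Counter(pairs)


# Hypothetical short text, already tokenized.
doc = ["apple", "iphone", "launch", "event"]

bow = bag_of_words(doc)
bob = bag_of_biterms(doc)

# A document with n tokens yields n * (n - 1) / 2 biterms,
# so 4 tokens become 6 biterm occurrences.
print(sum(bow.values()), sum(bob.values()))
```

Under this assumption a four-word text turns into six biterm occurrences, which illustrates how the representation lengthens short documents. The resulting counts could then be passed to a topic model in place of word counts, with each distinct biterm treated as a vocabulary item.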

Metadata
Title
Bag of biterms modeling for short texts
Authors
Anh Phan Tuan
Bach Tran
Thien Huu Nguyen
Linh Ngo Van
Khoat Than
Publication date
10-07-2020
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 10/2020
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-020-01482-z
