2017 | OriginalPaper | Chapter

A Lexicon LDA Model Based Solution to Theme Extraction of Chinese Short Text on the Internet

Authors: Xu Wang, Jing Zhou

Published in: Bio-inspired Computing: Theories and Applications

Publisher: Springer Singapore

Abstract

Chinese short text has become the main form of content on the Internet. Accurately extracting thematic terms underpins content analysis, query suggestion, document classification, text clustering, and other tasks on such text. Because Chinese short text on the Internet is brief, unbalanced, and poor in context information, traditional text clustering models are not directly applicable. This paper presents a simple and generic theme model, named Lexicon LDA, for Chinese short text on the Internet. It uses the sentence structure within each document to enrich the semantic context of common Chinese words: the words of each sentence, as delimited by punctuation marks, form a word set. Unlike previous methods, the model assigns a theme to each word set rather than to each document. Experiments show that when the data set exhibits a strong theme distribution, this significantly improves the performance of the theme model. The conclusion is that extracting thematic terms from Chinese short text on the Internet depends both on the word itself and on the sentence in which the word occurs.
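The word-set construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the text has already been segmented into space-separated words by some Chinese word segmenter, and the function name `split_into_word_sets` and the punctuation set in `BOUNDARY` are hypothetical choices made for this example only.

```python
import re

# Punctuation treated as a sentence or clause boundary.
# This set is an assumption for illustration; the abstract does not list the marks used.
BOUNDARY = re.compile(r"[，。！？；：、,.!?;:]+")


def split_into_word_sets(segmented_text):
    """Return one list of words per punctuation-delimited span.

    `segmented_text` is assumed to be pre-segmented, with words separated by spaces.
    """
    word_sets = []
    for span in BOUNDARY.split(segmented_text):
        words = span.split()
        if words:  # skip empty spans, e.g. after trailing punctuation
            word_sets.append(words)
    return word_sets


# Hypothetical pre-segmented short text.
text = "今天 天气 很 好 ， 我们 去 公园 散步 。 晚上 看 电影 ！"
word_sets = split_into_word_sets(text)
for words in word_sets:
    print(words)
```

In a model of the kind the abstract describes, each of these word sets, rather than the whole document, would be the unit that receives a theme assignment.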

Metadata
Title: A Lexicon LDA Model Based Solution to Theme Extraction of Chinese Short Text on the Internet
Authors: Xu Wang, Jing Zhou
Copyright Year: 2017
Publisher: Springer Singapore
DOI: https://doi.org/10.1007/978-981-10-7179-9_17