Skip to main content
Top

2017 | OriginalPaper | Chapter

Improving Document Clustering for Short Texts by Long Documents via a Dirichlet Multinomial Allocation Model

Authors : Yingying Yan, Ruizhang Huang, Can Ma, Liyang Xu, Zhiyuan Ding, Rui Wang, Ting Huang, Bowei Liu

Published in: Web and Big Data

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Document clustering for short texts has received considerable interest. Traditional document clustering approaches are designed for long documents and perform poorly for short texts due to the their sparseness representation. To better understand short texts, we observe that words that appear in long documents can enrich short text context and improve the clustering performance for short texts. In this paper, we propose a novel model, namely DDMAfs, which (1) improves the clustering performance of short texts by sharing structural knowledge of long documents to short texts; (2) automatically identifies the number of clusters; (3) separates discriminative words from irrelevant words for long documents to obtain high quality structural knowledge. Our experiments indicate that the DDMAfs model performs well on the synthetic dataset and real datasets. Comparisons between the DDMAfs model and state-of-the-art short text clustering approaches show that the DDMAfs model is effective.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Bela, A., Frigyik, A., Gupta, M.: Introduction to the dirichlet distribution and related processes. Department of Electrical Engineering, University of Washington (2010) Bela, A., Frigyik, A., Gupta, M.: Introduction to the dirichlet distribution and related processes. Department of Electrical Engineering, University of Washington (2010)
2.
go back to reference Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W., Freeman, D.: Autoclass: a Bayesian classification system. In: Readings in Knowledge Acquisition and Learning, pp. 431–441. Morgan Kaufmann Publishers Inc., Burlington (1993) Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W., Freeman, D.: Autoclass: a Bayesian classification system. In: Readings in Knowledge Acquisition and Learning, pp. 431–441. Morgan Kaufmann Publishers Inc., Burlington (1993)
3.
go back to reference Green, P.J., Richardson, S.: Modelling heterogeneity with and without the dirichlet process. Scand. J. Stat. 28(2), 355–375 (2001)MathSciNetCrossRefMATH Green, P.J., Richardson, S.: Modelling heterogeneity with and without the dirichlet process. Scand. J. Stat. 28(2), 355–375 (2001)MathSciNetCrossRefMATH
4.
go back to reference Hong, L., Davison, B.D.: Empirical study of topic modeling in Twitter. In: Proceedings of the First Workshop on Social Media Analytics, pp. 80–88. ACM (2010) Hong, L., Davison, B.D.: Empirical study of topic modeling in Twitter. In: Proceedings of the First Workshop on Social Media Analytics, pp. 80–88. ACM (2010)
5.
go back to reference Hotho, A., Staab, S., Stumme, G.: Wordnet improves text document clustering. In: Proceedings of the SIGIR 2003 Semantic Web Workshop, pp. 541–544 (2003) Hotho, A., Staab, S., Stumme, G.: Wordnet improves text document clustering. In: Proceedings of the SIGIR 2003 Semantic Web Workshop, pp. 541–544 (2003)
6.
go back to reference Huang, R., Yu, G., Wang, Z., Zhang, J., Shi, L.: Dirichlet process mixture model for document clustering with feature partition. IEEE Trans. Knowl. Data Eng. 25(8), 1748–1759 (2013)CrossRef Huang, R., Yu, G., Wang, Z., Zhang, J., Shi, L.: Dirichlet process mixture model for document clustering with feature partition. IEEE Trans. Knowl. Data Eng. 25(8), 1748–1759 (2013)CrossRef
7.
8.
go back to reference Jain, A.K.: Data clustering: 50 years beyond k-means. Pattern Recogn. Lett. 31(8), 651–666 (2010)CrossRef Jain, A.K.: Data clustering: 50 years beyond k-means. Pattern Recogn. Lett. 31(8), 651–666 (2010)CrossRef
9.
go back to reference Jin, O., Liu, N.N., Zhao, K., Yu, Y., Yang, Q.: Transferring topical knowledge from auxiliary long texts for short text clustering. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 775–784. ACM (2011) Jin, O., Liu, N.N., Zhao, K., Yu, Y., Yang, Q.: Transferring topical knowledge from auxiliary long texts for short text clustering. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 775–784. ACM (2011)
10.
go back to reference Phan, X.H., Nguyen, C.T., Le, D.T., Nguyen, L.M., Horiguchi, S., Ha, Q.T.: A hidden topic-based framework toward building applications with short web documents. IEEE Trans. Knowl. Data Eng. 23(7), 961–976 (2011)CrossRef Phan, X.H., Nguyen, C.T., Le, D.T., Nguyen, L.M., Horiguchi, S., Ha, Q.T.: A hidden topic-based framework toward building applications with short web documents. IEEE Trans. Knowl. Data Eng. 23(7), 961–976 (2011)CrossRef
11.
go back to reference Phan, X.H., Nguyen, L.M., Horiguchi, S.: Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: Proceedings of the 17th International Conference on World Wide Web, pp. 91–100. ACM (2008) Phan, X.H., Nguyen, L.M., Horiguchi, S.: Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: Proceedings of the 17th International Conference on World Wide Web, pp. 91–100. ACM (2008)
12.
go back to reference Smyth, P.: Model selection for probabilistic clustering using cross-validated likelihood. Stat. Comput. 10(1), 63–72 (2000)CrossRef Smyth, P.: Model selection for probabilistic clustering using cross-validated likelihood. Stat. Comput. 10(1), 63–72 (2000)CrossRef
13.
go back to reference Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., Su, Z.: Arnetminer: extraction and mining of academic social networks. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 990–998. ACM (2008) Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., Su, Z.: Arnetminer: extraction and mining of academic social networks. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 990–998. ACM (2008)
14.
go back to reference Weng, J., Lim, E.P., Jiang, J., He, Q.: TwitterRank: finding topic-sensitive influential twitterers. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 261–270. ACM (2010) Weng, J., Lim, E.P., Jiang, J., He, Q.: TwitterRank: finding topic-sensitive influential twitterers. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 261–270. ACM (2010)
15.
go back to reference Xu, J., Wang, P., Tian, G., Xu, B., Zhao, J., Wang, F., Hao, H.: Short text clustering via convolutional neural networks. In: Proceedings of NAACL-HLT, pp. 62–69 (2015) Xu, J., Wang, P., Tian, G., Xu, B., Zhao, J., Wang, F., Hao, H.: Short text clustering via convolutional neural networks. In: Proceedings of NAACL-HLT, pp. 62–69 (2015)
16.
go back to reference Yang, C.L., Benjamasutin, N., Chen-Burger, Y.H.: Mining hidden concepts: using short text clustering and wikipedia knowledge. In: 2014 28th International Conference on Advanced Information Networking and Applications Workshops (WAINA), pp. 675–680. IEEE (2014) Yang, C.L., Benjamasutin, N., Chen-Burger, Y.H.: Mining hidden concepts: using short text clustering and wikipedia knowledge. In: 2014 28th International Conference on Advanced Information Networking and Applications Workshops (WAINA), pp. 675–680. IEEE (2014)
17.
go back to reference Yin, J., Wang, J.: A dirichlet multinomial mixture model-based approach for short text clustering. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 233–242. ACM (2014) Yin, J., Wang, J.: A dirichlet multinomial mixture model-based approach for short text clustering. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 233–242. ACM (2014)
18.
go back to reference Yu, G., Huang, R., Wang, Z.: Document clustering via dirichlet process mixture model with feature selection. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 763–772. ACM (2010) Yu, G., Huang, R., Wang, Z.: Document clustering via dirichlet process mixture model with feature selection. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 763–772. ACM (2010)
19.
go back to reference Zhao, W.X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., Li, X.: Comparing Twitter and traditional media using topic models. In: Clough, P., Foley, C., Gurrin, C., Jones, G.J.F., Kraaij, W., Lee, H., Mudoch, V. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 338–349. Springer, Heidelberg (2011). doi:10.1007/978-3-642-20161-5_34 CrossRef Zhao, W.X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., Li, X.: Comparing Twitter and traditional media using topic models. In: Clough, P., Foley, C., Gurrin, C., Jones, G.J.F., Kraaij, W., Lee, H., Mudoch, V. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 338–349. Springer, Heidelberg (2011). doi:10.​1007/​978-3-642-20161-5_​34 CrossRef
20.
go back to reference Zhong, S.: Semi-supervised model-based document clustering: a comparative study. Mach. Learn. 65(1), 3–29 (2006)CrossRef Zhong, S.: Semi-supervised model-based document clustering: a comparative study. Mach. Learn. 65(1), 3–29 (2006)CrossRef
Metadata
Title
Improving Document Clustering for Short Texts by Long Documents via a Dirichlet Multinomial Allocation Model
Authors
Yingying Yan
Ruizhang Huang
Can Ma
Liyang Xu
Zhiyuan Ding
Rui Wang
Ting Huang
Bowei Liu
Copyright Year
2017
DOI
https://doi.org/10.1007/978-3-319-63579-8_47

Premium Partner