2017 | OriginalPaper | Chapter

Semi-supervised Text Categorization Using Recursive K-means Clustering

Authors: Harsha S. Gowda, Mahamad Suhil, D. S. Guru, Lavanya Narayana Raju

Published in: Recent Trends in Image Processing and Pattern Recognition

Publisher: Springer Singapore

Abstract

In this paper, we present a semi-supervised learning algorithm for the classification of text documents, together with a method for labeling unlabeled text documents. The method follows a divide-and-conquer strategy: it applies the K-means algorithm recursively to partition a collection that contains both labeled and unlabeled documents. K-means is reapplied to each partition until every partition contains labeled documents of a single class. Once these clusters are obtained, their centroids serve as cluster representatives, and the nearest neighbor rule is used to classify an unknown text document. A series of experiments on the 20Newsgroups dataset demonstrates the superiority of the proposed model over other recent state-of-the-art models.
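The abstract gives no pseudocode, so the following is only a minimal sketch of the idea it describes, not the authors' implementation. It assumes documents are already represented as dense numeric vectors, that unlabeled documents carry the label `None`, that K is set to the number of classes present in the current partition, that distance is squared Euclidean, and that partitions containing no labeled document are discarded. All function names (`kmeans`, `recursive_kmeans`, `classify`) are hypothetical.

```python
import random
from collections import Counter


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd-style K-means; returns a cluster index for every point."""
    centers = rng.sample(points, k)  # assumption: random initial centers
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = centroid(members)
    return assign


def recursive_kmeans(points, labels, rng=None):
    """Split until each partition holds labeled documents of one class.

    Returns a list of (centroid, class_label) pairs, one per final partition.
    """
    rng = rng or random.Random(0)
    known = [l for l in labels if l is not None]
    classes = set(known)
    if not classes:
        return []  # assumption: partitions without labeled documents are dropped
    majority = Counter(known).most_common(1)[0][0]
    if len(classes) == 1:
        return [(centroid(points), majority)]
    # Assumption: K equals the number of classes present in this partition.
    assign = kmeans(points, len(classes), rng)
    groups = {}
    for p, l, a in zip(points, labels, assign):
        pts, lbs = groups.setdefault(a, ([], []))
        pts.append(p)
        lbs.append(l)
    if len(groups) == 1:
        # Degenerate split (e.g. identical vectors): stop with the majority class.
        return [(centroid(points), majority)]
    leaves = []
    for pts, lbs in groups.values():
        leaves.extend(recursive_kmeans(pts, lbs, rng))
    return leaves


def classify(doc, leaves):
    """Nearest neighbor rule over the cluster centroids."""
    return min(leaves, key=lambda leaf: sq_dist(doc, leaf[0]))[1]


# Toy data: two well-separated groups, one labeled document in each.
docs = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels = ["a", None, None, "b", None, None]
leaves = recursive_kmeans(docs, labels, random.Random(0))
print(classify([0.5, 0.5], leaves), classify([9, 9], leaves))
```

Because every non-degenerate split produces strictly smaller partitions, the recursion in this sketch always terminates. A real text pipeline would more likely use sparse TF-IDF vectors and a library K-means, which this toy version does not attempt to reproduce.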

Metadata
Title
Semi-supervised Text Categorization Using Recursive K-means Clustering
Authors
Harsha S. Gowda
Mahamad Suhil
D. S. Guru
Lavanya Narayana Raju
Copyright Year
2017
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-10-4859-3_20
