Published in: Cluster Computing, Special Issue 2/2019

27.02.2018

Arabic text clustering using improved clustering algorithms with dimensionality reduction

Authors: Arun Kumar Sangaiah, Ahmed E. Fakhry, Mohamed Abdel-Basset, Ibrahim El-henawy

Abstract

Arabic text document clustering is important for providing conjectural navigation and browsing techniques, because it organizes massive amounts of data into a small number of defined clusters. However, representing documents as word vectors for clustering is often unsatisfactory, as this representation ignores relationships between important terms. Cluster analysis separates data into groups, or clusters, for the purposes of improved understanding or summarization. Clustering has a long history, and many techniques have been developed in statistics, data mining, pattern recognition and other fields. This research proposes three approaches to construct a clustering-based classifier for Arabic text documents: unsupervised techniques, semi-supervised techniques, and semi-supervised techniques with dimensionality reduction. The approaches use k-means, incremental k-means, threshold + k-means, and k-means with dimensionality reduction. Documents are first preprocessed by removing stop words and extracting the root of each term in each document. A term weighting method then assigns each term a weight with respect to its document, and a similarity measure computes the similarity of each document to the other documents. Accuracy is calculated using F-measure, entropy and a support vector machine (SVM). The datasets are online dynamic datasets characterized by their availability and credibility on the internet. Arabic is a challenging language for inference-based algorithms, so selecting an appropriate dataset is a principal factor in such research. Compared with other approaches, the proposed methods show better accuracy and fewer errors on new classification test cases. The dimensionality reduction process is, however, very sensitive, because increasing the reduction ratio can destroy important terms.
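
The pipeline outlined in the abstract (term weighting, a similarity measure, k-means, optional dimensionality reduction, and entropy-based evaluation) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name the weighting scheme, the similarity measure, or the reduction method, so TF-IDF, cosine similarity, and pruning of low-weight terms are assumptions made here for illustration. The transliterated tokens are hypothetical toy data that stand in for documents already stripped of stop words and reduced to roots, and every function name is invented for this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term per document; docs are lists of preprocessed tokens (TF-IDF is an assumption)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] / len(doc) * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

def cosine(a, b):
    """Cosine similarity between two weight vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def k_means(vectors, k, iterations=20):
    """Basic k-means using cosine similarity; the first k vectors seed the centroids for determinism."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def reduce_terms(vectors, keep_ratio):
    """Keep only the top fraction of terms by summed weight (one simple, assumed form of reduction)."""
    n_terms = len(vectors[0])
    keep = max(1, int(n_terms * keep_ratio))
    totals = [sum(v[j] for v in vectors) for j in range(n_terms)]
    top = sorted(sorted(range(n_terms), key=lambda j: totals[j], reverse=True)[:keep])
    return [[v[j] for j in top] for v in vectors]

def cluster_entropy(labels, classes):
    """Size-weighted entropy of the true classes inside each cluster; 0 means every cluster is pure."""
    total = len(labels)
    result = 0.0
    for c in set(labels):
        members = [cls for lab, cls in zip(labels, classes) if lab == c]
        counts = Counter(members)
        h = -sum((m / len(members)) * math.log2(m / len(members)) for m in counts.values())
        result += len(members) / total * h
    return result

# Hypothetical toy corpus: two sports documents and two economy documents.
docs = [
    ["kura", "malab", "fariq"],
    ["iqtisad", "bank", "suq"],
    ["kura", "fariq", "hadaf"],
    ["bank", "suq", "mal"],
]
classes = ["sport", "economy", "sport", "economy"]

vocab, vectors = tf_idf(docs)
labels = k_means(vectors, k=2)
print(labels)
print(cluster_entropy(labels, classes))
print(len(reduce_terms(vectors, 0.5)[0]), "of", len(vocab), "terms kept")
```

Lowering keep_ratio in reduce_terms discards more of the vocabulary, which is one way to see the sensitivity the abstract warns about: once the terms that distinguish two topics are pruned, their documents can no longer be separated.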

Metadata
Title
Arabic text clustering using improved clustering algorithms with dimensionality reduction
Authors
Arun Kumar Sangaiah
Ahmed E. Fakhry
Mohamed Abdel-Basset
Ibrahim El-henawy
Publication date
27.02.2018
Publisher
Springer US
Published in
Cluster Computing / Special Issue 2/2019
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-018-2084-4