
2018 | Original Paper | Book Chapter

Performance Evaluation of Text Categorization Algorithms Using an Albanian Corpus

Authors: Evis Trandafili, Nelda Kote, Marenglen Biba

Published in: Advances in Internet, Data & Web Technologies

Publisher: Springer International Publishing

Abstract

Text mining and natural language processing are gaining a significant role in daily life as information volumes increase steadily. Most digital information is unstructured, in the form of raw text. While text mining and language processing have been researched extensively for several languages, much less work has been done for others. In this paper we evaluate the performance of some of the most important text classification algorithms on a corpus of Albanian texts. After applying natural language preprocessing steps, we apply several algorithms, including Simple Logistics, Naïve Bayes, k-Nearest Neighbor, Decision Trees, Random Forest, Support Vector Machines and Neural Networks. The experiments show that Naïve Bayes and Support Vector Machines perform best in classifying Albanian corpora. The Simple Logistics algorithm also shows good results.
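
To make the evaluated setting concrete, the following is a minimal sketch of one of the named algorithms, Naïve Bayes, in its multinomial form with add-one smoothing. It is not the authors' implementation or toolkit. The whitespace tokenizer stands in for the preprocessing steps the abstract mentions without detailing, and the class name, helper functions, category labels and English toy sentences are all hypothetical placeholders rather than material from the Albanian corpus.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (placeholder for real preprocessing)."""
    return text.lower().split()


class MultinomialNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)   # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / self.total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, documents, labels):
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(model.predict(d) == y for d, y in zip(documents, labels))
    return correct / len(labels)


# Hypothetical toy data, used only to exercise the code.
train_docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "parliament passed the new budget law",
    "the minister announced election reforms",
]
train_labels = ["sport", "sport", "politics", "politics"]

model = MultinomialNaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("the goal decided the match")
print(prediction)
```

A comparison like the one the abstract describes would train each algorithm on the same preprocessed documents and score it on held-out data, for instance with the `accuracy` helper above. The toy set here is far too small to say anything about relative performance.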

Metadata
Title
Performance Evaluation of Text Categorization Algorithms Using an Albanian Corpus
Authors
Evis Trandafili
Nelda Kote
Marenglen Biba
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-75928-9_48