2020 | OriginalPaper | Chapter

Preprocessing Techniques in Text Categorization: A Survey

Authors: Sayyam Malik, Sana Ahmad Sani, Anees Baqir, Usman Ahmad, Faizan ul Mustafa

Published in: Intelligent Technologies and Applications

Publisher: Springer Singapore

Abstract

Text categorization is the process of assigning an unstructured natural language (NL) text to one or more related categories drawn from a predefined set. Pre-processing is a crucial step in text categorization: it extracts non-trivial, interesting, and useful input for the later stages of the process. Because the words in a text usually contain many structural variations, pre-processing techniques are applied to the data before information is accessed from the documents. These techniques reduce the size of the data, which may improve the efficacy of the results and lead to better categorization of the text. The main objective of this research is to survey pre-processing techniques such as tokenization, stop-word removal, and stemming. We examine how these techniques affect text categorization, for better or for worse.
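The three techniques named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the stop-word list, the suffix list, and the function names `tokenize`, `remove_stop_words`, `naive_stem`, and `preprocess` are all assumptions made for illustration, and the stemmer is a naive suffix stripper rather than an established algorithm such as Lovins or Porter.

```python
import re

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Suffixes tried in order by the naive stemmer; longer ones come first.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


if __name__ == "__main__":
    # Seven input words shrink to three normalized terms.
    print(preprocess("The cats are jumping in the gardens"))
```

The example sentence is reduced to `['cat', 'jump', 'garden']`, which shows how pre-processing shrinks the data. The same sketch also shows a possible downside: `naive_stem("news")` returns `new`, conflating two unrelated words, which is the kind of negative effect on categorization that the abstract alludes to.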


DOI: https://doi.org/10.1007/978-981-15-5232-8_43
