2018 | OriginalPaper | Chapter

On Feature Weighting and Selection for Medical Document Classification

Authors: Bekir Parlak, Alper Kursat Uysal

Published in: Developments and Advances in Intelligent Systems and Applications

Publisher: Springer International Publishing

Abstract

Medical document classification remains a popular research problem within the text classification domain. In this study, the impact of feature selection and feature weighting on medical document classification is analyzed using two datasets containing MEDLINE documents. The performance of two feature selection methods, namely the Gini index and the distinguishing feature selector, and two term weighting methods, namely term frequency (TF) and term frequency-inverse document frequency (TF-IDF), is analyzed using two pattern classifiers: a Bayesian network and a C4.5 decision tree. As this study deals with single-label classification, a subset of documents from OHSUMED and a self-compiled dataset are used to assess these methods. Because some categories in the self-compiled dataset contain few documents, only documents belonging to 10 disease categories are used in the experiments for both datasets. Experimental results show that the best result is obtained with the combination of the distinguishing feature selector, TF feature weighting, and the Bayesian network classifier.
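To make the terms in the abstract concrete, the following is a minimal Python sketch of TF weighting, TF-IDF weighting, and a distinguishing feature selector (DFS) score. It is not the authors' implementation. The toy corpus, the function names, the `log(N / df)` form of IDF, and the DFS formula used here (the commonly cited formulation, summing P(C|t) / (P(not t|C) + P(t|not C) + 1) over classes) are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy corpus of (tokens, class label) pairs. Purely illustrative,
# not taken from OHSUMED or the authors' self-compiled dataset.
docs = [
    (["asthma", "inhaler", "patient"], "respiratory"),
    (["asthma", "lung", "patient"], "respiratory"),
    (["insulin", "glucose", "patient"], "endocrine"),
    (["insulin", "diabetes", "glucose"], "endocrine"),
]

def tf(tokens):
    """Raw term frequency of each term in one document."""
    return Counter(tokens)

def tf_idf(tokens, corpus):
    """TF times log(N / df); one common IDF variant, assumed here."""
    n = len(corpus)
    weights = {}
    for term, freq in Counter(tokens).items():
        df = sum(1 for doc_tokens, _ in corpus if term in doc_tokens)
        weights[term] = freq * math.log(n / df)
    return weights

def dfs_score(term, corpus):
    """DFS score under the assumed formulation:
    sum over classes C of P(C|t) / (P(not t|C) + P(t|not C) + 1)."""
    labels = {label for _, label in corpus}
    with_term = [label for tokens, label in corpus if term in tokens]
    if not with_term:
        return 0.0
    score = 0.0
    for c in labels:
        in_c = [tokens for tokens, label in corpus if label == c]
        not_c = [tokens for tokens, label in corpus if label != c]
        p_c_given_t = with_term.count(c) / len(with_term)
        p_not_t_given_c = sum(term not in t for t in in_c) / len(in_c)
        p_t_given_not_c = (
            sum(term in t for t in not_c) / len(not_c) if not_c else 0.0
        )
        score += p_c_given_t / (p_not_t_given_c + p_t_given_not_c + 1)
    return score
```

In this toy corpus, "asthma" appears in every respiratory document and in no other class, so it scores higher under DFS than "patient", which is spread across both classes. TF assigns every term in the first document the same weight, whereas TF-IDF down-weights "patient" because it occurs in most documents.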

Metadata
Title
On Feature Weighting and Selection for Medical Document Classification
Authors
Bekir Parlak
Alper Kursat Uysal
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-58965-7_19
