
2019 | OriginalPaper | Chapter

Comparative Study of Feature Selection Methods for Medical Full Text Classification

Authors: Carlos Adriano Gonçalves, Eva Lorenzo Iglesias, Lourdes Borrajo, Rui Camacho, Adrián Seara Vieira, Célia Talma Gonçalves

Published in: Bioinformatics and Biomedical Engineering

Publisher: Springer International Publishing


Abstract

Much of the work in text categorization uses only the title and abstract of papers. A full paper, however, contains far more information that could be used to improve classification performance. The potential benefits of using full texts come with an additional problem: the increased size of the data sets.
To cope with the increased size of full text data sets, we performed an assessment study on the use of feature selection methods for full text classification. We compared two existing feature selection methods (Information Gain and Correlation) with a novel method called k-Best-Discriminative-Terms (k-BDT). The assessment was conducted on the Ohsumed corpus. We ran two sets of experiments: one using the title and abstract only, and one using the full text.
The results show that the novel method does not perform well on small amounts of text such as title and abstract, but performs much better on the full text data sets while requiring a much smaller number of attributes.
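
As an illustration of one of the two existing methods named above, the following is a minimal sketch of Information Gain term ranking for binary text classification. It uses the textbook definition (class entropy minus the class entropy conditioned on whether a term occurs in a document). It is not the authors' implementation and does not reproduce k-BDT, which the abstract does not define. The function names and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)

def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the document'.

    docs   : list of sets of tokens (one set per document)
    labels : list of class labels, aligned with docs
    """
    n = len(labels)
    if n == 0:
        return 0.0
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    # Class entropy after splitting on the term, weighted by split size.
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional

def top_k_terms(docs, labels, k):
    """Return the k terms with the highest information gain."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(vocabulary,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:k]

# Toy data (hypothetical): two relevant and two non-relevant documents.
toy_docs = [
    {"tumor", "therapy"},
    {"tumor", "patient"},
    {"heart", "patient"},
    {"heart", "therapy"},
]
toy_labels = [1, 1, 0, 0]

if __name__ == "__main__":
    for term in sorted(set().union(*toy_docs)):
        print(term, information_gain(toy_docs, toy_labels, term))
    print(top_k_terms(toy_docs, toy_labels, 2))
```

In this toy set, terms that occur in only one class separate the classes perfectly, while terms spread evenly across both classes carry no information. Keeping only the top-ranked terms is the kind of attribute reduction that makes full text data sets more manageable.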


Footnotes
1
We used single words in our study, but k-BDT can also be used with other groupings of words such as n-grams (n > 1), NERs, etc.
 
2
It has been used in binary text classification but can also be adapted to non-binary classification problems.
 
Metadata
Title
Comparative Study of Feature Selection Methods for Medical Full Text Classification
Authors
Carlos Adriano Gonçalves
Eva Lorenzo Iglesias
Lourdes Borrajo
Rui Camacho
Adrián Seara Vieira
Célia Talma Gonçalves
Copyright Year
2019
DOI
https://doi.org/10.1007/978-3-030-17935-9_49
