2019 | OriginalPaper | Book chapter

Comparative Study of Feature Selection Methods for Medical Full Text Classification

Authors: Carlos Adriano Gonçalves, Eva Lorenzo Iglesias, Lourdes Borrajo, Rui Camacho, Adrián Seara Vieira, Célia Talma Gonçalves

Published in: Bioinformatics and Biomedical Engineering

Publisher: Springer International Publishing

Abstract

Much work in text categorization uses only the titles and abstracts of papers. A full paper, however, contains far more information that could be used to improve classification performance. The potential benefits of using full texts come with an additional problem: the increased size of the data sets.
To cope with the increased size of full-text data sets, we performed an assessment study on the use of feature selection methods for full-text classification. We compared two existing feature selection methods (Information Gain and Correlation) with a novel method called k-Best-Discriminative-Terms. The assessment was conducted using the Ohsumed corpora. We ran two sets of experiments: one using the title and abstract only, and one using the full text.
The results show that the novel method does not perform well on small amounts of text such as the title and abstract, but performs much better on the full-text data sets and requires a much smaller number of attributes.
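
As an illustration of one of the existing methods named in the abstract, the sketch below ranks terms by Information Gain on a toy corpus. It is not the authors' implementation: the function names (`entropy`, `information_gain`, `top_k_terms`), the toy documents, and the class labels are all assumptions made for this example, and each term is treated as a simple binary "occurs in document" feature. The k-Best-Discriminative-Terms method is not sketched because the abstract does not define it.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Information Gain of the binary feature 'term occurs in the document'."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    # Class entropy remaining after splitting the corpus on term presence.
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def top_k_terms(docs, labels, k):
    """Return the k terms with the highest Information Gain (ties: alphabetical)."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary,
                    key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Hypothetical toy corpus: each document is a set of tokens.
docs = [
    {"heart", "attack", "patient"},
    {"heart", "surgery", "patient"},
    {"tumor", "biopsy", "patient"},
    {"tumor", "therapy", "patient"},
]
labels = ["cardio", "cardio", "onco", "onco"]

print(top_k_terms(docs, labels, 2))
```

In this toy corpus, "heart" and "tumor" separate the two classes perfectly, whereas "patient" occurs in every document and therefore carries no information about the class. Keeping only the top-ranked terms is what shrinks the attribute set, which matters more as documents grow from abstracts to full texts.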

Footnotes
1
We have used single words in our study, but k-BDT can also be used with other groupings of words such as n-grams (n > 1), NERs, etc.

2
It has been used in binary text classification but can also be adapted to non-binary classification problems.
 
Metadata
Title
Comparative Study of Feature Selection Methods for Medical Full Text Classification
Authors
Carlos Adriano Gonçalves
Eva Lorenzo Iglesias
Lourdes Borrajo
Rui Camacho
Adrián Seara Vieira
Célia Talma Gonçalves
Copyright year
2019
DOI
https://doi.org/10.1007/978-3-030-17935-9_49