2018 | Original Paper | Book Chapter

LearnSec: A Framework for Full Text Analysis

Authors: Carlos Gonçalves, E. L. Iglesias, L. Borrajo, Rui Camacho, A. Seara Vieira, Célia Talma Gonçalves

Published in: Hybrid Artificial Intelligent Systems

Publisher: Springer International Publishing

Abstract

Large corpora of scientific research papers have been available for a long time. However, most of these corpora store only the title and abstract of each paper. For some domains, this information may not be enough to achieve high performance in text mining tasks. The growing availability of full-text scientific papers has recently eased this problem. A full-text version provides more detailed information but, on the other hand, requires processing a much larger amount of data. It is difficult to know a priori whether the extra work of full-text analysis has a significant impact on the performance of text mining tasks, or whether the effect depends on the scientific domain or the specific corpus under analysis.
This paper presents LearnSec, a framework for full-text analysis that incorporates domain-specific knowledge and information about the content of document sections to improve classification with propositional and relational learning.
To demonstrate the usefulness of the tool, we process a scientific corpus based on OHSUMED, generating an attribute/value dataset in Weka format and a First Order Logic dataset in Inductive Logic Programming (ILP) format. The results show a successful assessment of the framework.
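
As a rough illustration of the two output representations mentioned in the abstract, the sketch below turns sectioned documents into an attribute/value dataset in ARFF syntax and into ground first-order facts. It is not LearnSec's implementation: the toy documents, the per-section term counts, the `has_term/3` predicate, and the function names are all assumptions made for this example.

```python
from collections import Counter

# Toy documents: hypothetical data, not taken from the paper's corpus.
DOCS = [
    {"id": "d1", "label": "relevant",
     "sections": {"abstract": "gene expression study", "methods": "gene assay"}},
    {"id": "d2", "label": "irrelevant",
     "sections": {"abstract": "market survey", "methods": "survey design"}},
]


def section_counts(doc):
    """Count terms per section; keys are (section, term) pairs."""
    counts = Counter()
    for section, text in doc["sections"].items():
        for term in text.lower().split():
            counts[(section, term)] += 1
    return counts


def to_arff(docs, relation="docs"):
    """Build ARFF text with one numeric attribute per (section, term) pair."""
    per_doc = [section_counts(d) for d in docs]
    features = sorted(set().union(*per_doc))
    labels = sorted({d["label"] for d in docs})
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {s}_{t} numeric" for s, t in features]
    lines.append("@attribute class {" + ",".join(labels) + "}")
    lines += ["", "@data"]
    for doc, counts in zip(docs, per_doc):
        # A Counter returns 0 for (section, term) pairs absent from this document.
        row = [str(counts[f]) for f in features] + [doc["label"]]
        lines.append(",".join(row))
    return "\n".join(lines)


def to_facts(docs):
    """Build ground facts, one per distinct (document, section, term) triple.

    has_term/3 is an assumed predicate name, chosen for illustration only.
    """
    facts = []
    for doc in docs:
        for section, term in sorted(section_counts(doc)):
            facts.append(f"has_term({doc['id']}, {section}, '{term}').")
    return facts


if __name__ == "__main__":
    print(to_arff(DOCS))
    print("\n".join(to_facts(DOCS)))
```

In the propositional form each document becomes one fixed-length row, while the relational form keeps section membership as an explicit argument that a relational learner could use in rule bodies.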

Metadata
Title
LearnSec: A Framework for Full Text Analysis
Authors
Carlos Gonçalves
E. L. Iglesias
L. Borrajo
Rui Camacho
A. Seara Vieira
Célia Talma Gonçalves
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-92639-1_42