2015 | Original Paper | Book Chapter

Machine Learning Techniques for Automatically Extracting Contextual Information from Scientific Publications

Written by: Stefan Klampfl, Roman Kern

Published in: Semantic Web Evaluation Challenges

Publisher: Springer International Publishing

Abstract

Scholarly publishing increasingly requires automated systems that semantically enrich documents in order to support management and quality assessment of scientific output. However, contextual information, such as the authors’ affiliations, references, and funding agencies, is typically hidden within PDF files. To access this information, we have developed a processing pipeline that analyses the structure of a PDF document using a diverse set of machine learning techniques. First, unsupervised learning is used to extract contiguous text blocks from the raw character stream as the basic logical units of the article. Next, supervised learning is employed to classify blocks into different meta-data categories, including authors and affiliations. Then, a set of heuristics is applied to detect the reference section at the end of the paper and segment it into individual reference strings. Sequence classification is then utilised to categorise the tokens of individual references to obtain information such as the journal and the year of the reference. Finally, we make use of named entity recognition techniques to extract references to research grants, funding agencies, and EU projects. Our system is modular in nature. Some parts rely on models learnt on training data, and the overall performance scales with the quality of these data sets.
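
To make the reference-handling step more concrete, the following is a minimal sketch of a heuristic that locates a reference section and splits it into individual reference strings. It is not the authors' implementation: the heading pattern, the marker pattern, and all function names are assumptions chosen for illustration. The final helper uses a plain regular expression as a simplified stand-in for the sequence classification step, and it only guesses the publication year.

```python
import re

# Assumed heading names; real articles use many more variants.
HEADING = re.compile(r"^\s*(references|bibliography|literature)\s*$", re.IGNORECASE)
# Assumed reference markers: "[12]" or "12." at the start of a line.
MARKER = re.compile(r"^\s*(?:\[\d+\]|\d+\.)\s+")
# Four-digit years from 1900 to 2099.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def find_reference_lines(lines):
    """Return the lines that follow the last reference-like heading."""
    start = None
    for i, line in enumerate(lines):
        if HEADING.match(line):
            start = i + 1  # keep the last match: the section sits at the end
    return [] if start is None else lines[start:]


def split_references(ref_lines):
    """Group lines into reference strings; a marker line opens a new one."""
    refs = []
    for line in ref_lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines between references
        if MARKER.match(line):
            refs.append(MARKER.sub("", line).strip())
        elif refs:
            refs[-1] += " " + text  # continuation of the previous reference
    return refs


def guess_year(reference):
    """Regex stand-in for a sequence classifier: last four-digit year, if any."""
    years = YEAR.findall(reference)
    return years[-1] if years else None


if __name__ == "__main__":
    page = [
        "Body text of the article.",
        "References",
        "1. Berger, A.L.: A maximum entropy approach to natural language processing.",
        "   Comput. Linguist. 22(1), 39-71 (1996)",
        "[2] Lafferty, J.: Conditional random fields. In: ICML, pp. 282-289 (2001)",
    ]
    for ref in split_references(find_reference_lines(page)):
        print(guess_year(ref), "|", ref)
```

A sketch like this works on plain text lines only. It would misfire on a continuation line that happens to start with a number followed by a full stop, which is one reason layout and formatting cues, or a trained sequence model, are attractive for this task.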

Metadata
Title
Machine Learning Techniques for Automatically Extracting Contextual Information from Scientific Publications
Written by
Stefan Klampfl
Roman Kern
Copyright year
2015
DOI
https://doi.org/10.1007/978-3-319-25518-7_9