2018 | OriginalPaper | Book chapter

Metadata Extraction for Scientific Papers

Abstract

Metadata extraction for scientific literature aims to automatically annotate each paper with metadata that represents its most valuable information, including problem, method and dataset. Most existing work extracts keywords or key phrases as concepts for further analysis, without assigning them fine-grained types. In this paper, we present a supervised two-stage method to address the problem. The first stage extracts key phrases as metadata candidates. The second stage introduces various features (statistical features, linguistic features, position features, and a novel fine-grained distribution feature that is highly relevant to the metadata categories) to classify the candidates into the three foregoing categories. In the evaluation, we conduct extensive experiments on a manually labeled dataset from the ACL Anthology, and the results show that our proposed method achieves a +3.2% improvement in accuracy compared with strong baseline methods.
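
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stopword list, the n-gram candidate rule, and the function names `extract_candidates` and `featurize` are assumptions made for illustration. It covers only simplified statistical, linguistic and position features; the fine-grained distribution feature and the supervised classifier that would consume these vectors are not specified in the abstract and are left out.

```python
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "on", "to",
             "and", "we", "is", "with", "our"}

def extract_candidates(text, max_len=3):
    """Stage 1 (simplified): collect stopword-free word n-grams as candidates.

    Punctuation is dropped during tokenization, so n-grams may span
    sentence boundaries; a real system would use a proper phrase extractor.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    first_pos = {}
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if any(tok in STOPWORDS for tok in gram):
                continue
            phrase = " ".join(gram)
            counts[phrase] += 1
            first_pos.setdefault(phrase, i)  # remember first occurrence only
    return tokens, counts, first_pos

def featurize(text, max_len=3):
    """Stage 2 input (simplified): one small feature dict per candidate."""
    tokens, counts, first_pos = extract_candidates(text, max_len)
    total = max(len(tokens), 1)
    return {
        phrase: {
            "frequency": counts[phrase],                  # statistical feature
            "num_tokens": len(phrase.split()),            # crude linguistic feature
            "first_position": first_pos[phrase] / total,  # position feature
        }
        for phrase in counts
    }

features = featurize(
    "We propose a topic model for keyphrase extraction. "
    "The topic model is evaluated on ACL Anthology."
)
```

In a supervised setting, each feature dict would be paired with a manually assigned label (problem, method or dataset) and passed to a classifier of choice.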

Footnotes
5
The metadata of papers in the Dublin Core include: Title, Creator, Subject, Description, Contributor, Publisher, Date, Type, Format, Identifier, Source, Relation, Reference, Is Referenced By, Language, Rights and Coverage.
Metadata
Title
Metadata Extraction for Scientific Papers
Authors
Binjie Meng
Lei Hou
Erhong Yang
Juanzi Li
Copyright year
2018
DOI
https://doi.org/10.1007/978-3-030-01716-3_10