
2019 | OriginalPaper | Chapter

Semantically Aware Text Categorisation for Metadata Annotation

Authors: Giulio Carducci, Marco Leontino, Daniele P. Radicioni, Guido Bonino, Enrico Pasini, Paolo Tripodi

Published in: Digital Libraries: Supporting Open Science

Publisher: Springer International Publishing


Abstract

In this paper we present a system that addresses a long-standing and challenging problem: acquiring a classifier that automatically annotates bibliographic records, starting from a huge set of unbalanced and unlabelled data. We describe the main features of the dataset, the learning algorithm adopted, and how it was used to discriminate philosophical documents from documents of other disciplines. One strength of our approach lies in the novel combination of a standard learning approach with a semantic one: the results of the acquired classifier are improved by accessing a semantic network containing conceptual information. We describe the experimentation by explaining the rationale behind the construction of the training and test sets, report and discuss the results obtained, and conclude by outlining future work.


Footnotes
3
A full account of the EThOS UKETD_DC application profile can be found at http://ethostoolkit.cranfield.ac.uk/tiki-index.php?page=Metadata.
 
4
The final test set is available within the bundle containing the implementation of the described system [4].
 
5
An off-the-shelf implementation of the Random Forest algorithm was used, as provided by the scikit-learn framework, http://scikit-learn.org/stable/.
 
6
We used the list of English stop-words from the NLTK package, available at https://gist.github.com/sebleier/554280.
 
7
Stemming was done using the WordNet Lemmatizer, also available within the NLTK library, https://github.com/nltk/nltk/blob/develop/nltk/stem/wordnet.py.
 
8
It is worth noting that the human experts adopted a rather inclusive attitude with respect to religious studies, based on their previous acquaintance with an analogous dataset of US PhD dissertations, in which a significant number of ‘religious’ dissertations have been defended in philosophy departments.
 
9
We obtained a list of some relevant philosophical concepts from the upper levels of the Taxonomy of Philosophy by David Chalmers, http://consc.net/taxonomy.html.
 
10
We presently employ the Stanford Named Entity Recognizer [10].
 
11
We presently employ the Stanford POS Tagger [31].
 
12
The implemented system is delivered through the Zenodo platform [4].
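
The preprocessing mentioned in footnotes 6 and 7 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: the stop-word list is a tiny hypothetical stand-in for the full NLTK list, the lemmatisation step is left out because it needs the WordNet data, and the names preprocess and bag_of_words are invented here. The resulting term counts are the kind of features that could then be passed to a Random Forest classifier such as the scikit-learn one named in footnote 5.

```python
import re
from collections import Counter
from typing import List

# Hypothetical, heavily abbreviated stop-word list for illustration only;
# the footnote points to the full English list distributed with NLTK.
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "on", "the", "to"})

# Assumption: tokens are simply runs of ASCII letters.
TOKEN_RE = re.compile(r"[a-z]+")


def preprocess(text: str) -> List[str]:
    """Lowercase the text, split it into alphabetic tokens, drop stop-words."""
    tokens = TOKEN_RE.findall(text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


def bag_of_words(text: str) -> Counter:
    """Count the remaining terms; the counts can serve as classifier features."""
    return Counter(preprocess(text))


if __name__ == "__main__":
    title = "An Essay on the Metaphysics of Mind and the Nature of Knowledge"
    print(preprocess(title))
    print(bag_of_words(title))
```

Keeping the stop-word list as a frozenset makes each membership check constant-time, which matters when the same filter is applied to a very large set of bibliographic records.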
 
Literature
1.
Akinyelu, A.A., Adewumi, A.O.: Classification of phishing email using random forest machine learning technique. J. Appl. Math. 2014 (2014). Hindawi
2.
Begum, N., Fattah, M., Ren, F.: Automatic text summarization using support vector machine. Int. J. Innov. Comput. Inf. Control 5, 1987–1996 (2009)
5.
Chen, J., Huang, H., Tian, S., Qu, Y.: Feature selection for text classification with Naïve Bayes. Expert Syst. Appl. 36(3), 5432–5435 (2009)
7.
Cutler, D.R., et al.: Random forests for classification in ecology. Ecology 88(11), 2783–2792 (2007)
9.
Ferilli, S., Leuzzi, F., Rotella, F.: Cooperating techniques for extracting conceptual taxonomies from text. In: Appice, A., Ceci, M., Loglisci, C., Manco, G. (eds.) Proceedings of the Workshop on Mining Complex Patterns at AI*IA XIIth Conference (2011)
10.
Finkel, J.R., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 363–370. Association for Computational Linguistics (2005)
11.
Gabrilovich, E., Markovitch, S.: Overcoming the brittleness bottleneck using Wikipedia: enhancing text categorization with encyclopedic knowledge. In: Proceedings of the 21st National Conference on Artificial Intelligence. AAAI 2006, vol. 2, pp. 1301–1306. AAAI Press (2006)
12.
Ghignone, L., Lieto, A., Radicioni, D.P.: Typicality-based inference by plugging conceptual spaces into ontologies. In: Proceedings of the AIC. CEUR (2013)
13.
Harabagiu, S., Moldovan, D.: Question answering. In: Mitkov, R. (ed.) The Oxford Handbook of Computational Linguistics. Oxford University Press, New York (2003)
14.
Ho, T.K.: Random decision forests. In: Proceedings of the Third International Conference on Document Analysis and Recognition (Volume 1). ICDAR 1995, vol. 1, pp. 278–282. IEEE Computer Society, Washington, DC (1995)
15.
Hovy, E.: Text summarization. In: Mitkov, R. (ed.) The Oxford Handbook of Computational Linguistics, 2nd edn. Oxford University Press, New York (2003)
17.
Johnson, R., Zhang, T.: Semi-supervised convolutional neural networks for text categorization via region embedding. In: Advances in Neural Information Processing Systems, pp. 919–927 (2015)
18.
Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Short Papers, vol. 2, pp. 427–431. Association for Computational Linguistics (2017)
19.
Lai, S., Xu, L., Liu, K., Zhao, J.: Recurrent convolutional neural networks for text classification. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2267–2273. AAAI 2015. AAAI Press (2015)
22.
Lison, P., Kennington, C.: OpenDial: a toolkit for developing spoken dialogue systems with probabilistic rules. In: Proceedings of ACL-2016 System Demonstrations, pp. 67–72 (2016)
23.
Liu, H., Singh, P.: ConceptNet - a practical commonsense reasoning tool-kit. BT Technol. J. 22(4), 211–226 (2004)
24.
Marujo, L., Ribeiro, R., de Matos, D.M., Neto, J.P., Gershman, A., Carbonell, J.: Key phrase extraction of lightly filtered broadcast news. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2012. LNCS (LNAI), vol. 7499, pp. 290–297. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-32790-2_35
25.
McCallum, A., Nigam, K.: A comparison of event models for Naive Bayes text classification. In: AAAI-98 Workshop on Learning for Text Categorization, pp. 41–48. AAAI Press (1998)
26.
Mensa, E., Radicioni, D.P., Lieto, A.: COVER: a linguistic resource combining common sense and lexicographic information. Lang. Resour. Eval. 52(4), 921–948 (2018)
27.
Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)
28.
Navigli, R., Ponzetto, S.P.: BabelNet: building a very large multilingual semantic network. In: Proceedings of the 48th ACL, pp. 216–225. ACL (2010)
29.
Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled documents using EM. Mach. Learn. 39(2–3), 103–134 (2000)
30.
31.
Toutanova, K., Manning, C.D.: Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In: Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora: Held in Conjunction with the 38th Annual Meeting of the Association for Computational Linguistics, vol. 13, pp. 63–70. Association for Computational Linguistics (2000)
32.
Wang, J.: A knowledge network constructed by integrating classification, thesaurus, and metadata in digital library. Int. Inf. Libr. Rev. 35(2–4), 383–397 (2003)
33.
Weibel, S.: The Dublin Core: a simple content description model for electronic resources. Bull. Am. Soc. Inf. Sci. Technol. 24(1), 9–11 (1997)
34.
Xu, B., Guo, X., Ye, Y., Cheng, J.: An improved random forest classifier for text categorization. JCP 7(12), 2913–2920 (2012)
Metadata
Title: Semantically Aware Text Categorisation for Metadata Annotation
Authors: Giulio Carducci, Marco Leontino, Daniele P. Radicioni, Guido Bonino, Enrico Pasini, Paolo Tripodi
Copyright Year: 2019
DOI: https://doi.org/10.1007/978-3-030-11226-4_25