
2018 | Original Paper | Book Chapter

Cross-Evaluation of Automated Term Extraction Tools by Measuring Terminological Saturation

Authors: Victoria Kosa, David Chaves-Fraga, Dmitriy Naumenko, Eugene Yuschenko, Carlos Badenes-Olmedo, Vadim Ermolayev, Aliaksandr Birukou

Published in: Information and Communication Technologies in Education, Research, and Industrial Applications

Publisher: Springer International Publishing


Abstract

This paper reports on the cross-evaluation of two software tools for automated term extraction (ATE) from English texts: NaCTeM TerMine and UPM Term Extractor. The objective was to find the most fitting software for extracting the bags of terms, to become part of our instrumental pipeline for exploring terminological saturation in text document collections in a domain of interest. The choice of these particular tools from among the other available ones is explained in our review of the related work in ATE. The approach to measuring terminological saturation is based on the THD algorithm developed within our OntoElect methodology for ontology refinement. The paper presents the suite of instrumental software modules, the experimental workflow, 2 synthetic and 3 real document collections, the generated datasets, and the set-up of our experiments. Next, the results of the cross-evaluation experiments are presented, analyzed, and discussed. Finally, the paper offers some conclusions and recommendations on the use of ATE software for measuring terminological saturation in retrospective text document collections.


Footnotes
1
This paper is based on [1] in terms of its idea and research agenda presented as its research hypothesis and questions in Sect. 2. The rest constitutes the new result elaborated after the submission and publication of [1].
 
2
UPM Term Extractor can be downloaded from https://github.com/ontologylearning-oeg/epnoi-legacy. It then has to be installed locally for use.
 
3
The batch service of NaCTeM TerMine is available at http://www.nactem.ac.uk/batch.php. Access needs to be requested.
 
5
Batch mode for TerMine is freely accessible at http://www.nactem.ac.uk/batch.php for academic purposes, provided that NaCTeM grants permission to non-UK users.
 
6
All five collections in plain text and the datasets generated from these texts are publicly available at: https://www.dropbox.com/sh/64pbodb2dmpndcy/AACoDO0iBKP6Lm4400uxJQ6Ca?dl=0.
 
11
The values measured in the reported experiments, though sometimes mentioned in the text, are not presented in the paper to save space. All these experimental data and results are presented in full detail in the supporting technical report [27], which is publicly available online.
 
12
We have not yet checked this, so it is only a hypothesis.
 
13
D12 is the dataset from which B12 is extracted by UPM Extractor and B12m by TerMine. B12m is further converted to the UPM Extractor format and the pair (B12, B12m) is fed into the THD module. The module returns eps, thd, and thdr values for the pair as described in Sect. 3.
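The comparison step described in this footnote can be illustrated with a minimal sketch. It is not the THD module used in the paper: the function names, the representation of a bag of terms as a plain dictionary from term to normalized score, and the threshold rule (the lowest score among the top-ranked terms whose cumulative score first exceeds half of the total) are all assumptions made for illustration only.

```python
from typing import Dict, Tuple

Bag = Dict[str, float]  # hypothetical format: term -> normalized score


def significance_threshold(bag: Bag) -> float:
    """Assumed rule: walk the scores in descending order and return the
    score at which the running sum first exceeds half of the total."""
    total = sum(bag.values())
    running = 0.0
    eps = 0.0
    for score in sorted(bag.values(), reverse=True):
        running += score
        eps = score
        if running > total / 2:
            break
    return eps


def thd_sketch(bag_a: Bag, bag_b: Bag) -> Tuple[float, float, float]:
    """Return (eps, thd, thdr) for a pair of bags, e.g. (B12, B12m).

    thd sums the absolute score differences over the terms of bag_b that
    score at least eps; a term missing from bag_a counts with score 0.
    thdr relates thd to the total score of bag_b.
    """
    eps = significance_threshold(bag_b)
    thd = 0.0
    for term, score_b in bag_b.items():
        if score_b < eps:
            continue  # term is below the significance threshold
        thd += abs(score_b - bag_a.get(term, 0.0))
    total_b = sum(bag_b.values())
    thdr = thd / total_b if total_b else 0.0
    return eps, thd, thdr
```

Under these assumptions, two identical bags yield a thd of zero, and the value grows as the two tools disagree on which terms are significant or on how highly they score.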
 
14
At the time of writing the final version of this paper (December 2017).
 
References
1.
Kosa, V., Chugunenko, A., Yuschenko, E., Badenes, C., Ermolayev, V., Birukou, A.: Semantic saturation in retrospective text document collections. In: Mallet, F., Zholtkevych, G. (eds.) Proceedings of ICTERI 2017 PhD Symposium, CEUR-WS, Kyiv, Ukraine, 16–17 May, vol. 1851, pp. 1–8 (2017)
2.
Tatarintseva, O., Ermolayev, V., Keller, B., Matzke, W.-E.: Quantifying ontology fitness in OntoElect using saturation- and vote-based metrics. In: Ermolayev, V., Mayr, H.C., Nikitchenko, M., Spivakovsky, A., Zholtkevych, G. (eds.) ICTERI 2013. CCIS, vol. 412, pp. 136–162. Springer, Cham (2013). https://doi.org/10.1007/978-3-319-03998-5_8
4.
Astrakhantsev, N.: ATR4S: toolkit with state-of-the-art automatic terms recognition methods in Scala. arXiv preprint arXiv:1611.07804 (2016)
5.
Zhang, Z., Iria, J., Brewster, C., Ciravegna, F.: A comparative evaluation of term recognition algorithms. In: Proceedings of Sixth International Conference on Language Resources and Evaluation, LREC08, Marrakech, Morocco (2008)
6.
Fahmi, I., Bouma, G., van der Plas, L.: Improving statistical method using known terms for automatic term extraction. In: Computational Linguistics in the Netherlands, CLIN 2007, vol. 17 (2007)
8.
Daille, B.: Study and implementation of combined techniques for automatic extraction of terminology. In: Klavans, J., Resnik, P. (eds.) The Balancing Act: Combining Symbolic and Statistical Approaches to Language, pp. 49–66. The MIT Press, Cambridge (1996)
10.
Caraballo, S.A., Charniak, E.: Determining the specificity of nouns from text. In: Proceedings of 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 63–70 (1999)
11.
Medelyan, O., Witten, I.H.: Thesaurus based automatic keyphrase indexing. In: Marchionini, G., Nelson, M.L., Marshall, C.C. (eds.) Proceedings of ACM/IEEE Joint Conference on Digital Libraries, JCDL 2006, pp. 296–297. ACM, Chapel Hill (2006). https://doi.org/10.1145/1141753.1141819
12.
Ahmad, K., Gillam, L., Tostevin, L.: University of Surrey participation in TREC8: Weirdness indexing for logical document extrapolation and retrieval (WILDER). In: Proceedings of 8th Text REtrieval Conference, TREC-8 (1999)
14.
Sclano, F., Velardi, P.: TermExtractor: a web application to learn the common terminology of interest groups and research communities. In: Proceedings of 9th Conference on Terminology and Artificial Intelligence, TIA 2007, Sophia Antipolis, France (2007)
16.
Astrakhantsev, N.: Methods and software for terminology extraction from domain-specific text collection. Ph.D. thesis, Institute for System Programming of Russian Academy of Sciences (2015)
17.
Bordea, G., Buitelaar, P., Polajnar, T.: Domain-independent term extraction through domain modelling. In: Proceedings of 10th International Conference on Terminology and Artificial Intelligence, TIA 2013, Paris, France (2013)
19.
Nokel, M., Loukachevitch, N.: An experimental study of term extraction for real information-retrieval thesauri. In: Proceedings of 10th International Conference on Terminology and Artificial Intelligence, pp. 69–76 (2013)
20.
Zhang, Z., Gao, J., Ciravegna, F.: Jate 2.0: Java automatic term extraction with Apache Solr. In: Proceedings of LREC 2016, Slovenia, pp. 2262–2269 (2016)
23.
24.
Oliver, A., Vàzquez, M.: TBXTools: a free, fast and flexible tool for automatic terminology extraction. In: Angelova, G., Bontcheva, K., Mitkov, R. (eds.) Proceedings of Recent Advances in Natural Language Processing, pp. 473–479, Hissar, Bulgaria, 7–9 September 2015
25.
Corcho, O., Gonzalez, R., Badenes, C., Dong, F.: Repository of indexed ROs. Deliverable No. 5.4. Dr Inventor project (2015)
26.
Ermolayev, V., Batsakis, S., Keberle, N., Tatarintseva, O., Antoniou, G.: Ontologies of time: review and trends. Int. J. Comput. Sci. Appl. 11(3), 57–115 (2014)
Metadata
Title
Cross-Evaluation of Automated Term Extraction Tools by Measuring Terminological Saturation
Authors
Victoria Kosa
David Chaves-Fraga
Dmitriy Naumenko
Eugene Yuschenko
Carlos Badenes-Olmedo
Vadim Ermolayev
Aliaksandr Birukou
Copyright year
2018
DOI
https://doi.org/10.1007/978-3-319-76168-8_7