These contributions were classified according to their goals. The first group of contributions (subsection ‘Research developed for Brazilian Portuguese term extraction’) corresponds to investigations that primarily compared, adapted, or developed methods for term extraction. More specifically, such contributions are described according to the approach (linguistic, statistical, or hybrid) on which the ATE is based. Furthermore, for those contributions based on linguistic knowledge, the level of knowledge employed is stated (i.e., morphological, syntactic, and/or semantic). The second group of contributions (subsection ‘Research related to Brazilian Portuguese term extraction’) comprises investigations that discuss term extraction in the Brazilian Portuguese language. The names of the authors of each contribution are rendered in italics in order to highlight their contributions.
Additionally, we describe the main current projects (subsection ‘Projects related to the Brazilian Portuguese term extraction’) and include a general overview of the performance of the ATE work available for processing Brazilian Portuguese, a discussion of the state of the art on ATE, and a geographical mapping of all contributions directly related to ATE for Brazilian Portuguese.
Research developed for Brazilian Portuguese term extraction
Teline et al. [66] evaluated the use of statistical measures applied to the corpus of the Industrial Ceramic Magazine, described in Section ‘Corpora for Portuguese’. The evaluated measures were: for unigrams, frequency; for bigrams, frequency, mutual information, log likelihood ratio, and Dice’s coefficient; and, for trigrams, frequency (tf), mutual information (mi), and log likelihood ratio (ll). The authors observed that it was not possible to identify which of the adopted measures is the best for bigrams in this corpus, because the results were quite similar. For trigrams, the absolute frequency measure presented better results than the mutual information and log likelihood ratio measures. Based on the results reported by the authors, it was possible to calculate the following F-measure values: 26%, 9%, and 0.62% for unigrams, bigrams, and trigrams, respectively.
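To make these measures concrete, the sketch below computes, under the simplifying assumption of whitespace-tokenized text, the absolute frequency, pointwise mutual information, and Dice’s coefficient of every bigram in a corpus (the log likelihood ratio, which requires full contingency tables, is omitted for brevity). This is an illustration of the compared measures, not the authors’ NSP-based setup.

```python
import math
from collections import Counter

def bigram_scores(tokens):
    """Score every bigram with frequency, pointwise MI, and Dice."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), f12 in bigrams.items():
        f1, f2 = unigrams[w1], unigrams[w2]
        p12 = f12 / (n - 1)                            # joint bigram probability
        pmi = math.log2(p12 / ((f1 / n) * (f2 / n)))   # pointwise mutual information
        dice = 2 * f12 / (f1 + f2)                     # Dice's coefficient
        scores[(w1, w2)] = {"freq": f12, "mi": round(pmi, 2), "dice": round(dice, 2)}
    return scores

tokens = "o revestimento ceramico e o revestimento ceramico poroso".split()
for bigram, s in sorted(bigram_scores(tokens).items()):
    print(bigram, s)
```

Candidates are then ranked by each measure separately, which is how the per-measure comparisons above are obtained.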
Zavaglia and her co-workers [16,17] evaluated term extraction according to the linguistic, statistical, and hybrid approaches. For all three approaches, the authors removed stopwords and used indicative phrases, for instance ‘definido como’ (‘defined as’) and ‘chamado’ (‘called’). More specifically, for the linguistic approach, they considered morphosyntactic patterns for the term extraction. For the statistical approach, the authors compared the ATE results obtained using, separately, different statistical measures (absolute frequency, mutual information, log likelihood ratio, and Dice’s coefficient), available in the Ngram Statistics Package (NSP) [73]. Regarding the hybrid approach (the use of the aforementioned statistical and linguistic knowledge together), they combined the knowledge obtained by the adopted measures. For the experiments, the authors used the ECO corpus [16], which contains 390 text documents in Portuguese from the ecology domain. They observed that the hybrid approach obtained the best results, although the number of extracted candidate terms was moderate. The best F-measure values were 16.48% for unigrams, 16.88% for bigrams, and 5.77% for trigrams.
In the work of Honorato and Monard [8], the authors developed a framework for terminology extraction from medical reports using the hybrid approach. This framework, called ‘Term Pattern Discover’ (TP-Discover), selects words and phrases that occur with a certain absolute frequency (statistical method) and, for that purpose, applies lemmatisation (linguistic method) using the TreeTagger lemmatiser [74]. Then, the terms that follow predefined morphosyntactic patterns are selected (for example, the term ‘terço distal’ (‘distal third’) follows the N+Adj pattern). As we did not have access to the exact precision and recall values, based on the available results, we assumed that the best F-measure value was 59%.
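A minimal sketch of the pattern-selection step just described, assuming a toy tag set and pattern inventory (TP-Discover’s actual patterns are not listed in this survey): lemmatised candidates are kept only when their POS sequence matches a predefined morphosyntactic pattern such as N+Adj.

```python
# Illustrative patterns; TP-Discover's real inventory is an assumption here.
PATTERNS = {("N", "ADJ"), ("N", "PREP", "N")}

def matches_pattern(tagged_candidate):
    """tagged_candidate: sequence of (lemma, pos) pairs for one candidate."""
    pos_sequence = tuple(pos for _, pos in tagged_candidate)
    return pos_sequence in PATTERNS

candidate = [("terço", "N"), ("distal", "ADJ")]
print(matches_pattern(candidate))  # True: 'terço distal' follows N+Adj
```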
Ribeiro Junior and Vieira [14,53] performed term extraction using the hybrid approach and the following three stages: (i) selection of semantic sets, (ii) simple term extraction, and (iii) compound term extraction. The selection of semantic sets consists of the removal of stopwords and the use of the semantic information made available by the PALAVRAS parser [75]. This prototypical information classifies common nouns into general classes; for example, the tag ‘<an>’ attributed to the noun ‘olho’ (‘eye’) indicates that the word belongs to the class ‘Anatomia’ (‘Anatomy’). In this way, the nouns carrying the same tag are grouped into semantic groups. A domain expert analyzes the list of obtained semantic tags, ordered by the relative frequency (rf) of each tag, and excludes the semantic groups that he/she considers unrelated to the domain in question. For the extraction of simple and compound terms, the authors used the relative frequency (rf), tf-idf [33], and nc-value [49] statistical measures. Moreover, for the extraction of compound terms, they used the ca-value [49] statistical measure. Regarding the extraction of simple terms, they only extracted candidates that belong to certain grammatical classes defined by the expert, as well as the heads of noun phrases. For compound term candidates, instead, the authors only considered those that follow certain morphosyntactic patterns, as well as those that constitute noun phrases. Finally, the authors combined these linguistic and statistical methods into hybrid methods. All these methods were used to extract terms from two corpora in Portuguese: Nanoscience and Nanotechnology [62], which contains 1,057 texts from those domains, and JPED, which is composed of 283 text documents from the pediatrics domain. For the first corpus, the precision of the extracted terms was calculated; for the JPED corpus, F-measure values were also calculated, and the best results were 22.39%, 10.04%, and 5.46% for unigrams, bigrams, and trigrams, respectively.
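For reference, the base c-value measure from [49], which underlies the nc-value variant used above, is commonly formulated as follows (a standard formulation, not a detail taken from this survey):

```latex
\mathrm{C\text{-}value}(a) =
\begin{cases}
\log_2 |a| \cdot f(a), & \text{if } a \text{ is not nested,}\\
\log_2 |a| \cdot \Bigl( f(a) - \frac{1}{|T_a|} \sum_{b \in T_a} f(b) \Bigr), & \text{otherwise,}
\end{cases}
```

where |a| is the length of the candidate a in words, f(a) its frequency, and T_a the set of longer candidates that contain a.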
Conrado and co-workers [52,69] performed term extraction (unigrams, bigrams, and trigrams) using the hybrid approach. The authors applied word normalization techniques (stemming, lemmatization, and nominalization) in the agribusiness domain. They removed the standard stopwords for Portuguese available in PRETEXT [76], together with the conjugations of the verb ‘to be’ and words consisting of a single character. Next, they applied statistical measures as follows: for unigrams, bigrams, and trigrams, they used document frequency (df ≥ 2), and, for bigrams and trigrams, they applied the log likelihood ratio. The unigrams removed on the basis of their df values formed a new list of words, called the ‘stoplist of the collection’ or ‘stoplist of the domain’; this list was incorporated into the standard stopwords and used when forming n-grams. The extracted terms were evaluated by the authors in an objective manner, using, for instance, the ctw measure [51], and in a subjective way with the support of domain experts. This term extraction approach was designed to be used in the TopTax project [77], explained later in this section, although it may be used for other purposes.
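The sketch below illustrates how such a df filter can both select candidates and yield a domain stoplist; it assumes the low-df unigrams are the ones removed and stoplisted, and the function name and threshold are ours, not the authors’.

```python
from collections import Counter

def build_domain_stoplist(documents, min_df=2):
    """Split unigrams into kept candidates (df >= min_df) and a stoplist."""
    df = Counter()
    for doc in documents:
        df.update(set(doc.split()))            # count each word once per document
    kept = {w for w, d in df.items() if d >= min_df}
    domain_stoplist = {w for w, d in df.items() if d < min_df}
    return kept, domain_stoplist

docs = ["solo argiloso fertil", "solo fertil irrigado", "colheita mecanizada"]
kept, stoplist = build_domain_stoplist(docs)
print(sorted(kept))      # unigrams preserved as candidates
print(sorted(stoplist))  # added to the standard stopwords before forming n-grams
```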
There is also the work of Lopes et al. [71], which used the OntoLP tool [53] to compare three ways of extracting terms (bigrams and trigrams) based on the linguistic approach. The first considered only n-grams, the second used pre-established morphosyntactic patterns, and the third considered only noun phrases. For the identification of both morphosyntactic patterns and noun phrases, the authors used the PALAVRAS parser [75]. The authors compared these three strategies among themselves. Moreover, they added information regarding semantic groups to each of these forms, generating three new ways of performing ATE, which were compared with the strategies that did not use semantic information. These semantic groups are prototypical information supplied by the PALAVRAS parser and made available by the OntoLP tool. Such information classifies common nouns into general classes. An example given by the authors is the tag ‘<an>’ attributed to the noun ‘músculo’ (‘muscle’), which indicates that the word belongs to the class ‘Anatomia’ (‘Anatomy’). As a result, the best F-measure values, obtained with the JPED corpus when considering noun phrases and excluding terms by semantic groups, were 11.51% for bigrams and 8.41% for trigrams.
In the contributions of Lopes et al. [67,68], the authors performed a comparative analysis of the extraction of bigrams and trigrams using a linguistic and a statistical approach. They extracted these terms from the JPED corpus. The linguistic approach used the EχATOLP tool [19] to identify noun phrases in a corpus previously annotated by the PALAVRAS parser [75]. The statistical approach used the NSP package to identify terms whose absolute frequency exceeded a given value. Also for the statistical approach, they removed (i) stopwords; (ii) text structural demarcations, such as ‘Introduction’ and ‘References’; and (iii) candidate terms whose words began with capital letters, in order to remove proper nouns such as ‘São Paulo’. The extracted terms were evaluated with the support of a gold standard of n-grams. As a result, the authors state that the statistical approach is very simple to execute; however, they obtained better results with the linguistic approach. The values obtained for the JPED corpus were an F-measure of 34.48% for bigrams and 38.37% for trigrams [68].
Lopes et al. [72] performed term extraction (bigrams and trigrams) from the JPED corpus. This extraction used the OntoLP tool [53], which considers three different ways of extracting terms: (i) the most frequent n-grams, (ii) candidates that follow certain morphosyntactic patterns, and (iii) noun phrases. Additionally, the authors tested different cut-off points. The best F-measure values were 56.84% for bigrams, when using POS, and 52.11% for trigrams, when using the frequency of n-grams; both values were achieved considering the thresholds of 5E-6 and 6E-6 for absolute cut-off points.
In the contributions of Muniz and collaborators [11,12,61], the authors presented the NorMan Extractor tool29, which extracts terms from instruction manuals (such as appliance manuals) using the hybrid approach. The term extraction is based on specific relations existing in this genre: instruction manuals have two basic procedural relations, the ‘gera’ (‘generation’) relation, when an action A automatically generates an action B, and the ‘habilita’ (‘enablement’) relation, when the realization of an action A allows the realization of an action B. The steps taken for the term extraction are the following. Firstly, the user selects an instruction manual to be used. The user may also submit a corpus of the domain, to be used in the calculation of the c-value measure [49], which is used for the extraction of compound candidate terms. Then, the instruction manual is annotated by the PALAVRAS parser [75]. From this annotation, it is possible to extract the terms using the ‘gera’ and ‘habilita’ relations. Lastly, the extracted lists of unigrams, bigrams, and n-grams are presented to the user, who may perform a cut based on the c-value offered for each extracted term. As there is no gold standard for instruction manuals, the authors did not present results using the F-measure; instead, the results were compared with other extraction methods focused on scientific papers. Additionally, the authors used the Kappa statistical measure [78], which indicates the agreement among annotators while discounting the agreement expected by chance.
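For reference, the Kappa statistic is defined as

```latex
\kappa = \frac{P_o - P_e}{1 - P_e},
```

where P_o is the observed agreement among annotators and P_e is the agreement expected by chance; κ = 1 indicates perfect agreement and κ = 0 indicates agreement no better than chance.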
Lopes and Vieira [10] extracted terms using linguistic knowledge. They considered only the noun phrases that fit one of 11 proposed linguistic heuristics; an example of these heuristics is the removal of NPs that begin with an adverb. The best F-measure values in the experiments carried out with the JPED corpus were 64% for bigrams and 50% for trigrams.
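As an illustration only (the paper defines 11 heuristics; the tag set and the single rule below are assumptions), such a heuristic can be expressed as a simple predicate over POS-tagged noun phrases:

```python
def keep_np(tagged_np):
    """tagged_np: list of (token, pos) pairs; drop NPs starting with an adverb."""
    return tagged_np[0][1] != "ADV"

noun_phrases = [
    [("muito", "ADV"), ("prematuro", "ADJ")],        # removed by the heuristic
    [("recém-nascido", "N"), ("prematuro", "ADJ")],  # kept as a candidate
]
print([np for np in noun_phrases if keep_np(np)])
```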
Lopes and co-workers [9,34] extracted bigrams and trigrams based on the same linguistic methods used by Lopes and Vieira [10] and ordered them using the numerical values obtained by applying the following statistical measures: tf, tf-idf, tds, thd, and TF-IDF (referred to in this work in upper-case letters to differentiate it from the tf-idf measure). Lopes also proposed and used the tf-dcf measure. According to the author, this measure takes the absolute frequency of the term as a primary indication of its relevance and penalizes terms that occur in contrasting corpora of other domains, dividing the term’s absolute frequency in the domain corpus by the geometric composition of its absolute frequency in each of the contrasting corpora. After ordering the terms by each of these measures, cut points were chosen and applied to the ordered lists of terms. For the experiments, they used the JPED corpus and four contrasting corpora [79]: stochastic modelling, data mining, parallel processing, and geology. The precision of the bigrams and trigrams extracted from the JPED corpus was evaluated in the following scenarios: (i) comparison of the linguistic heuristics adopted for the selection or removal of NPs, which showed that the proposed heuristics significantly improve the results; (ii) comparison of the statistical measures used, which showed that, for this corpus, the precision rates are higher when tf-dcf is used; and (iii) comparison of variations of the contrasting corpora using the tf-dcf measure, which showed that better results are obtained when the four contrasting corpora are used together. Considering cuts in the number of candidate terms, the author obtained F-measure values of 81% for bigrams and 84% for trigrams.
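A sketch of the tf-dcf idea as described above, assuming the ‘geometric composition’ is a product of smoothed frequencies over the contrasting corpora (the exact smoothing may differ from Lopes’ formulation):

```python
import math

def tf_dcf(term, domain_tf, contrasting_tfs):
    """Domain frequency divided by a penalty grown from contrasting corpora."""
    penalty = 1.0
    for corpus_tf in contrasting_tfs:
        penalty *= 1 + math.log(1 + corpus_tf.get(term, 0))
    return domain_tf.get(term, 0) / penalty

domain = {"icterícia neonatal": 42, "banco de dados": 40}
contrasting = [{"banco de dados": 90}, {"banco de dados": 60}]
for term in domain:
    print(term, round(tf_dcf(term, domain, contrasting), 2))
```

A domain-specific term absent from the contrasting corpora keeps its frequency almost intact, while a general-language term such as ‘banco de dados’ is heavily penalized.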
Conrado, Pardo, and Rezende [5] presented a term extraction approach (for unigrams) in which machine-learning inducers classify words as terms or non-terms. This classification is based on a set of 19 features identified for each word. These features use linguistic knowledge (such as noun phrases and POS), statistical knowledge (such as tf and tf-idf), and hybrid knowledge (such as the frequency of the words in a general-language corpus and the analysis of the context of the words). For the experiments, three corpora from different domains were used: ecology (ECO), distance education (EaD), and nanoscience and nanotechnology (N&N). The authors tested two different cutoffs (C1 and C2). In C1, only unigrams occurring in at least two documents of the corpus were preserved. In C2, starting from the candidates of C1, the authors preserved only the unigrams that occur in noun or prepositional phrases and that also carry one of these POS tags: noun, proper noun, verb, or adjective. The best F-measure values were 24.26%, 17.58%, and 54.04% for ECO, EaD, and N&N, respectively. Among the features identified for the candidate terms, tf-idf [33] was the one that best supported the term extraction, followed by N_Noun (created by the authors, this feature counts how many nouns originated the candidate when it was normalized) and TVQ [37] (a feature based on the assumption that terms do not have low frequency while keeping a non-uniform distribution throughout the corpus).
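A much-reduced sketch of this supervised setup: each candidate is represented by a feature vector (two toy features stand in for the 19 used in the paper) and an off-the-shelf inducer, here scikit-learn’s decision tree chosen as an assumption, classifies candidates as term or non-term.

```python
from sklearn.tree import DecisionTreeClassifier

# Rows: [tf, tf-idf] per candidate unigram (toy values); 1 = term, 0 = non-term.
X_train = [[30, 2.1], [5, 0.2], [22, 1.8], [3, 0.1]]
y_train = [1, 0, 1, 0]

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(clf.predict([[25, 1.9], [4, 0.15]]))  # labels for unseen candidates
```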
In the work of Conrado et al. [6], the authors used the same features and corpora as in [5]; however, in [6], they followed only the C2 cutoff described in [5]. The best F-measure values were 23.40%, 18.39%, and 48.30% for ECO, EaD, and N&N, respectively. Additionally, in [6], they discussed how features from different levels of knowledge help in classifying terms and gave examples of correctly and incorrectly extracted candidates.
There is also the work of Conrado et al. [7], which proposed the use of transductive learning for ATE. Transductive learning performs the classification by spreading labels from labeled to unlabeled data in a corpus. The advantage of this kind of learning is that it needs only a small number of labeled examples (candidates) to perform the classification. The authors extracted terms based on a set of 25 features identified for each unigram. These features use linguistic knowledge (such as noun phrases and POS), statistical knowledge (such as tf and tf-idf), and hybrid knowledge (such as the behavior of a candidate in a general-language corpus and the analysis of the context of the words). The experiments used a corpus of the ecology domain (ECO) and achieved an F-measure of 27%, while the best F-measure for the same corpus using inductive learning with 19 features was 24% [5].
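A minimal sketch of this transductive setting, using scikit-learn’s LabelSpreading as a stand-in (the paper’s exact algorithm is not specified in this survey); unlabeled candidates are marked with -1 and receive labels spread from the few labeled ones.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Toy feature vectors for candidate unigrams (stand-ins for the 25 features).
X = np.array([[30, 2.1], [5, 0.2], [28, 1.9], [4, 0.1], [26, 2.0]])
y = np.array([1, 0, -1, -1, -1])  # -1 marks unlabeled candidates

model = LabelSpreading(kernel="knn", n_neighbors=2).fit(X, y)
print(model.transduction_)  # labels inferred for every candidate
```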
The previously described contributions used different corpora and, in some cases, different evaluation measures, or evaluated only part of the list of candidate terms. Even considering such differences, and aiming to provide a general overview of the results on term extraction for the Portuguese language, Table 3 presents a summary of the contributions and their best F-measure values.
Table 3
Summary of the contributions on term extraction (best F-measure values; ‘-’ indicates not reported)

Contribution | Domain | Unigrams | Bigrams | Trigrams
Almeida, Aluísio, and Teline [65] | Ceramic coating | - | - | -
Almeida and Vale [27] | Ceramic coating, physiotherapy, and nanoscience and nanotechnology | - | - | -
Conrado [52] and Conrado et al. [69] | Agribusiness | - | - | -
Conrado, Pardo, and Rezende [5] | Ecology, distance education, and nanoscience and nanotechnology | 54.04% | - | -
Conrado et al. [6] | Ecology | 23.40% | - | -
Conrado et al. [6] | Distance education | 18.39% | - | -
Conrado et al. [6] | Nanoscience and nanotechnology | 48.30% | - | -
Conrado et al. [7] | Ecology | 27.00% | - | -
Honorato and Monard [8] | Medicine (medical reports) | 59.00% | - | -
Lopes [9] and Lopes and Vieira [70] | Pediatrics | - | 81.00% | 84.00%
Lopes, Fernandes, and Vieira [34] | Pediatrics | - | 81.00% | 84.00%
Lopes, Oliveira, and Vieira [67, 68] | Pediatrics | - | 51.42% | 41.26%
Lopes and Vieira [10] | Pediatrics | - | 64.00% | 50.00%
Lopes et al. [71] | Pediatrics | - | 11.50% | 8.40%
Lopes et al. [72] | Pediatrics | - | 56.84% | 52.11%
Muniz et al. [12] and Muniz [61] | Appliances | - | - | -
Ribeiro Junior and Vieira [14] | Pediatrics | 22.39% | 10.04% | 5.46%
Teline [31] | Ceramic coating | 11.00% | 17.00% | 46.00%
Teline, Manfrin, and Aluísio [66] | Ceramic coating | 26.00% | 9.00% | 0.62%
Zavaglia et al. [16, 17] | Ecology | 16.48% | 16.88% | 5.77%
Research related to Brazilian Portuguese term extraction
Some investigations discuss term extraction in the Brazilian Portuguese language.
Almeida et al. [65] used the corpus of the Industrial Ceramics Magazine to discuss the manual and automatic processes of term extraction. The authors stated that manual extraction carried out by domain experts, in which the experts indicate the terms of the domain, is considered a semantic criterion. In addition, the authors analysed the candidate terms obtained in the extraction considering three cases: with and without stopword removal, and with the correction of possible errors in the corpus used. For the extraction of unigrams, the authors used the frequency measure and, for bigrams, they compared mutual information, log likelihood ratio, and frequency.
In the work of Teline [31], the author carried out a bibliographical review of the weak and strong points of term extraction methods in the three approaches: statistical, linguistic, and hybrid. Regarding the statistical approach, the author compared the frequency, mutual information, log likelihood ratio, and Dice’s coefficient measures. For the linguistic approach, Teline removed stopwords and used indicative phrases, such as ‘definido(a)(s) como’ (‘defined as’), ‘caracterizado(a)’ (‘described as’), ‘conhecido(a)(s) como’ (‘known as’), and ‘significa(m)’ (‘mean(s)’), as well as morphosyntactic patterns. In the hybrid approach, the author combined the linguistic approach with the frequency measure, separately, for the extraction of unigrams, bigrams, and trigrams. Also, for the extraction of bigrams and trigrams, separately, the author combined the linguistic approach and frequency with the mutual information measure. Each of these extractions was evaluated in the ceramic coating domain. The hybrid approach (linguistic knowledge with frequency) presented the best F-measure values, which were 11%, 17%, and 33% for unigrams, bigrams, and trigrams, respectively. Considering a manual selection of the candidate terms (carried out by a linguist and the author), the F-measure reached 58% for unigrams ordered by the frequency measure and 26% for trigrams using Dice’s coefficient.
Almeida and Vale [27] discussed the specific morphological patterns that occur in three domains: ceramic coating, physiotherapy, and nanoscience and nanotechnology. For the ceramic coating domain, they observed a high frequency of combinations, such as ‘argila refratária aluminosa’ (‘alumina refractory clay’) and ‘análise granulométrica por peneiramento’ (‘granulometric analysis by sieving’), and of simple words followed by morphemes that may be useful as term identifiers, such as the derivational suffixes -agem and -ção, as in ‘secagem’ (‘drying’) and ‘moagem’ (‘grinding’). In the physiotherapy domain, there are many erudite formations of Greek or Latin origin, since this terminology inherits many terms from medicine, such as ‘arthr(o)-’, which may form, for instance, the terms ‘artralgia’ (‘arthralgia’) and ‘artrite’ (‘arthritis’). Regarding the nanoscience and nanotechnology domain, the most remarkable characteristic is the high absolute frequency of the nano- prefix, which may originate, for instance, the terms ‘nanocristais’ (‘nanocrystals’) and ‘nanossistema biológico’ (‘biological nanosystem’).
The most common way to extract terms is to attribute a value to each candidate term according to some measure. The candidates are then ranked by these values, and it is necessary to know where to perform the cutoff, i.e., up to which value/candidate the list should still be considered good.
Lopes and Vieira [70] discussed and compared three different forms of candidate cutoff: (i) absolute cutoff, in which the authors ranked the candidates using the tf-dcf measure and performed cutoffs at intervals between the first 100 and 3,500 candidates; (ii) threshold cutoff, in which they carried out cutoffs in the pediatrics domain considering the frequency of the candidate in the corpus (0 to 15); and (iii) relative cutoff, in which they also used tf-dcf and removed a percentage of the candidates (1% to 30%). Finally, they analyzed these three cutoffs and proposed a combination of the threshold and relative cutoffs, which keeps the candidates that have tf-dcf > 2 and correspond to up to 15% of the ranked candidates.
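A sketch of this combined cutoff with illustrative scores: the ranked list is first limited to its top 15% and then filtered by the tf-dcf > 2 threshold.

```python
def combined_cutoff(ranked, threshold=2.0, top_fraction=0.15):
    """ranked: list of (candidate, tf_dcf) sorted by descending tf_dcf."""
    limit = max(1, int(len(ranked) * top_fraction))
    return [(c, s) for c, s in ranked[:limit] if s > threshold]

ranked = [("icterícia neonatal", 9.3), ("aleitamento materno", 4.7),
          ("recém-nascido", 2.4)] + [(f"cand{i}", 1.0) for i in range(17)]
print(combined_cutoff(ranked))  # at most the top 15%, all with tf-dcf > 2
```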
Projects related to the Brazilian Portuguese term extraction
In this section, we present some of the main projects related to term extraction in the Brazilian Portuguese language, namely NANOTERM, TEXTQUIM/TEXTECC, Bio-C, E-TERMOS, and TermiNet. For each project, we highlight, in italics, where term extraction is applied. We also describe the OntoLP portal and the Linguateca repository, in which researchers may find resources to perform term extraction. A summary of these projects, the portal, and the repository, with their main characteristics highlighted, is presented in Table 4.
Table 4
Summary of the projects, portal, and repository related to term extraction

Name | Institution(s) | Period | Main objective | Domain(s)
Bio-C | GETerm and NILC | 2007 to 2009 | Generate the systematized terminology of the biofuel domain | Biofuel
e-Termos | EMBRAPA, GETerm, and NILC | 2009 to today | Terminological management | Several
NANOTERM | NILC and IFSC | 2006 to 2008 | Constitute corpus, build ontology, and elaborate pilot-dictionary | Nanoscience and nanotechnology
TermiNet | GETerm and NILC | 2009 to today | Build ontologies, develop terminological textual bases, and build a WordNet | Several
TEXTQUIM/TEXTECC | UFRGS | 2003 to today | Develop dictionary for translation | Chemistry, Physics, Pediatrics, Cardiology, Nursing, and Veterinary
TopTax | LABIC and EMBRAPA | 2005 to today | Organize and keep information on specific domains | Several
OntoLP | PUCRS | 2008 | Divulge tools and resources | Several
Linguateca | IST-UTL, UC, and PUC-Rio | 1998 | Maintain linguistic resources | Several
The NANOTERM Project
The project named ‘Terminology in the Portuguese language of Nanoscience and Nanotechnology: Systematisation of Vocabular Repertory and Creation of a Pilot Dictionary’ (NANOTERM) [63] was developed between 2006 and 2008 in GETerm of the Federal University of São Carlos with the collaboration of NILC of the University of São Paulo.
The objectives of this project were (i) the constitution of a corpus in the Portuguese language of nanoscience and nanotechnology; (ii) the search for equivalents in Portuguese (input language) for a nomenclature in English (output language); (iii) the creation of an ontology of the nanoscience and nanotechnology domain in the Portuguese language; and (iv) the elaboration of the first pilot-dictionary of nanoscience and nanotechnology in the mother language.
The semi-automatic term extraction in this project is related to obtaining the terminological set that composes the nomenclature of the dictionary or glossary. It is semi-automatic because, in this task, the role of the linguist is always foreseen, in addition to the automatic work carried out with the NSP package. Nomenclature is understood as the set of lexical units30 that constitute the entries of the glossary or dictionary. For the term extraction, the E-TERMOS computational environment, described afterwards, is used. At last, the extracted terms are inserted into the ontology of the nanoscience and nanotechnology domain in the Portuguese language.
The TEXTQUIM/TEXTECC Project
The project named Texts of Chemistry (TEXTQUIM)31, which began in 2003 and is developed by the Federal University of Rio Grande do Sul (UFRGS), is becoming the Technical and Scientific Texts (TEXTECC)32 project, as it will also comprise the domains of Chemistry, Physics, Pediatrics, Cardiology, Nursing, and Veterinary.
The objective of this project is to develop a dictionary to support translation students, initially in the pediatrics domain. For the purpose of studying patterns of Portuguese-English translation, Coulthard [54] built the JPED corpus, which is composed of 283 texts (785,448 words) in Portuguese extracted from the Journal of Pediatrics. In the scope of the TEXTQUIM/TEXTECC project, a manual term extraction (without linguistic annotation) was carried out on this corpus, considering only n-grams that occurred more than four times. Subsequently, a filtering based on heuristics was carried out, resulting in a new list of n-grams considered possibly relevant for the glossary. These n-grams were evaluated with respect to their relevance and manually refined by translation students with knowledge of the domain. This process originated the gold standards of the JPED corpus, which have been used in experiments on compound term extraction and concept candidates, as in the case of the OntoLP project [53].
The Bio-C Project
The project named ‘Biofuel Terminology: morphological and semantic description aiming at systematisation’ (Bio-C)33 was developed between 2007 and 2009 by GETerm of the Federal University of São Carlos with the support of NILC of the University of São Paulo.
The purpose of this project is to generate the systematized terminology of the biofuel domain, including the fundamental terms of this domain and of its sub-domains of ethanol and bio-diesel, in order to support the subsequent creation of the first glossary of this knowledge domain in the Brazilian Portuguese language.
For the semi-automatic term extraction in this project, the NSP package was used to generate lists of candidate terms (unigrams, bigrams, trigrams, quadrigrams, and pentagrams). Stopwords were removed to reduce the noise in the lists of candidate terms. Next, these lists were manually cleaned by a linguist, followed by an a posteriori validation of the candidates by a domain expert. As a result, the validated terms are expected to integrate the glossary of the area.
The E-TERMOS Project
The ‘Electronic Terms’ (E-TERMOS34) project [80] originated from the transformation of the TermEx project into E-TERMOS. It is a free, collaborative, web-based computational environment dedicated to terminological management. This project was developed at NILC of the University of São Paulo with the collaboration of GETerm of the Federal University of São Carlos and of the Brazilian Agricultural Research Corporation (EMBRAPA).
The main objective of E-TERMOS is to make the creation of terminological products possible, whether for academic research or promotion purposes, by means of the (semi-)automation of the stages of the terminological work. The goal of the automatic term extraction in this project is to obtain candidate terms from corpora of the speciality in question. To perform the extraction, firstly, it is possible to choose the size of the n-gram to be used, from 2 to 7. Then, it is possible to remove stopwords using a provided stopword list, which resulted from the work of Teline [31]. After the removal of the stopwords, the terms are extracted with the support of statistical and/or linguistic knowledge. According to the author, statistical measures (log likelihood ratio, mutual information, and Dice’s coefficient) from the NSP package, as well as linguistic knowledge (to be defined) and hybrid knowledge (the union of statistical and linguistic knowledge), are to be incorporated. Nowadays, the simple frequency statistical measure is available.
For the edition of the conceptual map of term categorization, E-TERMOS allows the creation, edition, and visualization of conceptual maps and provides computational resources for the insertion and evaluation of terms by experts.
Thereby, the terminological database is managed: the terminological record is created and filled in, and the definitional base is elaborated, with the support of tools that manage the terminological database.
Finally, in the stage of interchange and diffusion of terms, the entries are edited, and the diffusion, interchange, and querying of the terminological products may be performed with the help of terminological data exporting tools, making it possible for users to query the entries.
The TermiNet Project
The TermiNet project (Terminological WordNet) [81] has been under development since 2009 at the GETerm laboratory of the Federal University of São Carlos with the collaboration of NILC of the University of São Paulo.
This project has two main objectives. The first is to develop a generic, corpus-based, semi-automatic methodology for building lexical databases in the WordNet format. The second is to validate this methodology by building a TermiNet.
The candidate term extraction of TermiNet uses the linguistic approach, with the support of the EχATOLP [67] and OntoLP [14] tools, as well as the statistical approach, with the use of the NSP package [73]. Experts in the domain in question carry out a manual validation of the candidate terms. The candidate terms are also compared to a list of lexical units from contrasting corpora.
The TOPTAX Methodology
The Topic Taxonomy Environment (TOPTAX)35 [77] methodology aims at organizing and maintaining information on specific domains. This is achieved by creating a topic taxonomy of the domain knowledge represented by a collection of texts. The taxonomy is a hierarchical organization of topics extracted from a collection of texts, in which the upper topics are parents of the lower topics, i.e., the lower topics are specializations of the upper topics. In addition, it is possible to associate resources of the textual base with each level of the taxonomy, referring to its domain, thus facilitating the organization of the information under this taxonomy.
To achieve its objectives, TOPTAX follows the stages of the text mining process [82]: problem identification, pre-processing, pattern extraction, post-processing, and use of the knowledge.
The problem identification stage delimits the problem to be tackled by selecting and retrieving the documents that form the textual collection to be worked with.
In the pre-processing stage, the documents of the obtained textual base are prepared to serve as input for the tools that will be used. In this stage, the documents are converted to plain text without formatting. Afterwards, the words of the documents are normalized using one of the word normalization techniques (stemming, lemmatization, or nominalization), and stopwords are removed. Next, terms are extracted and used to describe the text base, as detailed in Conrado [52]. To reduce the number of terms to be worked with, a term selection is performed using, e.g., the Luhn, Salton, and term variance methods, which are detailed in the work of Nogueira [38].
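As an illustration of the term selection step, the sketch below ranks terms by term variance, one of the methods mentioned (the Luhn and Salton methods are omitted); terms whose frequency varies across documents rank above uniformly distributed ones.

```python
def term_variance(freq_per_doc):
    """freq_per_doc: the term's frequency in each document of the collection."""
    mean = sum(freq_per_doc) / len(freq_per_doc)
    return sum((f - mean) ** 2 for f in freq_per_doc)

frequencies = {"solo": [5, 0, 9, 1], "de": [7, 7, 7, 7]}
ranking = sorted(frequencies, key=lambda t: term_variance(frequencies[t]),
                 reverse=True)
print(ranking)  # 'solo' outranks the uniformly distributed 'de'
```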
In the pattern extraction stage, hierarchical document clustering is performed in order to build a topic taxonomy. With the hierarchy, the obtained clusters correspond to topics or sub-topics to which the documents refer. Subsequently, as described in Moura et al. [77], the descriptors of each group are identified by obtaining its most significant terms, and it is possible to add topic-related resources to each node, such as documents, videos, and associated images.
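A minimal sketch of this clustering step, assuming tf-idf document vectors and SciPy’s agglomerative clustering (the actual TopTax pipeline, including its descriptor selection, is richer):

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["solo argiloso fertil", "solo fertil irrigado",
        "colheita mecanizada soja", "colheita de soja"]
X = TfidfVectorizer().fit_transform(docs).toarray()
Z = linkage(X, method="ward")                   # hierarchy of document clusters
print(fcluster(Z, t=2, criterion="maxclust"))   # two top-level topics
```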
In the post-processing stage, the obtained hierarchy is visualized and validated. The knowledge about the domain at hand represented in this hierarchy is then used to support decisions and the organization of the information contained there (knowledge use stage).
The OntoLP Portal
The ‘Portal de Ontologia’ (OntoLP)36 [53] is developed by the Natural Language Processing Group of the Pontifical Catholic University of Rio Grande do Sul (PUCRS) and has the objective of disseminating the available ontologies in the Portuguese language, as well as terminological bases, controlled vocabularies, more complex ontologies of the OWL-DL (Web Ontology Language-Description Logics) type, and tools and resources related to research in the area.
The Linguateca Repository
The Linguateca Repository37, formally named ‘Centro de Recursos – distribuído – para a língua portuguesa’, was officially created in 2002, but the initial contributions related to it started in 1998. Linguateca is a repository of linguistic resources focused on the Portuguese language. Since its start in 1998, the responsibility for Linguateca used to be passed from pole to pole across several institutions; from 2009 on, it was established that this responsibility would rest only with the Oslo operational pole. The people in charge of this pole are four researchers (Diana Santos, Cristina Mota, Rosário Silva, and Fernando Ribeiro) from Instituto Superior Técnico, Universidade Técnica de Lisboa (IST-UTL)38 and Universidade de Coimbra (UC)39, two Ph.D. students (Nuno Cardoso and Hugo Oliveira) from IST-UTL, all in Portugal, as well as a Brazilian researcher (Maria Cláudia de Freitas) from the Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio40).
Next, we discuss the state of the art of term extraction in the Brazilian Portuguese language.