Published in: Mobile Networks and Applications 1/2021

08-09-2020

An Improved NER Methodology to the Portuguese Language

Authors: Rogerio de Aquino Silva, Luana da Silva, Moisés Lima Dutra, Gustavo Medeiros de Araujo


Abstract

The text mining process typically applies natural language processing (NLP) techniques to obtain important information and extract insights from texts. This is achieved by detecting patterns that are not explicit a priori in unstructured or semi-structured data. One of the most significant NLP tasks is Named Entity Recognition (NER), which seeks to extract the entities mentioned in a natural-language text and classify them into predefined categories, such as names of people or organizations, locations, dates, monetary values, and specific codes. A wide range of algorithms based on the LSTM (Long Short-Term Memory) architecture has been proposed to enhance NER accuracy. However, a key component of successful information extraction is the corpus used for NER training. Another key issue is the language being processed, since the vast majority of algorithms were designed to work with English. According to the literature, NER applied to English reaches about 90% accuracy, whereas for Portuguese this figure reaches a maximum of 83.38%. This paper proposes a methodology to improve Portuguese NER that uses journalistic corpora as the basis for training. We believe journalistic writing adheres most closely to the contemporary form of any language, since it preserves features such as objectivity, simplicity, and impartiality, and serves as a reference for transmitting information without ambiguity. The proposed methodology provides a model to extract entities and assesses the results obtained using Recurrent Neural Network architectures. To the best of our knowledge, the proposed methodology applied to Portuguese not only surpasses the average accuracy found in the literature, raising it from 83.38% to 85.64%, but could also decrease the computational cost of NER processing.
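
To make the "extract and classify" step concrete, the following minimal Python sketch groups token-level tags into labeled entities. It is an illustration only and not the authors' model: the BIO tagging scheme, the category names (PER, ORG, LOC), the helper name bio_to_spans, and the example sentence are all assumptions introduced here, and the tags are written by hand rather than predicted by a network.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, category) pairs.

    Hypothetical helper for illustration; assumes tags look like
    "B-PER", "I-PER", or "O".
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: close any entity that is still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens = [token]
            current_label = tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" tag, or an I- tag that does not continue the open entity:
            # close the open entity (if any) and skip this token.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    # Close an entity that runs to the end of the sentence.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Hand-labeled example sentence (assumed, not taken from the paper's corpus).
tokens = ["Maria", "Silva", "visitou", "a", "Petrobras", "no", "Rio", "de", "Janeiro", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC", "I-LOC", "O"]

entities = bio_to_spans(tokens, tags)
print(entities)
```

In a trained system, the tags list would come from the sequence model's per-token predictions; the grouping step shown here stays the same regardless of which architecture produces them.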

Metadata
Title
An Improved NER Methodology to the Portuguese Language
Authors
Rogerio de Aquino Silva
Luana da Silva
Moisés Lima Dutra
Gustavo Medeiros de Araujo
Publication date
08-09-2020
Publisher
Springer US
Published in
Mobile Networks and Applications / Issue 1/2021
Print ISSN: 1383-469X
Electronic ISSN: 1572-8153
DOI
https://doi.org/10.1007/s11036-020-01644-x