2020 | OriginalPaper | Book chapter

Impact of Text Specificity and Size on Word Embeddings Performance: An Empirical Evaluation in Brazilian Legal Domain

Written by: Thiago Raulino Dal Pont, Isabela Cristina Sabo, Jomi Fred Hübner, Aires José Rover

Published in: Intelligent Systems

Publisher: Springer International Publishing

Abstract

Word embedding is a text representation technique that captures syntactic and semantic linguistic patterns and represents each word as an n-dimensional dense vector. In the legal domain, trained word embeddings exist for languages such as English, Polish, and Chinese. However, to the best of our knowledge, there are no embeddings based on Portuguese (Brazilian or European) legal texts. Given that, our research question is: do the specificity and size of the text corpus used to train word embeddings contribute to more successful classification? To answer this question, we train word embedding models in the legal domain with different levels of specificity and size, and then evaluate their impact on text classification. To cover the different levels of specificity, we collect text documents from different courts of the Brazilian Judiciary, in hierarchical order. We use these text corpora to train word embedding models (GloVe) and then evaluate them by classifying processes with a deep learning model (CNN). From a context perspective, the results show that text specificity has a greater impact on embeddings trained on smaller corpora than on those trained on larger ones. From a corpus size perspective, the results show that the larger the corpus used for embedding training, the better the results. However, this effect diminishes as the corpus grows, up to a point beyond which additional words in the corpus have little impact on the results.
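
As a rough illustration of the step between embedding training and classification, the sketch below reads vectors in the plain-text layout that GloVe writes (one word per line, followed by its components) and builds an embedding matrix for a classifier vocabulary, recording which words the embeddings do not cover. It is only a sketch under stated assumptions: the function names `load_glove_text` and `build_embedding_matrix`, the toy words, and the vector values are hypothetical and are not taken from the authors' repository. The classifier itself is not shown.

```python
from io import StringIO


def load_glove_text(handle):
    """Parse GloVe's plain-text format: a word followed by its vector, one per line."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(vocabulary, vectors, dim):
    """Return one row per vocabulary word; uncovered words get a zero row."""
    matrix = []
    missing = []
    for word in vocabulary:
        if word in vectors:
            matrix.append(vectors[word])
        else:
            matrix.append([0.0] * dim)
            missing.append(word)
    return matrix, missing


# Toy stand-in for a trained embeddings file (hypothetical words and values).
toy_file = StringIO(
    "juiz 0.1 0.2 0.3\n"
    "recurso 0.4 0.5 0.6\n"
)
vocabulary = ["juiz", "recurso", "voo"]

vectors = load_glove_text(toy_file)
matrix, missing = build_embedding_matrix(vocabulary, vectors, dim=3)

# Share of the classifier vocabulary that the embeddings cover.
coverage = 1 - len(missing) / len(vocabulary)
```

Vocabulary coverage is one simple way to see why corpus specificity could matter: embeddings trained on text far from the target domain may leave more domain terms without a vector. Whether the authors measured coverage is not stated in the abstract.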

Code and Word Embeddings available at https://github.com/thiagordp/embeddings_in_law_paper.

Title: Impact of Text Specificity and Size on Word Embeddings Performance: An Empirical Evaluation in Brazilian Legal Domain
Written by: Thiago Raulino Dal Pont, Isabela Cristina Sabo, Jomi Fred Hübner, Aires José Rover
Publisher: Springer International Publishing
Book: Intelligent Systems
Print ISBN: 978-3-030-61376-1
Electronic ISBN: 978-3-030-61377-8
Copyright year: 2020
DOI: https://doi.org/10.1007/978-3-030-61377-8_36
