
2020 | OriginalPaper | Book chapter

Impact of Text Specificity and Size on Word Embeddings Performance: An Empirical Evaluation in Brazilian Legal Domain

Authors: Thiago Raulino Dal Pont, Isabela Cristina Sabo, Jomi Fred Hübner, Aires José Rover

Published in: Intelligent Systems

Publisher: Springer International Publishing


Abstract

Word embeddings are a text representation technique that captures syntactic and semantic linguistic patterns by representing each word as an n-dimensional dense vector. In the legal domain, trained word embeddings exist for languages such as English, Polish, and Chinese. However, to the best of our knowledge, there are no embeddings based on Portuguese (Brazilian or European) legal texts. Given that, our research question is: do the specificity and size of the text corpus used to train word embeddings contribute to more successful classification? To answer this question, we train word embedding models in the legal domain with different levels of specificity and size, and then evaluate their impact on text classification. To cover the different levels of specificity, we collect text documents from different courts of the Brazilian Judiciary, in hierarchical order. We use these text corpora to train word embedding models (GloVe) and then evaluate them by classifying processes with a deep learning model (CNN). From a context perspective, the results show that text specificity has a higher impact on word embeddings trained on smaller corpora than on those trained on larger ones. From a corpus-size perspective, the results show that the larger the corpus used for embedding training, the better the results. However, this effect diminishes as the corpus grows, up to a point where additional words in the corpus have little impact on the results.
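To make the corpus-size variable concrete, the following minimal sketch cuts a corpus down to different token budgets, counts the distance-weighted co-occurrences that GloVe is fit to, and measures how much of a downstream task's vocabulary each budget covers. It does not train GloVe or a CNN and is not the authors' pipeline. The function names (`truncate_to_tokens`, `cooccurrence_counts`, `vocabulary_coverage`), the window size, and the toy sentences are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def truncate_to_tokens(documents, max_tokens):
    """Return the leading tokens of the corpus, up to max_tokens."""
    tokens = []
    for doc in documents:
        for tok in doc.lower().split():
            if len(tokens) >= max_tokens:
                return tokens
            tokens.append(tok)
    return tokens


def cooccurrence_counts(tokens, window=2):
    """Count symmetric co-occurrences inside a fixed window.

    Each pair contributes 1/distance, the weighting described in the
    GloVe paper. The window size here is an arbitrary assumption.
    """
    counts = Counter()
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), i):
            weight = 1.0 / (i - j)
            counts[(center, tokens[j])] += weight
            counts[(tokens[j], center)] += weight
    return counts


def vocabulary_coverage(embedding_vocab, task_tokens):
    """Fraction of task tokens that would have a vector available."""
    if not task_tokens:
        return 0.0
    hits = sum(tok in embedding_vocab for tok in task_tokens)
    return hits / len(task_tokens)


# Invented toy data (hypothetical; not the authors' corpora).
toy_corpus = [
    "the court dismissed the appeal",
    "the judge granted the appeal in part",
    "the airline must compensate the consumer for the delayed flight",
]
task_tokens = "the consumer filed an appeal against the airline".split()

for budget in (5, 12, 22):
    tokens = truncate_to_tokens(toy_corpus, budget)
    vocab = set(tokens)
    pairs = cooccurrence_counts(tokens, window=2)
    coverage = vocabulary_coverage(vocab, task_tokens)
    print(budget, len(vocab), len(pairs), round(coverage, 2))
```

In a real experiment, the co-occurrence counts for each budget would be passed to a GloVe trainer, and the resulting vectors would initialize the classifier's embedding layer. Coverage is only a rough proxy for the effect the abstract reports.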

Metadata
Title
Impact of Text Specificity and Size on Word Embeddings Performance: An Empirical Evaluation in Brazilian Legal Domain
Authors
Thiago Raulino Dal Pont
Isabela Cristina Sabo
Jomi Fred Hübner
Aires José Rover
Copyright year
2020
DOI
https://doi.org/10.1007/978-3-030-61377-8_36