nach oben

Erschienen in:

2018 | OriginalPaper | Buchkapitel

A Low Dimensionality Representation for Language Variety Identification

verfasst von : Francisco Rangel, Marc Franco-Salvador, Paolo Rosso

Erschienen in: Computational Linguistics and Intelligent Text Processing

Verlag: Springer International Publishing

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Language variety identification aims at labelling texts in a native language (e.g. Spanish, Portuguese, English) with its specific variation (e.g. Argentina, Chile, Mexico, Peru, Spain; Brazil, Portugal; UK, US). In this work we propose a low dimensionality representation (LDR) to address this task with five different varieties of Spanish: Argentina, Chile, Mexico, Peru and Spain. We compare our LDR method with common state-of-the-art representations and show an increase in accuracy of \({\sim }\)35%. Furthermore, we compare LDR with two reference distributed representation models. Experimental results show competitive performance while dramatically reducing the dimensionality—and increasing the big data suitability—to only 6 features per variety. Additionally, we analyse the behaviour of the employed machine learning algorithms and the most discriminating features. Finally, we employ an alternative dataset to test the robustness of our low dimensionality representation with another set of similar languages.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Vorheriges Kapitel A Parallel Corpus of Translationese

Nächstes Kapitel Towards Empathetic Human-Robot Interactions

http://alt.qcri.org/LT4CloseLang/index.html.

http://corporavm.uni-koeln.de/vardial/sharedtask.html.

http://ttg.uni-saarland.de/lt4vardial2015/dsl.html.

It is important to highlight the importance of this aspect from an evaluation perspective in an author profiling scenario. In fact, if texts from the same authors are both part of the training and test sets, their particular style and vocabulary choice may contribute at training time to learn the profile of the authors. In consequence, over-fitting would be biasing the results.

Our hypothesis is that the distribution of weights for a given document should be closer to the weights of its corresponding language variety, therefore, we use the most common descriptive statistics to measure this variability among language varieties.

Using the LDR a document is represented by a total set of features equal to 6 multiplied by the number of categories (the 5 language varieties), in our case 30 features. This is a considerable dimensionality reduction that may be helpful to deal with big data environments.

The HispaBlogs dataset was collected by experts on social media from the Autoritas Consulting company (http://www.autoritas.net). Autoritas experts in the different countries selected popular bloggers related to politics, online marketing, technology or trends. The HispaBlogs dataset is publicly available at: https://github.com/autoritas/RD-Lab/tree/master/data/HispaBlogs.

We used 300-dimensional vectors, context windows of size 10, and 20 negative words for each sample. We preprocessed the text with word lowercase, tokenization, removing the words of length one, and with phrase detection using word2vec tools: https://code.google.com/p/word2vec/.

http://www.cs.waikato.ac.nz/ml/weka/.

We used SVM with default parameters and exhaustive correction code to transform the multiclass problem into a binary one.

http://ttg.uni-saarland.de/lt4vardial2015/dsl.html.

http://pan.webis.de.

http://www.clef-innitiative.org.

Franco-Salvador, M., Rangel, F., Rosso, P., Taulé, M., Antònia Martít, M.: Language variety identification using distributed representations of words and documents. In: Mothe, J., Savoy, J., Kamps, J., Pinel-Sauvagnat, K., Jones, G.J.F., SanJuan, E., Cappellato, L., Ferro, N. (eds.) CLEF 2015. LNCS, vol. 9283, pp. 28–40. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24027-5_3 CrossRef

Goodman, J.: Classes for fast maximum entropy training. In: Proceedings of the Acoustics, Speech, and Signal Processing (ICASSP 2001), vol. 1, pp. 561–564 (2001)

Gutmann, M.U., Hyvärinen, A.: Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res. 13, 307–361 (2012)MathSciNetMATH

Hinton, G.E., Mcclelland, J.L., Rumelhart, D.E.: Distributed Representations, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Foundations, vol. 1. MIT Press, Cambridge (1986)

Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014), vol. 32 (2014)

Maier, W., Gómez-Rodríguez, C.: Language variety identification in Spanish tweets. In: Workshop on Language Technology for Closely Related Languages and Language Variants (EMNLP 2014), pp. 25–35 (2014)

Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of Workshop at International Conference on Learning Representations (ICLR 2013) (2013)

Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)

Mnih, A., Teh, Y.W.: A fast and simple algorithm for training neural probabilistic language models. In: Proceedings of the 29th International Conference on Machine Learning (ICML 2012), pp. 1751–1758 (2012)

10.

Sadat, F., Kazemi, F., Farzindar, A.: Automatic identification of Arabic language varieties and dialects in social media. In: 1st International Workshop on Social Media Retrieval and Analysis (SoMeRa 2014) (2014)

11.

Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24(5), 513–523 (1988)CrossRef

12.

Tan, L., Zampieri, M., Ljubešic, N., Tiedemann, J.: Merging comparable data sources for the discrimination of similar languages: the DSL corpus collection. In: 7th Workshop on Building and Using Comparable Corpora Building Resources for Machine Translation Research (BUCC 2014), pp. 6–10 (2014)

13.

Zampieri, M., Gebrekidan-Gebre, B.: Automatic identification of language varieties: the case of Portuguese. In: Proceedings of the 11th Conference on Natural Language Processing (KONVENS 2012), pp. 233–237 (2012)

14.

Zampieri, M., Tan, L., Ljubeši, N., Tiedemann, J.: A report on the DSL shared task 2014. In: Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (VarDial 2014), pp. 58–67 (2014)

Titel: A Low Dimensionality Representation for Language Variety Identification
verfasst von: Francisco Rangel
Marc Franco-Salvador
Paolo Rosso
Verlag: Springer International Publishing
Buch: Computational Linguistics and Intelligent Text Processing
Print ISBN: 978-3-319-75486-4

Electronic ISBN: 978-3-319-75487-1

Copyright-Jahr: 2018
DOI: https://doi.org/10.1007/978-3-319-75487-1_13

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Premium Partner