2019 | OriginalPaper | Book chapter

Expanding N-grams for Code-Switch Language Models

Abstract

It has become common, especially among urban youth, for people to use more than one language in everyday conversation, a phenomenon linguists call “code-switching”. With rising globalization and the spread of code-switching in multilingual societies, Natural Language Processing (NLP) applications are increasingly expected to handle such mixed data. In this paper, we present our efforts in language modeling for code-switch Arabic-English. Training a language model (LM) requires large amounts of text in the respective language. The main challenge in language modeling for code-switch languages, however, is the lack of available data. We propose an approach that artificially generates code-switch Arabic-English n-grams and thereby improves the language model. The approach expands the relatively small available corpus and its corresponding n-grams using translation-based methods. The final LM achieved relative improvements of 1.97% in perplexity and 16.36% in out-of-vocabulary (OOV) rate.
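
The abstract does not spell out the expansion procedure, so the following is only a minimal sketch of what translation-based n-gram expansion could look like, not the authors' method. The toy lexicon, the transliterated words, and the function names `expand_ngram`, `oov_rate`, and `relative_improvement` are all hypothetical and chosen for illustration. The sketch also shows how an OOV rate and a relative improvement, computed as 100 × (baseline − improved) / baseline, might be calculated.

```python
from itertools import product

# Hypothetical toy lexicon (transliterated Arabic -> English).
# Illustrative only; it is not taken from the paper.
LEXICON = {
    "kitab": ["book"],
    "jadid": ["new"],
}


def expand_ngram(ngram, lexicon):
    """Return every variant of `ngram` obtained by optionally swapping
    each word for one of its translations. The original n-gram itself
    is excluded. Some variants may end up fully translated."""
    # For each position: the original word plus any known translations.
    options = [[word] + lexicon.get(word, []) for word in ngram]
    variants = {tuple(choice) for choice in product(*options)}
    variants.discard(tuple(ngram))
    return variants


def oov_rate(tokens, vocab):
    """Fraction of tokens that do not appear in the vocabulary."""
    if not tokens:
        return 0.0
    return sum(token not in vocab for token in tokens) / len(tokens)


def relative_improvement(baseline, improved):
    """Relative reduction in percent; suits metrics where lower is
    better, such as perplexity or OOV rate."""
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    for variant in sorted(expand_ngram(("kitab", "jadid"), LEXICON)):
        print(" ".join(variant))
```

In a real pipeline the generated variants would presumably need counts or probabilities before they could be merged into an n-gram model; how that is done is not stated in the abstract and is left out here.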

Metadata
Title
Expanding N-grams for Code-Switch Language Models
Authors
Injy Hamed
Mohamed Elmahdy
Slim Abdennadher
Copyright year
2019
DOI
https://doi.org/10.1007/978-3-319-99010-1_20