2019 | OriginalPaper | Chapter

Expanding N-grams for Code-Switch Language Models

Abstract

It has become common, especially among urban youth, for people to use more than one language in their everyday conversations, a phenomenon linguists refer to as “code-switching”. With the rise of globalization and the spread of code-switching in multilingual societies, there is growing demand for Natural Language Processing (NLP) applications that can handle such mixed data. In this paper, we present our efforts in language modeling for code-switch Arabic-English. Training a language model (LM) requires large amounts of text data in the respective language. However, the main challenge in language modeling for code-switch languages is the lack of available data. We propose an approach that artificially generates code-switch Arabic-English n-grams and thereby improves the language model. This was done by expanding the relatively small available corpus and its corresponding n-grams using translation-based approaches. The final LM achieved relative improvements in perplexity and out-of-vocabulary (OOV) rate of 1.97% and 16.36%, respectively.
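To make the idea of translation-based n-gram expansion concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical word-level translation lexicon, substitutes one word per n-gram, and lets each generated variant inherit the count of its source n-gram; all three choices are assumptions made for illustration. The helper names (`extract_ngrams`, `expand_ngrams`, `vocabulary`, `oov_rate`, `relative_improvement`) and the toy data are likewise hypothetical. Relative improvement is computed here as (baseline - improved) / baseline, which is the usual convention but is not spelled out in the abstract.

```python
from collections import Counter


def extract_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def expand_ngrams(ngram_counts, lexicon):
    """Add code-switched variants of each n-gram.

    Assumption: one word per variant is replaced by a translation from
    `lexicon`, and the variant inherits the count of the source n-gram.
    """
    expanded = Counter(ngram_counts)
    for ngram, count in ngram_counts.items():
        for pos, word in enumerate(ngram):
            for translation in lexicon.get(word, []):
                variant = ngram[:pos] + (translation,) + ngram[pos + 1:]
                expanded[variant] += count
    return expanded


def vocabulary(ngram_counts):
    """Collect the set of words that occur in any n-gram."""
    return {word for ngram in ngram_counts for word in ngram}


def oov_rate(test_tokens, vocab):
    """Fraction of test tokens that are not in the vocabulary."""
    if not test_tokens:
        return 0.0
    unknown = sum(1 for token in test_tokens if token not in vocab)
    return unknown / len(test_tokens)


def relative_improvement(baseline, improved):
    """Relative reduction of a lower-is-better metric (perplexity, OOV rate)."""
    return (baseline - improved) / baseline


# Hypothetical toy data: a tiny English corpus and an English-to-Arabic lexicon.
train_tokens = ["I", "read", "a", "new", "book"]
toy_lexicon = {"book": ["كتاب"], "new": ["جديد"]}

base_counts = Counter(extract_ngrams(train_tokens, 2))
expanded_counts = expand_ngrams(base_counts, toy_lexicon)

# A code-switched test sentence containing one Arabic word.
test_tokens = ["a", "جديد", "book", "today"]
base_oov = oov_rate(test_tokens, vocabulary(base_counts))
expanded_oov = oov_rate(test_tokens, vocabulary(expanded_counts))

print("bigrams before/after expansion:", len(base_counts), len(expanded_counts))
print("OOV rate before/after expansion:", base_oov, expanded_oov)
print("relative OOV improvement:", relative_improvement(base_oov, expanded_oov))
```

In this toy setting the expanded n-gram set covers translated words that never appeared in the original corpus, which is why the OOV rate on mixed-language test text drops. A real system would also need to estimate probabilities for the generated n-grams, which this sketch does not attempt.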


Metadata
Title
Expanding N-grams for Code-Switch Language Models
Authors
Injy Hamed
Mohamed Elmahdy
Slim Abdennadher
Copyright Year
2019
DOI
https://doi.org/10.1007/978-3-319-99010-1_20
