Published in: International Journal of Speech Technology 2/2012

01-06-2012 | Original Paper

Multilingual recognition of non-native speech using acoustic model transformation and pronunciation modeling

Authors: G. Bouselmi, D. Fohr, I. Illina


Abstract

This article presents an approach for the automatic recognition of non-native speech. Some non-native speakers tend to pronounce phonemes as they would in their native language. Model adaptation can improve the recognition rate for non-native speakers, but it has difficulty dealing with pronunciation errors such as phoneme insertions or substitutions. For these pronunciation mismatches, pronunciation modeling can make the recognition system more robust. Our approach is based on acoustic model transformation and pronunciation modeling for multiple non-native accents. For acoustic model transformation, two approaches are evaluated: maximum a posteriori (MAP) adaptation and model re-estimation. For pronunciation modeling, confusion rules (alternate pronunciations) are automatically extracted from a small non-native speech corpus. This paper presents a novel approach for introducing these automatically learned confusion rules into the recognition system. The modified hidden Markov model (HMM) of a spoken-language phoneme includes its canonical pronunciation along with all the alternate non-native pronunciations, so that spoken-language phonemes pronounced correctly by a non-native speaker can still be recognized. We evaluate our approaches on the non-native corpus of the European project HIWIRE, which contains English sentences pronounced by French, Italian, Greek and Spanish speakers. Two cases are studied: the native language of the test speaker is either known or unknown. Our approach gives better recognition results than the classical acoustic adaptation of HMMs when the foreign origin of the speaker is known. We obtain a 22% reduction in word error rate (WER) compared to the reference system.
Furthermore, we take into account the written form of the spoken words, since non-native speakers may rely on the spelling of a word in order to pronounce it. Taking the written form into account does not provide any further improvement.
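
The confusion-rule idea described above can be illustrated with a minimal sketch. It assumes that each canonical spoken-language phoneme has already been aligned with the sequence of native-language phonemes actually recognized for it. The phoneme labels, the `min_prob` threshold and both function names are hypothetical and chosen only for illustration; the abstract does not state the extraction criteria used in the paper.

```python
from collections import Counter, defaultdict


def extract_confusion_rules(aligned_pairs, min_prob=0.2):
    """Estimate confusion rules from aligned phoneme data.

    aligned_pairs: iterable of (canonical_phoneme, observed_phonemes), where
    observed_phonemes is a sequence of native-language phoneme labels.
    Returns {canonical: {observed_tuple: relative_frequency}}, keeping only
    alternates whose relative frequency is at least min_prob (an assumed,
    illustrative pruning criterion).
    """
    counts = defaultdict(Counter)
    for canonical, observed in aligned_pairs:
        counts[canonical][tuple(observed)] += 1

    rules = {}
    for canonical, observed_counts in counts.items():
        total = sum(observed_counts.values())
        kept = {
            observed: n / total
            for observed, n in observed_counts.items()
            if n / total >= min_prob
        }
        if kept:
            rules[canonical] = kept
    return rules


def pronunciation_alternatives(canonical, rules):
    """List the canonical pronunciation first, then its alternates.

    This mirrors the idea that the modified model keeps the canonical path
    next to the non-native ones, so a correct pronunciation is still covered.
    """
    alternates = rules.get(canonical, {})
    ordered = sorted(alternates, key=lambda obs: (-alternates[obs], obs))
    return [(canonical,)] + [alt for alt in ordered if alt != (canonical,)]


if __name__ == "__main__":
    # Hypothetical alignments: English phoneme -> observed native phonemes.
    pairs = [
        ("th", ("s",)),
        ("th", ("s",)),
        ("th", ("z",)),
        ("th", ("t", "h")),
        ("ih", ("i",)),
    ]
    rules = extract_confusion_rules(pairs, min_prob=0.3)
    print(pronunciation_alternatives("th", rules))
```

In this sketch a phoneme with no learned rule simply keeps its canonical pronunciation, which corresponds to leaving its model unchanged.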


Footnotes
1 For native speakers it would be more efficient to use a native ASR.
 
Metadata
Title
Multilingual recognition of non-native speech using acoustic model transformation and pronunciation modeling
Authors
G. Bouselmi
D. Fohr
I. Illina
Publication date
01-06-2012
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 2/2012
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-012-9134-8
