
2014 | Original Paper | Book Chapter

Structured GMM Based on Unsupervised Clustering for Recognizing Adult and Child Speech

Authors: Arseniy Gorin, Denis Jouvet

Published in: Statistical Language and Speech Processing

Publisher: Springer International Publishing


Abstract

Speaker variability is a well-known problem for state-of-the-art Automatic Speech Recognition (ASR) systems. In particular, handling children's speech is challenging because of substantial differences in the pronunciation of speech units between adult and child speakers. To build accurate ASR systems for all types of speakers, Hidden Markov Models (HMMs) with Gaussian mixture densities have been used intensively in combination with model adaptation techniques.
This paper compares different ways to improve the recognition of children's speech and describes a novel approach relying on a Class-Structured Gaussian Mixture Model (GMM).
A common solution for reducing speaker variability relies on gender and age adaptation. First, the paper proposes replacing the gender and age classes with classes obtained by unsupervised clustering, and uses these speaker classes to adapt the conventional HMM. Second, the speaker classes are used to initialize a structured GMM, in which the components of the Gaussian densities are structured with respect to the speaker classes. In a first approach, the mixture weights of the structured GMM are made dependent on the speaker class. In a second approach, the mixture weights are replaced by explicit dependencies between the Gaussian components of the mixture densities (as in stranded GMMs, but here the GMMs are class-structured).
The different approaches are evaluated and compared on the TIDIGITS task. The best improvement is achieved when the structured GMM is combined with feature adaptation.
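
The first structured-GMM variant described in the abstract (shared Gaussian components, mixture weights that depend on the speaker class) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the one-dimensional toy features, the two classes and all numeric values are assumptions chosen only to show the idea that a class changes which components dominate the density.

```python
import numpy as np


def log_gauss_diag(x, means, variances):
    """Log-density of x under each diagonal-covariance Gaussian component."""
    diff = x - means
    return -0.5 * np.sum(np.log(2.0 * np.pi * variances) + diff ** 2 / variances, axis=1)


def class_structured_loglik(x, means, variances, class_weights, c):
    """log p(x | class c): components are shared, only the weights depend on c."""
    a = np.log(class_weights[c]) + log_gauss_diag(x, means, variances)
    m = a.max()  # log-sum-exp for numerical stability
    return m + np.log(np.sum(np.exp(a - m)))


# Hypothetical toy model: 4 components, 1-D features.
# Components 0-1 are assumed to be initialized from class 0, components 2-3 from class 1.
means = np.array([[-2.0], [-1.0], [1.0], [2.0]])
variances = np.ones((4, 1))

# One weight vector per speaker class (each row sums to 1, all entries non-zero).
class_weights = np.array([
    [0.45, 0.45, 0.05, 0.05],  # class 0 favours its own components
    [0.05, 0.05, 0.45, 0.45],  # class 1 favours its own components
])

x = np.array([-1.5])  # a frame lying close to the class-0 components
ll_class0 = class_structured_loglik(x, means, variances, class_weights, 0)
ll_class1 = class_structured_loglik(x, means, variances, class_weights, 1)
print(ll_class0 > ll_class1)
```

In this sketch the Gaussian means and variances are shared across classes, so only one small weight vector per class has to be stored in addition to the common components.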


Metadata
Title
Structured GMM Based on Unsupervised Clustering for Recognizing Adult and Child Speech
Authors
Arseniy Gorin
Denis Jouvet
Copyright Year
2014
DOI
https://doi.org/10.1007/978-3-319-11397-5_8