Published in: Circuits, Systems, and Signal Processing | Issue 5/2024

10.02.2024

Enhancing Children’s Short Utterance-Based ASV Using Inverse Gamma-tone Filtered Cepstral Coefficients

Authors: Shahid Aziz, S. Shahnawazuddin

Abstract

Developing an automatic speaker verification (ASV) system for children’s speech is extremely challenging due to the dearth of domain-specific data. The challenges are further exacerbated in the case of short utterances, a relatively unexplored domain for children’s ASV. Voice-based biometric systems require an adequate amount of speech data for enrollment and verification; otherwise, performance degrades considerably. For this reason, the trade-off between convenience and security is difficult to maintain in practical scenarios. In this paper, we focus on data paucity and on preserving the higher-frequency content of children’s speech in order to enhance the performance of a short-utterance-based children’s speaker verification system. To deal with data scarcity, an out-of-domain data augmentation approach is proposed. Since the out-of-domain data comes from adult speakers, whose speech is acoustically very different from children’s speech, we employ techniques such as prosody modification, formant modification, and voice conversion to render it acoustically similar to children’s speech prior to augmentation. This not only increases the amount of training data but also effectively captures the missing target attributes, which boosts verification performance. In addition, we concatenate the classical Mel-frequency cepstral coefficient (MFCC) features with Gamma-tone frequency cepstral coefficient (GTF-CC) or Inverse Gamma-tone frequency cepstral coefficient (IGTF-CC) features. The concatenation of MFCC and IGTF-CC is employed with the sole intention of effectively modeling the human auditory system while preserving the higher-frequency content of children’s speech. This feature concatenation approach, when combined with data augmentation, further improves verification performance.
The experimental results support our claims: we achieve an overall relative reduction of \(38.5\%\) in equal error rate.

Metadata
Title
Enhancing Children’s Short Utterance-Based ASV Using Inverse Gamma-tone Filtered Cepstral Coefficients
Authors
Shahid Aziz
S. Shahnawazuddin
Publication date
10.02.2024
Publisher
Springer US
Published in
Circuits, Systems, and Signal Processing / Issue 5/2024
Print ISSN: 0278-081X
Electronic ISSN: 1531-5878
DOI
https://doi.org/10.1007/s00034-023-02592-z