Published in: International Journal of Speech Technology 1/2016

11.12.2015

Articulatory and excitation source features for speech recognition in read, extempore and conversation modes

Abstract

In our previous work, we explored articulatory and excitation source features to improve the performance of phone recognition systems (PRSs) using read speech corpora. In this work, we extend the use of articulatory and excitation source features to develop PRSs for the extempore and conversation modes of speech, in addition to read speech. It is well known that the overall performance of a speech recognition system depends heavily on the accuracy of phone recognition. The objective of this paper is therefore to enhance the accuracy of phone recognition systems using articulatory and excitation source features in addition to conventional spectral features. The articulatory features (AFs) are derived from the spectral features using feedforward neural networks (FFNNs). We consider five AF groups, namely manner, place, roundness, frontness and height. Five different AF-based tandem PRSs are developed using the combination of Mel-frequency cepstral coefficients (MFCCs) and the AFs derived from FFNNs. Hybrid PRSs are developed by combining the evidence from the AF-based tandem PRSs using a weighted combination approach. The excitation source information is derived by processing the linear prediction residual of the speech signal. The vocal tract information is captured using MFCCs, and the combination of vocal tract and excitation source features is used for developing PRSs. The PRSs are developed using hidden Markov models. A Bengali speech database is used for developing the PRSs for the read, extempore and conversation modes of speech. The results are analyzed and the performance is compared across the different modes of speech. From the results, it is observed that using either articulatory or excitation source features along with MFCCs improves the performance of PRSs in all three modes of speech. The improvement obtained using AFs is much higher than that obtained using excitation source features.
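The excitation source information mentioned in the abstract comes from inverse-filtering the speech signal with its own short-term linear prediction (LPC) model. The following is a minimal, self-contained sketch of that step, not the authors' implementation: the LPC order, frame handling, and absence of pre-emphasis or windowing are all simplifying assumptions here.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve for LPC coefficients a (with a[0] = 1) from autocorrelation r
    using the Levinson-Durbin recursion."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                     # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)               # prediction error update
    return a

def lp_residual(frame, order=10):
    """Estimate the LP residual (excitation signal) of a speech frame by
    inverse-filtering it with its own LPC polynomial."""
    # Autocorrelation lags 0..order of the frame
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = levinson_durbin(r, order)
    # Residual e[n] = s[n] + sum_j a[j] * s[n - j]  (FIR inverse filter)
    return np.convolve(frame, a)[:n]
```

Because the inverse filter removes the vocal-tract (spectral envelope) contribution, the residual of a strongly resonant signal has much lower energy than the signal itself, which is a quick sanity check for the routine.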


Metadata
Title
Articulatory and excitation source features for speech recognition in read, extempore and conversation modes
Publication date
11.12.2015
Published in
International Journal of Speech Technology / Issue 1/2016
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-015-9329-x
