Skip to main content
Erschienen in: International Journal of Speech Technology 2/2015

01.06.2015

Source and system features for phone recognition

verfasst von: K. E. Manjunath, K. Sreenivasa Rao

Erschienen in: International Journal of Speech Technology | Ausgabe 2/2015

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

In this work, we have explored excitation source features in addition to vocal tract system features to improve the performance of phone recognition systems (PRSs). The excitation source information is derived by processing linear prediction residual of the speech signal. The vocal tract information is captured using Mel-frequency cepstral coefficient features. The PRSs are developed using hidden Markov models. The robustness of proposed excitation source features is demonstrated using white and babble noisy speech samples. In this work, TIMIT and Bengali speech databases are used for developing PRSs. The tandem PRSs are developed using the phone posteriors obtained from feedforward neural networks. From the results, it is observed that the tandem PRSs developed using the combination of excitation source and vocal tract system features, outperform the conventional tandem systems developed using system features alone. It is also observed that the PRSs developed using the combination of excitation source and vocal tract features, are more robust to noise than the PRSs developed using vocal tract features alone.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
Zurück zum Zitat Bourlard, H. A., & Morgan, N. (1994). Connnectionist speech recognition: A hybrid approach. Boston: Kluwer Academic Publishers.CrossRef Bourlard, H. A., & Morgan, N. (1994). Connnectionist speech recognition: A hybrid approach. Boston: Kluwer Academic Publishers.CrossRef
Zurück zum Zitat Chengalvarayan, R. (1998). On the use of normalized LPC Error towards better large vocabulary speech recognition systems. In IEEE international conference on acoustics, speech and signal processing. Chengalvarayan, R. (1998). On the use of normalized LPC Error towards better large vocabulary speech recognition systems. In IEEE international conference on acoustics, speech and signal processing.
Zurück zum Zitat Chetouani, M., Faundez-Zanuy, M., Gas, B., & Zarader, J. L. (2009). Investigation on LP-residual representations for speaker identification. Pattern Recognition, 42, 487–494.CrossRefMATH Chetouani, M., Faundez-Zanuy, M., Gas, B., & Zarader, J. L. (2009). Investigation on LP-residual representations for speaker identification. Pattern Recognition, 42, 487–494.CrossRefMATH
Zurück zum Zitat Csapo, T. G., & Nemeth, G. (2012). A novel codebook-based excitation model for use in speech synthesis. In International conference on cognitive infocommunications. Csapo, T. G., & Nemeth, G. (2012). A novel codebook-based excitation model for use in speech synthesis. In International conference on cognitive infocommunications.
Zurück zum Zitat Dhananjaya, N., Yegnanarayana, B., & Suryakanth, V. G. (2011). Acoustic-phonetic information from excitation source for refining manner hypotheses of a phone recognizer. In IEEE international conference on acoustics, speech and signal processing (ICASSP). Dhananjaya, N., Yegnanarayana, B., & Suryakanth, V. G. (2011). Acoustic-phonetic information from excitation source for refining manner hypotheses of a phone recognizer. In IEEE international conference on acoustics, speech and signal processing (ICASSP).
Zurück zum Zitat Fallside, F., Lucke, H., Marsland, T.P., O’Shea, P.J., Owen, M.S.J., Prager, R.W., Robinson, A.J., & Russell, N.H. (1990). Continuous speech recognition for the TIMIT database using neural networks. In ICASSP-90. Fallside, F., Lucke, H., Marsland, T.P., O’Shea, P.J., Owen, M.S.J., Prager, R.W., Robinson, A.J., & Russell, N.H. (1990). Continuous speech recognition for the TIMIT database using neural networks. In ICASSP-90.
Zurück zum Zitat Fant, G. (1979). Glottal source and excitation analysis. STL-QPSR, 20, 085–107. Fant, G. (1979). Glottal source and excitation analysis. STL-QPSR, 20, 085–107.
Zurück zum Zitat Graves, Alex, Mohamed, Abdel-rahman, & Hinton, Geoffrey (2013). Speech recognition with deep recurrent neural networks. In IEEE international conference on acoustics, speech and signal processing (ICASSP). Graves, Alex, Mohamed, Abdel-rahman, & Hinton, Geoffrey (2013). Speech recognition with deep recurrent neural networks. In IEEE international conference on acoustics, speech and signal processing (ICASSP).
Zurück zum Zitat Hayakawa, S., Takeda, K., & Itakura, F. (1997). Speaker identification using harmonic structure of LP-residual spectrum. Biometric personal Aunthentification, Lecture notes, 1206, 253–260. Hayakawa, S., Takeda, K., & Itakura, F. (1997). Speaker identification using harmonic structure of LP-residual spectrum. Biometric personal Aunthentification, Lecture notes, 1206, 253–260.
Zurück zum Zitat He, Jialong, Liu, Li, & Palm, G. (1996). On the use of residual cepstrum in speech recognition. In IEEE international conference on acoustics, speech, and signal processing (ICASSP). He, Jialong, Liu, Li, & Palm, G. (1996). On the use of residual cepstrum in speech recognition. In IEEE international conference on acoustics, speech, and signal processing (ICASSP).
Zurück zum Zitat Hermansky, H., Ellis, D. P. W., & Sharma, S. (2000). Tandem connectionist feature extraction for conventional HMM systems. In IEEE international conference on acoustics, speech and signal processing (ICASSP). Hermansky, H., Ellis, D. P. W., & Sharma, S. (2000). Tandem connectionist feature extraction for conventional HMM systems. In IEEE international conference on acoustics, speech and signal processing (ICASSP).
Zurück zum Zitat Hinton, G., Deng, Li, Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29, 82–97.CrossRef Hinton, G., Deng, Li, Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29, 82–97.CrossRef
Zurück zum Zitat Ketabdar, H., & Bourlard, H. (2008). Hierarchical integration of phonetic and lexical knowledge in phone posterior estimation. In IEEE international conference on acoustics, speech and signal processing (ICASSP). Ketabdar, H., & Bourlard, H. (2008). Hierarchical integration of phonetic and lexical knowledge in phone posterior estimation. In IEEE international conference on acoustics, speech and signal processing (ICASSP).
Zurück zum Zitat Lee, K.-F., & Hon, H.-W. (1989). Speaker-independent phone recognition using hidden markov models. IEEE Transactions on Acoustics, Speech and Signal Processing, 37, 1641–1648.CrossRef Lee, K.-F., & Hon, H.-W. (1989). Speaker-independent phone recognition using hidden markov models. IEEE Transactions on Acoustics, Speech and Signal Processing, 37, 1641–1648.CrossRef
Zurück zum Zitat Mahadeva Prasanna, S. R., Gupta, C. S., & Yegnanarayana, B. (2006). Extraction of speaker-specific excitation information from linear prediction residual of speech. Speech Communication, 48, 1243–1261.CrossRef Mahadeva Prasanna, S. R., Gupta, C. S., & Yegnanarayana, B. (2006). Extraction of speaker-specific excitation information from linear prediction residual of speech. Speech Communication, 48, 1243–1261.CrossRef
Zurück zum Zitat Makhoul, J. (1975). Linear prediction: A tutorial review. Proceedings of the IEEE, 63, 561–580.CrossRef Makhoul, J. (1975). Linear prediction: A tutorial review. Proceedings of the IEEE, 63, 561–580.CrossRef
Zurück zum Zitat Manjunath, K.E., & Sreenivasa Rao, K. (2014). Automatic phonetic transcription for read, extempore and conversation speech for an Indian language: Bengali. In NCC-2014. Manjunath, K.E., & Sreenivasa Rao, K. (2014). Automatic phonetic transcription for read, extempore and conversation speech for an Indian language: Bengali. In NCC-2014.
Zurück zum Zitat Manjunath, K. E., Sreenivasa Rao, K., & Pati, D. (2013). Development of Phonetic Engine for Indian languages: Bengali and Oriya. In 16th international oriental COCOSDA. Manjunath, K. E., Sreenivasa Rao, K., & Pati, D. (2013). Development of Phonetic Engine for Indian languages: Bengali and Oriya. In 16th international oriental COCOSDA.
Zurück zum Zitat Manjunath, K.E., Sunil Kumar, S. B., Pati, D., Satapathy, B., & Sreenivasa Rao, K. (2013). Development of consonant-vowel recognition systems for Indian languages : Bengali and Odia. In INDICON-2013. Manjunath, K.E., Sunil Kumar, S. B., Pati, D., Satapathy, B., & Sreenivasa Rao, K. (2013). Development of consonant-vowel recognition systems for Indian languages : Bengali and Odia. In INDICON-2013.
Zurück zum Zitat Mohamed, A., Dahl, G. E., & Hinton, G. (2012). Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20, 14–22.CrossRef Mohamed, A., Dahl, G. E., & Hinton, G. (2012). Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20, 14–22.CrossRef
Zurück zum Zitat Pati, D., & Mahadeva Prasanna, S. R. (2008). Non-Parametric Vector Quantization of Excitation Source Information for Speaker Recognition. In IEEE region 10 conference TENCON. Pati, D., & Mahadeva Prasanna, S. R. (2008). Non-Parametric Vector Quantization of Excitation Source Information for Speaker Recognition. In IEEE region 10 conference TENCON.
Zurück zum Zitat Pati, D., & Mahadeva Prasanna, S. R. (2012). Speaker verification using excitation source information. The International Journal of Speech Technology (Springer), 15, 241–257.CrossRef Pati, D., & Mahadeva Prasanna, S. R. (2012). Speaker verification using excitation source information. The International Journal of Speech Technology (Springer), 15, 241–257.CrossRef
Zurück zum Zitat Pati, D., & Mahadeva Prasanna, S. R. (2013). A comparative study of explicit and implicit modelling of subsegmental speaker-specific excitation source information. Sadhana, 38, 591–620.CrossRefMathSciNet Pati, D., & Mahadeva Prasanna, S. R. (2013). A comparative study of explicit and implicit modelling of subsegmental speaker-specific excitation source information. Sadhana, 38, 591–620.CrossRefMathSciNet
Zurück zum Zitat Rabiner, L., Juang, B.-H., & Yagnanarayana, B. (2008). Fundamentals of speech recognition. Singapore: Pearson Education. Rabiner, L., Juang, B.-H., & Yagnanarayana, B. (2008). Fundamentals of speech recognition. Singapore: Pearson Education.
Zurück zum Zitat Sri Rama Murty, K., & Yegnanarayana, B. (2006). Combining evidence from residual phase and MFCC features for speaker recognition. IEEE Signal Processing Letters, 13, 52–55.CrossRef Sri Rama Murty, K., & Yegnanarayana, B. (2006). Combining evidence from residual phase and MFCC features for speaker recognition. IEEE Signal Processing Letters, 13, 52–55.CrossRef
Zurück zum Zitat Stevens, K. N. (1998). Acoustic phonetics. Cambridge, MA: MIT Press. Stevens, K. N. (1998). Acoustic phonetics. Cambridge, MA: MIT Press.
Zurück zum Zitat Sunil Kumar, S. B., Sreenivasa Rao, K., & Pati, D. (2013). Phonetic and prosodically rich transcribed speech corpus in indian languages : Bengali and Odia. In 16th international oriental COCOSDA. Sunil Kumar, S. B., Sreenivasa Rao, K., & Pati, D. (2013). Phonetic and prosodically rich transcribed speech corpus in indian languages : Bengali and Odia. In 16th international oriental COCOSDA.
Zurück zum Zitat Titze, I. R. (2008). Nonlinear sourcefilter coupling in phonation: Theory. Journal of the Acoustical Society of America, 123(5), 2733–2749.CrossRef Titze, I. R. (2008). Nonlinear sourcefilter coupling in phonation: Theory. Journal of the Acoustical Society of America, 123(5), 2733–2749.CrossRef
Zurück zum Zitat Varga, A., & Steeneken, H. J. (1993). Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12, 247–251. Varga, A., & Steeneken, H. J. (1993). Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12, 247–251.
Zurück zum Zitat Yegnanarayana, B., Mahadeva Prasanna, S. R., Duraiswami, R., & Zotkin, D. (2005). Processing of reverberant speech for time-delay estimation. IEEE Transactions on Audio, Speech, and Language Processing, 13, 1110–1118.CrossRef Yegnanarayana, B., Mahadeva Prasanna, S. R., Duraiswami, R., & Zotkin, D. (2005). Processing of reverberant speech for time-delay estimation. IEEE Transactions on Audio, Speech, and Language Processing, 13, 1110–1118.CrossRef
Metadaten
Titel
Source and system features for phone recognition
verfasst von
K. E. Manjunath
K. Sreenivasa Rao
Publikationsdatum
01.06.2015
Verlag
Springer US
Erschienen in
International Journal of Speech Technology / Ausgabe 2/2015
Print ISSN: 1381-2416
Elektronische ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-014-9266-0

Weitere Artikel der Ausgabe 2/2015

International Journal of Speech Technology 2/2015 Zur Ausgabe

Neuer Inhalt