
International Journal of Speech Technology

Published in:

01-06-2015

Source and system features for phone recognition

Authors: K. E. Manjunath, K. Sreenivasa Rao

Published in: International Journal of Speech Technology | Issue 2/2015


Abstract

In this work, we explore excitation source features in addition to vocal tract system features to improve the performance of phone recognition systems (PRSs). The excitation source information is derived by processing the linear prediction (LP) residual of the speech signal. The vocal tract information is captured using Mel-frequency cepstral coefficient (MFCC) features. The PRSs are developed using hidden Markov models (HMMs), with the TIMIT and Bengali speech databases used for training and evaluation. The robustness of the proposed excitation source features is demonstrated using speech samples corrupted by white and babble noise. The tandem PRSs are developed using the phone posteriors obtained from feedforward neural networks. The results show that tandem PRSs developed using the combination of excitation source and vocal tract system features outperform conventional tandem systems developed using system features alone. PRSs developed using the combined features are also more robust to noise than PRSs developed using vocal tract features alone.
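As an illustration of the LP residual mentioned in the abstract, the following minimal sketch estimates LP coefficients for a single frame with the autocorrelation method and the Levinson-Durbin recursion, then subtracts the predicted sample from the actual sample to obtain the residual. The function names, the default prediction order of 10, and the frame-wise treatment are assumptions made for this example; the abstract does not state the authors' analysis settings or how the residual is processed further into features.

```python
def autocorrelation(frame, max_lag):
    """Return autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations by the Levinson-Durbin recursion.

    Returns coefficients a[1..order] such that the sample s[n] is
    predicted as sum_k a[k] * s[n - k].
    """
    a = [0.0] * (order + 1)
    err = r[0]  # prediction error energy, updated at each order
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate (e.g. all-zero) frame
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for order i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]


def lp_residual(frame, order=10):
    """Return (LP coefficients, LP residual) for one frame.

    order=10 is an illustrative default, not a value from the paper.
    Samples before the start of the frame are treated as zero.
    """
    r = autocorrelation(frame, order)
    coeffs = levinson_durbin(r, order)
    residual = []
    for n in range(len(frame)):
        predicted = sum(coeffs[k] * frame[n - 1 - k]
                        for k in range(order) if n - 1 - k >= 0)
        residual.append(frame[n] - predicted)
    return coeffs, residual
```

Because the LP filter models the vocal tract envelope, the residual that remains after inverse filtering mostly carries excitation information, which is why it is a natural starting point for source features.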



Title: Source and system features for phone recognition
Authors: K. E. Manjunath, K. Sreenivasa Rao
Publication date: 01-06-2015
Publisher: Springer US
Published in: International Journal of Speech Technology / Issue 2/2015
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI: https://doi.org/10.1007/s10772-014-9266-0
