Published in: International Journal of Speech Technology 1/2016

11.12.2015

Articulatory and excitation source features for speech recognition in read, extempore and conversation modes

Abstract

In our previous work, we explored articulatory and excitation source features to improve the performance of phone recognition systems (PRSs) using read speech corpora. In this work, we extend the use of articulatory and excitation source features to develop PRSs for the extempore and conversation modes of speech, in addition to read speech. It is well known that the overall performance of a speech recognition system depends heavily on the accuracy of phone recognition. The objective of this paper is therefore to enhance the accuracy of phone recognition systems using articulatory and excitation source features in addition to conventional spectral features. The articulatory features (AFs) are derived from the spectral features using feedforward neural networks (FFNNs). We consider five AF groups, namely manner, place, roundness, frontness and height. Five different AF-based tandem PRSs are developed using the combination of Mel-frequency cepstral coefficients (MFCCs) and the AFs derived from FFNNs. Hybrid PRSs are developed by combining the evidence from the AF-based tandem PRSs using a weighted combination approach. The excitation source information is derived by processing the linear prediction residual of the speech signal. The vocal tract information is captured using MFCCs, and the combination of vocal tract and excitation source features is used for developing PRSs. The PRSs are developed using hidden Markov models. A Bengali speech database is used for developing the PRSs for the read, extempore and conversation modes of speech. The results are analyzed and the performance is compared across the different modes of speech. From the results, it is observed that using either articulatory or excitation source features along with MFCCs improves the performance of PRSs in all three modes of speech. The improvement obtained using AFs is much higher than that obtained using excitation source features.
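The excitation source information mentioned in the abstract comes from inverse-filtering the speech signal with its own short-term linear prediction (LPC) model. The following is a minimal, self-contained sketch of that step, not the authors' implementation: the LPC order, frame handling, and absence of pre-emphasis or windowing are all simplifying assumptions here.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve for LPC coefficients a (with a[0] = 1) from autocorrelation r
    using the Levinson-Durbin recursion."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                     # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)               # prediction error update
    return a

def lp_residual(frame, order=10):
    """Estimate the LP residual (excitation signal) of a speech frame by
    inverse-filtering it with its own LPC polynomial."""
    # Autocorrelation lags 0..order of the frame
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = levinson_durbin(r, order)
    # Residual e[n] = s[n] + sum_j a[j] * s[n - j]  (FIR inverse filter)
    return np.convolve(frame, a)[:n]
```

Because the inverse filter removes the vocal-tract (spectral envelope) contribution, the residual of a strongly resonant signal has much lower energy than the signal itself, which is a quick sanity check for the routine.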


Metadata
Title
Articulatory and excitation source features for speech recognition in read, extempore and conversation modes
Publication date
11.12.2015
Published in
International Journal of Speech Technology / Issue 1/2016
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-015-9329-x
