Published in: International Journal of Speech Technology 3/2012

01.09.2012

Neural network based feature transformation for emotion independent speaker identification

Authors: Sreenivasa Rao Krothapalli, Jaynath Yadav, Sourjya Sarkar, Shashidhar G. Koolagudi, Anil Kumar Vuppala


Abstract

In this paper, we propose a neural network based feature transformation framework for developing an emotion independent speaker identification system. Most present speaker recognition systems do not perform well in emotional environments, yet in real life humans express emotions extensively during conversation to convey their messages effectively. We therefore propose a speaker recognition system that is robust to variations in the emotional moods of speakers. Neural network models are explored to transform the speaker specific spectral features from a given emotion to neutral. Eight emotions are considered: anger, sadness, disgust, fear, happiness, neutral, sarcasm and surprise. Emotional speech databases developed in Hindi, Telugu and German are used to analyze the effect of the proposed feature transformation on the performance of the speaker identification system. Spectral features are represented by mel-frequency cepstral coefficients (MFCCs), and speaker models are developed using Gaussian mixture models (GMMs). The performance of the speaker identification system is analyzed with various feature mapping techniques. The results demonstrate that the proposed neural network based feature transformation improves speaker identification performance by 20 %. Feature transformation at the syllable level performs better than transformation at the sentence level.
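
The following is a minimal sketch of the pipeline described in the abstract: a feedforward network maps emotional MFCC frames towards their neutral counterparts, and identification is then carried out by scoring the neutralized frames against per-speaker GMMs. The network topology, MFCC dimensionality, GMM size, and all data below are assumptions for illustration only (the abstract does not specify them); the synthetic arrays stand in for frame-aligned emotional/neutral MFCC pairs and per-speaker neutral training frames.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
DIM = 13          # assumed MFCC dimension
N_SPEAKERS = 4    # assumed number of enrolled speakers

# --- 1. Train the emotion-to-neutral feature mapping -----------------------
# X_emo / X_neu: frame-aligned MFCCs of the same utterance spoken in an
# emotional style and in neutral style (placeholder synthetic pairing here).
X_emo = rng.normal(size=(2000, DIM))
X_neu = 0.8 * X_emo + rng.normal(scale=0.1, size=(2000, DIM))

mapper = MLPRegressor(hidden_layer_sizes=(50,), max_iter=1000, random_state=0)
mapper.fit(X_emo, X_neu)   # learns the emotional -> neutral spectral mapping

# --- 2. Train one GMM speaker model on neutral MFCCs -----------------------
speaker_gmms = {}
for spk in range(N_SPEAKERS):
    frames = rng.normal(loc=spk, size=(1500, DIM))  # placeholder neutral frames
    gmm = GaussianMixture(n_components=16, covariance_type="diag",
                          max_iter=200, random_state=0)
    gmm.fit(frames)
    speaker_gmms[spk] = gmm

# --- 3. Identification: neutralize the test features, then score -----------
def identify(test_emotional_mfccs):
    """Map emotional MFCC frames towards neutral, then return the speaker
    whose GMM gives the highest average log-likelihood."""
    neutralized = mapper.predict(test_emotional_mfccs)
    scores = {spk: gmm.score(neutralized) for spk, gmm in speaker_gmms.items()}
    return max(scores, key=scores.get)

test_frames = rng.normal(loc=2, size=(300, DIM))
print("Identified speaker:", identify(test_frames))
```

In this sketch the transformation is applied frame by frame over a whole utterance; the syllable-level variant reported in the paper would apply the same mapping on syllable-sized segments (obtained from a separate segmentation step) before scoring.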


Metadata
Title
Neural network based feature transformation for emotion independent speaker identification
Authors
Sreenivasa Rao Krothapalli
Jaynath Yadav
Sourjya Sarkar
Shashidhar G. Koolagudi
Anil Kumar Vuppala
Publication date
01.09.2012
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 3/2012
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-012-9148-2
