Published in: Artificial Life and Robotics 2/2020

23-01-2020 | Original Article

The long short-term memory based on i-vector extraction for conversational speech gender identification approach

Authors: Barlian Henryranu Prasetio, Hiroki Tamura, Koichi Tanno


Abstract

Stress changes a speaker's voice characteristics. Emotional stress alters a person's speech pattern such that it is distributed non-normally along the temporal dimension. Consequently, methods that identify the gender of a non-stressed speaker are no longer effective for recognizing the gender of a speaker under stressful conditions. To address this issue, we propose a new gender identification framework. We leverage i-vector extraction to capture gender information from each speech segment, and a long short-term memory (LSTM) network then dynamically handles the temporal context features and learns long-term dependencies from the input. We evaluate the effectiveness of the proposed method, in terms of accuracy and the number of iterations needed to saturate, by comparing it with baseline methods on identifying the speaker's gender from conversations of different durations. By learning the gender information encoded in long-term dependencies, the proposed method outperforms the baselines and correctly identifies the speaker's gender in all conversation types.
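The abstract describes a two-stage pipeline: segment-level i-vector extraction followed by an LSTM that models long-term temporal dependencies for gender classification. Since the full text is not included here, the sketch below is only a minimal illustration of that idea in PyTorch; the i-vector dimensionality (400), hidden size, class labels, and all names are assumptions, and the i-vectors themselves are treated as precomputed inputs rather than reproduced from the authors' front end.

```python
# Minimal sketch (not the authors' code). Assumes gender-labelled i-vectors
# have already been extracted per speech segment, e.g. by a GMM-UBM /
# total-variability front end; dimensions and layer sizes are illustrative.
import torch
import torch.nn as nn


class IVectorLSTMGenderClassifier(nn.Module):
    """LSTM over a sequence of segment-level i-vectors -> gender logits."""

    def __init__(self, ivector_dim: int = 400, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(ivector_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 2)  # assumed classes: male / female

    def forward(self, ivectors: torch.Tensor) -> torch.Tensor:
        # ivectors: (batch, num_segments, ivector_dim)
        _, (h_n, _) = self.lstm(ivectors)   # h_n: (1, batch, hidden_dim)
        return self.classifier(h_n[-1])     # (batch, 2) class logits


# Usage on dummy data: a batch of 8 conversations, 20 segments each.
model = IVectorLSTMGenderClassifier()
dummy = torch.randn(8, 20, 400)
logits = model(dummy)
print(logits.shape)  # torch.Size([8, 2])
```

Summarizing a conversation with the final hidden state gives a fixed-size representation regardless of sequence length, which is consistent with the abstract's claim of handling conversations of different durations.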


Metadata
Title
The long short-term memory based on i-vector extraction for conversational speech gender identification approach
Authors
Barlian Henryranu Prasetio
Hiroki Tamura
Koichi Tanno
Publication date
23-01-2020
Publisher
Springer Japan
Published in
Artificial Life and Robotics / Issue 2/2020
Print ISSN: 1433-5298
Electronic ISSN: 1614-7456
DOI
https://doi.org/10.1007/s10015-020-00582-x
