Published in: The Journal of Supercomputing 10/2020

13 June 2020

Comparative studies on machine learning for paralinguistic signal compression and classification

Authors: Seokhyun Byun, Seunghyun Yoon, Kyomin Jung


Abstract

In this paper, we focus on various compression and classification algorithms for three different paralinguistic signal classification tasks. These tasks are difficult even for humans because the acoustic cues carried by such signals are hard to distinguish. Therefore, when machine learning techniques are applied to analyze paralinguistic signals, several different aspects of speech-related information, such as prosody, energy, and cepstral information, are usually considered for feature extraction. However, when the training corpus is not sufficiently large, it is extremely difficult to apply machine learning directly to classify such signals because of their high feature dimensionality; this problem is also known as the curse of dimensionality. This paper proposes to address this limitation by means of feature compression. First, we present experimental results obtained by applying various compression algorithms to eliminate redundancy in the signal features. We observe that the compressed features retain a discriminative ability comparable to that of the original features, especially when used with a fully connected neural network classifier. Second, we calculate the output distribution of the F1-score for each emotion in the speech emotion recognition problem and show that the fully connected neural network classifier performs more stably than other classical methods.
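The compress-then-classify pipeline summarized above can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the authors' implementation: it uses synthetic data in place of real utterance-level paralinguistic features (e.g., openSMILE functionals), PCA as one representative compression algorithm, and scikit-learn's MLPClassifier as the fully connected network, then reports per-class F1 scores in the spirit of the paper's stability analysis.

```python
# Minimal sketch of a compress-then-classify pipeline for high-dimensional
# paralinguistic features. Synthetic data, the choice of PCA, and the
# network shape are all assumptions made for illustration only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Stand-in corpus: 500 utterances, each described by a 6,000-dimensional
# feature vector (a real system would use a paralinguistic feature set
# such as openSMILE functionals), with 4 emotion classes.
X = rng.standard_normal((500, 6000))
y = rng.integers(0, 4, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Step 1: compress the high-dimensional features to remove redundancy and
# mitigate the curse of dimensionality on a small training corpus.
pca = PCA(n_components=100).fit(X_train)
Z_train, Z_test = pca.transform(X_train), pca.transform(X_test)

# Step 2: classify the compressed features with a small fully connected
# neural network.
clf = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=300,
                    random_state=0).fit(Z_train, y_train)

# Per-class F1 scores (average=None returns one score per emotion class).
print(f1_score(y_test, clf.predict(Z_test), average=None))
```

Other compressors (for instance, an autoencoder) could be swapped in for PCA; the paper compares several such algorithms, whereas this sketch fixes one concrete instance for clarity.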


Metadata
Title
Comparative studies on machine learning for paralinguistic signal compression and classification
Authors
Seokhyun Byun
Seunghyun Yoon
Kyomin Jung
Publication date
13 June 2020
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 10/2020
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-020-03346-3
