2021 | OriginalPaper | Chapter

Audio Classification - Feature Dimensional Analysis


Abstract

An audio signal is an analogue signal represented as a one-dimensional function x(t), where t is the continuous variable denoting time. Such signals, generated by diverse sources, can be perceived as music, speech, noise, or any combination of these. For machines to interpret them, audio signals must be represented through the extraction of features that capture the composition of the signal and its behaviour over time. Audio feature extraction can improve the efficacy of audio processing and thereby benefits numerous applications. We present an emotion classification analysis comparing one-dimensional (1D) and two-dimensional (2D) audio representations, focusing on the recordings available in the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset; classification covers eight different emotions. We examine the accuracy metric, averaged over five iterations, for each audio signal representation (raw audio, normalized raw audio, and spectrogram), with the 1D and 2D features serving as input to a Convolutional Neural Network (CNN). A single-factor analysis of variance (ANOVA) was performed on the obtained accuracy values to test for significant differences between the audio signal representations. The obtained F-ratio exceeds the critical F-ratio and therefore lies in the critical region, providing evidence that, at the 0.05 significance level, the true mean accuracies of the different representations differ.
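As a rough illustration of the pipeline the abstract describes, the Python sketch below builds the three representations (raw audio, peak-normalized raw audio, spectrogram) from a single clip and runs a single-factor ANOVA over per-representation accuracies. The file path, the use of librosa for loading, and all accuracy values are assumptions for illustration only; the chapter does not publish its code, and the CNN training step is omitted here.

```python
# Minimal sketch, assuming librosa/scipy and placeholder data; not the
# authors' actual pipeline or results.
import numpy as np
import librosa
from scipy import stats

path = "ravdess_sample.wav"          # hypothetical RAVDESS clip
y, sr = librosa.load(path, sr=None)  # 1D raw waveform x(t) at native rate

raw = y                                 # raw audio, input to the 1D CNN
normalized = librosa.util.normalize(y)  # peak-normalized raw audio
spectrogram = np.abs(librosa.stft(y))   # 2D time-frequency input, 2D CNN

# Single-factor ANOVA over the five accuracy values recorded per
# representation (the numbers below are placeholders, not the paper's).
acc_raw        = [0.41, 0.43, 0.40, 0.44, 0.42]
acc_normalized = [0.46, 0.45, 0.47, 0.44, 0.46]
acc_spec       = [0.55, 0.57, 0.54, 0.56, 0.55]

f_ratio, p_value = stats.f_oneway(acc_raw, acc_normalized, acc_spec)
f_critical = stats.f.ppf(0.95, dfn=2, dfd=12)  # k-1 = 2, N-k = 15-3 = 12

# The null hypothesis of equal means is rejected when F > F_crit.
print(f"F = {f_ratio:.2f}, F_crit = {f_critical:.2f}, p = {p_value:.4f}")
```

With five accuracies for each of three representations, the test has 2 and 12 degrees of freedom, so at the 0.05 level the null hypothesis of equal mean accuracy is rejected whenever the computed F-ratio exceeds F_crit ≈ 3.89, which is the comparison the abstract reports.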


Metadata
Title
Audio Classification - Feature Dimensional Analysis
Authors
Olukayode Ayodele Onasoga
Nooraini Yusof
Nor Hazlyna Harun
Copyright Year
2021
DOI
https://doi.org/10.1007/978-3-030-69221-6_59
