Published in: Multimedia Systems 2/2021

06-01-2021 | Regular Paper

Multichannel speech separation using hybrid GOMF and enthalpy-based deep neural networks

Authors: Yannam Vasantha Koteswararao, C. B. Rama Rao



Abstract

Speech signals are commonly degraded by room reverberation and additive noise in real environments. This paper focuses on separating target speech signals from multichannel input signals under reverberant conditions. To overcome the drawbacks of existing methods, this work proposes an efficient multichannel speech separation technique based on a new hybrid method that combines grasshopper optimization-based matrix factorization (GOMF) and an enthalpy-based deep neural network (EDNN). To predict and remove the unwanted noise in the multichannel input signal, the paper presents a classification framework with the following steps: STFT, GOMF-based rank estimation, identification of signal eigenvalues, noise removal, feature extraction, and classification. First, the STFT is used to map the multichannel mixture waveforms to complex spectrograms. Then, GOMF is used to estimate the speech and noise components. After this estimation, important features are extracted, based on spatial, spectral, and directional features. To attain improved classification results, the spectrogram is reconstructed using the enthalpy-based deep neural network (EDNN). Finally, the resulting speech spectrogram is converted back to the separated output signal using the inverse STFT. Experimental results show that the proposed approach achieves the highest SNR of 24.0523 at −6 dB, compared with 18.50032 for DNN-JAT; RNN and NMF-DNN give the worst SNRs of 13.45434 and 12.29991, respectively. The proposed method is compared with various algorithms and existing works, and achieves better results than the existing approaches.
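The processing chain described above amounts to STFT analysis, time-frequency estimation of speech versus noise, and inverse-STFT synthesis. The following is a minimal sketch of that chain in Python using NumPy and SciPy only; it does not implement the paper's GOMF rank estimation or EDNN spectrogram reconstruction. The function estimate_speech_mask is a hypothetical placeholder (a simple magnitude-threshold mask) standing in for those stages, and the sample rate and frame length are assumed values.

```python
# Minimal sketch of the described pipeline: STFT -> speech/noise estimation -> inverse STFT.
# The GOMF and EDNN stages from the paper are NOT implemented here.
import numpy as np
from scipy.signal import stft, istft


def estimate_speech_mask(mix_spec):
    """Hypothetical stand-in for GOMF-based signal/noise estimation and EDNN
    spectrogram reconstruction: a simple magnitude-threshold mask per frequency bin."""
    mag = np.abs(mix_spec)
    return (mag > mag.mean(axis=-1, keepdims=True)).astype(float)


def separate(multichannel_audio, fs=16000, nperseg=512):
    # 1) Map each channel of the mixture waveform to a complex spectrogram (STFT).
    _, _, mix_spec = stft(multichannel_audio, fs=fs, nperseg=nperseg, axis=-1)

    # 2) Estimate a time-frequency mask for the target speech (placeholder stage).
    mask = estimate_speech_mask(mix_spec)

    # 3) Apply the mask and convert the spectrogram back to the time domain (inverse STFT).
    _, separated = istft(mask * mix_spec, fs=fs, nperseg=nperseg)
    return separated


if __name__ == "__main__":
    # Dummy two-channel, one-second noisy mixture as input.
    mixture = np.random.randn(2, 16000)
    out = separate(mixture)
    print(out.shape)
```

In the paper's method, the placeholder masking step would be replaced by the GOMF estimate of the speech and noise components and the EDNN-based spectrogram reconstruction; the STFT/inverse-STFT framing shown here is the common front end and back end for such mask-based multichannel separation.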


Metadata
Title
Multichannel speech separation using hybrid GOMF and enthalpy-based deep neural networks
Authors
Yannam Vasantha Koteswararao
C. B. Rama Rao
Publication date
06-01-2021
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 2/2021
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-020-00740-y
