A Multi-scale Subconvolutional U-Net with Time-Frequency Attention Mechanism for Single Channel Speech Enhancement

Authors: Sivaramakrishna Yechuri, Thirupathi Rao Komati, Rama Krishna Yellapragada, Sunnydaya Vanambathina

Published in: Circuits, Systems, and Signal Processing | Issue 9/2024

Abstract

Recent deep learning-based speech enhancement models have made extensive use of attention mechanisms, which have proved effective in achieving state-of-the-art results. This paper proposes a novel time-frequency attention (TFA) mechanism for speech enhancement, combined with a multi-scale subconvolutional U-Net (MSCUNet). The TFA extracts salient channel, frequency, and time information from the feature sets, improving speech intelligibility and quality. Channel attention is performed first, learning weights that represent the importance of each channel in the input feature set; frequency and time attention are then applied simultaneously, using learned weights to capture salient frequency bins and time frames. In addition, a U-Net-based multi-scale subconvolutional encoder-decoder uses different kernel sizes to extract local and contextual features from the noisy speech. The MSCUNet employs a feature calibration block that acts as a gating network to control the information flow among the layers, weighting the scaled features so that speech is retained and noise is suppressed. Central layers are further employed to exploit the interdependency among past, current, and future frames and thereby improve predictions. Experimental results show that the proposed TFAMSCUNet model outperforms several state-of-the-art methods.
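To make the attention pipeline described above concrete, the following is a minimal PyTorch sketch of a TFA-style block: channel attention is applied first, and frequency and time attention are then computed in parallel from the channel-reweighted features. The class name, layer sizes, pooling choices, and reduction factor are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn


class TimeFrequencyAttention(nn.Module):
    """Sketch of a TFA block: channel attention first, then frequency
    and time attention computed in parallel from the channel-reweighted
    features. All layer sizes are illustrative only."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel attention: squeeze over (time, freq), excite per channel.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        # 1-D convolutions that score each frequency bin / time frame.
        self.freq_conv = nn.Conv1d(channels, 1, kernel_size=1)
        self.time_conv = nn.Conv1d(channels, 1, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq)
        b, c, t, f = x.shape

        # Channel attention: global average pool, then an MLP gate.
        ch_weights = self.channel_mlp(x.mean(dim=(2, 3)))       # (b, c)
        x = x * ch_weights.view(b, c, 1, 1)

        # Frequency attention: pool over time, score each frequency bin.
        freq_desc = x.mean(dim=2)                                # (b, c, f)
        freq_weights = self.sigmoid(self.freq_conv(freq_desc))   # (b, 1, f)

        # Time attention: pool over frequency, score each time frame.
        time_desc = x.mean(dim=3)                                # (b, c, t)
        time_weights = self.sigmoid(self.time_conv(time_desc))   # (b, 1, t)

        # Apply both attention maps to the channel-reweighted features.
        x = x * freq_weights.unsqueeze(2)                        # broadcast over time
        x = x * time_weights.unsqueeze(3)                        # broadcast over freq
        return x


if __name__ == "__main__":
    # Toy spectrogram feature map: batch 2, 16 channels, 100 frames, 161 bins.
    feats = torch.randn(2, 16, 100, 161)
    tfa = TimeFrequencyAttention(channels=16)
    print(tfa(feats).shape)  # torch.Size([2, 16, 100, 161])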

Metadata
Title
A Multi-scale Subconvolutional U-Net with Time-Frequency Attention Mechanism for Single Channel Speech Enhancement
Authors
Sivaramakrishna Yechuri
Thirupathi Rao Komati
Rama Krishna Yellapragada
Sunnydaya Vanambathina
Publication date
28-05-2024
Publisher
Springer US
Published in
Circuits, Systems, and Signal Processing / Issue 9/2024
Print ISSN: 0278-081X
Electronic ISSN: 1531-5878
DOI
https://doi.org/10.1007/s00034-024-02721-2