
2020 | OriginalPaper | Chapter

FurcaNeXt: End-to-End Monaural Speech Separation with Dynamic Gated Dilated Temporal Convolutional Networks

Authors: Liwen Zhang, Ziqiang Shi, Jiqing Han, Anyan Shi, Ding Ma

Published in: MultiMedia Modeling

Publisher: Springer International Publishing


Abstract

Deep dilated temporal convolutional networks (TCNs) have proved very effective for sequence modeling. In this paper we propose several improvements to TCNs for end-to-end monaural speech separation: (1) a multi-scale dynamic weighted gated TCN with a pyramidal structure (FurcaPy), (2) a gated TCN with intra-parallel convolutional components (FurcaPa), (3) a weight-shared multi-scale gated TCN (FurcaSh), and (4) a dilated TCN with a gated subtractive-convolutional component (FurcaSu). Each of these networks takes the mixed utterance of two speakers and maps it to two separated utterances, each containing only one speaker's voice. As the training objective, we propose to directly optimize the utterance-level signal-to-distortion ratio (SDR) in a permutation invariant training (PIT) style. Our experiments on the public WSJ0-2mix corpus yield an 18.4 dB SDR improvement, showing that the proposed networks improve performance on the speaker separation task.
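To make the training objective concrete, the sketch below shows an utterance-level SDR loss optimized in a PIT style. This is not the authors' code: the function names (si_sdr, pit_sdr_loss), the choice of PyTorch, and the use of the common scale-invariant SDR (SI-SDR) formulation as a stand-in for the paper's exact SDR objective are all assumptions made for illustration.

```python
# Hypothetical sketch (not the authors' implementation): utterance-level
# SI-SDR loss with permutation invariant training (PIT) for 2 speakers.
from itertools import permutations

import torch


def si_sdr(estimate: torch.Tensor, target: torch.Tensor,
           eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB for (batch, time) waveforms."""
    # Zero-mean both signals so the projection below is well defined.
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to isolate the "true source" part.
    dot = torch.sum(estimate * target, dim=-1, keepdim=True)
    energy = torch.sum(target ** 2, dim=-1, keepdim=True) + eps
    s_target = dot / energy * target
    e_noise = estimate - s_target
    ratio = torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)


def pit_sdr_loss(estimates: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Negative SI-SDR under the best speaker permutation.

    estimates, targets: (batch, n_speakers, time).
    """
    n_spk = estimates.size(1)
    # Score every assignment of estimated channels to reference speakers.
    scores = []
    for perm in permutations(range(n_spk)):
        sdr = torch.stack(
            [si_sdr(estimates[:, i], targets[:, j]) for i, j in enumerate(perm)],
            dim=1,
        ).mean(dim=1)  # (batch,)
        scores.append(sdr)
    # Keep the permutation with the highest SDR per utterance; minimize its negative.
    best_sdr, _ = torch.stack(scores, dim=1).max(dim=1)
    return -best_sdr.mean()
```

In use, one would compute something like loss = pit_sdr_loss(model(mixture), torch.stack([s1, s2], dim=1)) for reference sources s1 and s2; because the loss searches over permutations, the network need not commit to a fixed speaker ordering.

The gated TCN variants named in the abstract all build on dilated 1-D convolutions with multiplicative gates. The following block is a generic gated dilated convolution in the WaveNet/GLU style, again only a sketch under stated assumptions: the class name is hypothetical, and the paper's FurcaPy/FurcaPa/FurcaSh/FurcaSu blocks add pyramidal multi-scale weighting, intra-parallel branches, weight sharing, or subtractive gating on top of such a unit rather than using it verbatim.

```python
class GatedDilatedBlock(torch.nn.Module):
    """One dilated 1-D conv block with a gated activation (illustrative only)."""

    def __init__(self, channels: int, kernel_size: int, dilation: int):
        super().__init__()
        # "Same" padding; assumes an odd kernel_size so length is preserved.
        pad = (kernel_size - 1) // 2 * dilation
        self.filter = torch.nn.Conv1d(channels, channels, kernel_size,
                                      padding=pad, dilation=dilation)
        self.gate = torch.nn.Conv1d(channels, channels, kernel_size,
                                    padding=pad, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gated activation: tanh(filter) modulated by sigmoid(gate),
        # wrapped in a residual connection around the block.
        return x + torch.tanh(self.filter(x)) * torch.sigmoid(self.gate(x))
```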


Metadata
Title
FurcaNeXt: End-to-End Monaural Speech Separation with Dynamic Gated Dilated Temporal Convolutional Networks
Authors
Liwen Zhang
Ziqiang Shi
Jiqing Han
Anyan Shi
Ding Ma
Copyright Year
2020
DOI
https://doi.org/10.1007/978-3-030-37731-1_53