
2020 | OriginalPaper | Chapter

FurcaNeXt: End-to-End Monaural Speech Separation with Dynamic Gated Dilated Temporal Convolutional Networks

Authors: Liwen Zhang, Ziqiang Shi, Jiqing Han, Anyan Shi, Ding Ma

Published in: MultiMedia Modeling

Publisher: Springer International Publishing


Abstract

Deep dilated temporal convolutional networks (TCNs) have proved very effective for sequence modeling. In this paper we propose several improvements to TCNs for end-to-end monaural speech separation: (1) a multi-scale dynamic weighted gated TCN with a pyramidal structure (FurcaPy), (2) a gated TCN with intra-parallel convolutional components (FurcaPa), (3) a weight-shared multi-scale gated TCN (FurcaSh), and (4) a dilated TCN with a gated subtractive-convolutional component (FurcaSu). Each of these networks takes the mixed utterance of two speakers and maps it to two separated utterances, each containing only one speaker's voice. As the training objective, we propose to directly optimize the utterance-level signal-to-distortion ratio (SDR) in a permutation invariant training (PIT) style. Our experiments on the public WSJ0-2mix corpus yield an 18.4 dB SDR improvement, showing that the proposed networks improve performance on the speaker separation task.
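To make the training objective concrete, the sketch below shows an utterance-level SDR loss optimized in a PIT style. This is not the authors' code: the function names (si_sdr, pit_sdr_loss), the choice of PyTorch, and the use of the common scale-invariant SDR (SI-SDR) formulation as a stand-in for the paper's exact SDR objective are all assumptions made for illustration.

```python
# Hypothetical sketch (not the authors' implementation): utterance-level
# SI-SDR loss with permutation invariant training (PIT) for 2 speakers.
from itertools import permutations

import torch


def si_sdr(estimate: torch.Tensor, target: torch.Tensor,
           eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB for (batch, time) waveforms."""
    # Zero-mean both signals so the projection below is well defined.
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to isolate the "true source" part.
    dot = torch.sum(estimate * target, dim=-1, keepdim=True)
    energy = torch.sum(target ** 2, dim=-1, keepdim=True) + eps
    s_target = dot / energy * target
    e_noise = estimate - s_target
    ratio = torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)


def pit_sdr_loss(estimates: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Negative SI-SDR under the best speaker permutation.

    estimates, targets: (batch, n_speakers, time).
    """
    n_spk = estimates.size(1)
    # Score every assignment of estimated channels to reference speakers.
    scores = []
    for perm in permutations(range(n_spk)):
        sdr = torch.stack(
            [si_sdr(estimates[:, i], targets[:, j]) for i, j in enumerate(perm)],
            dim=1,
        ).mean(dim=1)  # (batch,)
        scores.append(sdr)
    # Keep the permutation with the highest SDR per utterance; minimize its negative.
    best_sdr, _ = torch.stack(scores, dim=1).max(dim=1)
    return -best_sdr.mean()
```

In use, one would compute something like loss = pit_sdr_loss(model(mixture), torch.stack([s1, s2], dim=1)) for reference sources s1 and s2; because the loss searches over permutations, the network need not commit to a fixed speaker ordering.

The gated TCN variants named in the abstract all build on dilated 1-D convolutions with multiplicative gates. The following block is a generic gated dilated convolution in the WaveNet/GLU style, again only a sketch under stated assumptions: the class name is hypothetical, and the paper's FurcaPy/FurcaPa/FurcaSh/FurcaSu blocks add pyramidal multi-scale weighting, intra-parallel branches, weight sharing, or subtractive gating on top of such a unit rather than using it verbatim.

```python
class GatedDilatedBlock(torch.nn.Module):
    """One dilated 1-D conv block with a gated activation (illustrative only)."""

    def __init__(self, channels: int, kernel_size: int, dilation: int):
        super().__init__()
        # "Same" padding; assumes an odd kernel_size so length is preserved.
        pad = (kernel_size - 1) // 2 * dilation
        self.filter = torch.nn.Conv1d(channels, channels, kernel_size,
                                      padding=pad, dilation=dilation)
        self.gate = torch.nn.Conv1d(channels, channels, kernel_size,
                                    padding=pad, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gated activation: tanh(filter) modulated by sigmoid(gate),
        # wrapped in a residual connection around the block.
        return x + torch.tanh(self.filter(x)) * torch.sigmoid(self.gate(x))
```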


Metadata
Title
FurcaNeXt: End-to-End Monaural Speech Separation with Dynamic Gated Dilated Temporal Convolutional Networks
Authors
Liwen Zhang
Ziqiang Shi
Jiqing Han
Anyan Shi
Ding Ma
Copyright Year
2020
DOI
https://doi.org/10.1007/978-3-030-37731-1_53