
10 June 2022

Single-channel Multi-speakers Speech Separation Based on Isolated Speech Segments

Authors: Shanfa Ke, Zhongyuan Wang, Ruimin Hu, Xiaochen Wang

Published in: Neural Processing Letters | Issue 1/2023

Abstract

In real multi-speaker scenarios, the signal captured by a microphone contains many time periods in which only one speaker is active; these are called isolated speech segments. Exploiting this fact, this paper proposes a single-channel multi-speaker speech separation method based on the similarity between each speaker's feature center and the mixture features in a deep embedding space. Specifically, the isolated speech segments extracted from the observed signal are converted to deep embedding vectors, from which a feature center is created for each speaker. The similarity between this center and the deep embedding features of the mixture is used as a mask for the corresponding speaker, which separates that speaker's speech. A residual deep embedding network built from stacked 2-D convolutional blocks, in place of bi-directional long short-term memory, is proposed for faster inference and better feature extraction. In addition, an isolated speech segment extraction method based on Chimera++ is proposed, since prior experiments showed that the Chimera++ algorithm separates segments containing only one speaker well. Evaluation on standard datasets shows that the proposed method outperforms competing algorithms by up to 0.94 dB in Signal-to-Distortion Ratio.
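
As a rough illustration of the masking step described in the abstract, the following NumPy sketch computes a feature center per speaker from isolated-segment embeddings and turns center-to-bin similarities into soft masks. This is not the authors' implementation: the mean pooling, cosine similarity, and softmax normalization are assumptions, and all names are hypothetical.

```python
import numpy as np

def similarity_masks(mix_emb, iso_embs):
    """Build per-speaker soft masks from embedding similarity.

    mix_emb:  (T*F, D) array, one D-dim embedding per time-frequency
              bin of the mixture spectrogram.
    iso_embs: list of (N_i, D) arrays, embeddings of each speaker's
              isolated speech segments.
    Returns:  (S, T*F) array of soft masks, one row per speaker.
    """
    # Speaker feature centers: mean embedding over each speaker's
    # isolated-segment bins (the pooling choice is an assumption).
    centers = np.stack([e.mean(axis=0) for e in iso_embs])       # (S, D)

    # Unit-normalize both sides so the dot product is a cosine
    # similarity between each center and each bin embedding.
    centers /= np.linalg.norm(centers, axis=1, keepdims=True)
    mix = mix_emb / np.linalg.norm(mix_emb, axis=1, keepdims=True)
    sim = centers @ mix.T                                        # (S, T*F)

    # Softmax across speakers: the masks of all speakers sum to one
    # in every time-frequency bin.
    e = np.exp(sim - sim.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)
```

In such a scheme, each mask row would be reshaped to (T, F), applied to the mixture magnitude spectrogram, and the separated waveform recovered via inverse STFT with the mixture phase.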


Metadata
Title
Single-channel Multi-speakers Speech Separation Based on Isolated Speech Segments
Authors
Shanfa Ke
Zhongyuan Wang
Ruimin Hu
Xiaochen Wang
Publication date
10 June 2022
Publisher
Springer US
Published in
Neural Processing Letters / Issue 1/2023
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI
https://doi.org/10.1007/s11063-022-10887-6
