
13.01.2022

Exploring single channel speech separation for short-time text-dependent speaker verification

Authors: Jiangyu Han, Yan Shi, Yanhua Long, Jiaen Liang

Published in: International Journal of Speech Technology | Issue 1/2022


Abstract

Automatic speaker verification (ASV) has achieved great progress in recent years. However, ASV performance degrades significantly when the test speech is corrupted by interfering speakers, especially when multiple talkers speak at the same time. Although target speech extraction (TSE) has also attracted increasing attention, its extraction ability is constrained by the pre-saved anchor speech examples it requires for the target speaker. Existing TSE methods therefore cannot be used directly to extract the desired test speech in an ASV test trial, because the speaker identity of each test utterance is unknown. Based on the state-of-the-art single-channel speech separation technique Conv-TasNet, this paper designs a test speech extraction mechanism for building short-time text-dependent speaker verification systems. Instead of providing a pre-saved anchor speech for each training or test speaker, we extract the desired test speech from a mixture by computing the pairwise dynamic time warping (DTW) distance between each Conv-TasNet output and the enrollment utterance of the claimed speaker model in each test trial. The acoustic domain mismatch between ASV and TSE training data, and the behavior of speech separation in different stages of ASV system building, such as voiceprint enrollment, testing, and the PLDA backend, are investigated in detail. Experimental results show that the proposed test speech extraction mechanism brings a significant relative improvement (36.3%) in overlapped multi-talker speaker verification; benefits are found not only in the ASV test stage but also in target speaker modeling.
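
The selection step described above can be illustrated with a minimal sketch, which is not the authors' implementation: given the waveforms produced by a Conv-TasNet-style separator and the enrollment utterance of the claimed speaker, the separated output whose feature sequence has the smallest DTW distance to the enrollment features is kept as the test speech. The extract_features callable and the plain quadratic DTW below are assumptions for illustration; the paper may use a different feature front-end and a faster DTW variant.

import numpy as np

def dtw_distance(X, Y):
    """Classic dynamic time warping distance between two feature
    sequences X (T1 x D) and Y (T2 x D), using Euclidean frame costs."""
    T1, T2 = len(X), len(Y)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            d = np.linalg.norm(X[i - 1] - Y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[T1, T2]

def select_test_speech(separated_waveforms, enrollment_waveform, extract_features):
    """Pick the separator output whose features are closest, in DTW
    distance, to the enrollment utterance of the claimed speaker.

    separated_waveforms: list of 1-D numpy arrays from the separator.
    enrollment_waveform: 1-D numpy array of the enrollment utterance.
    extract_features:    callable mapping a waveform to a (T x D) feature
                         array, e.g. MFCCs (hypothetical helper).
    """
    enroll_feats = extract_features(enrollment_waveform)
    distances = [dtw_distance(extract_features(w), enroll_feats)
                 for w in separated_waveforms]
    return separated_waveforms[int(np.argmin(distances))]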


Metadata
Title
Exploring single channel speech separation for short-time text-dependent speaker verification
Authors
Jiangyu Han
Yan Shi
Yanhua Long
Jiaen Liang
Publication date
13.01.2022
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 1/2022
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-022-09959-8
