Published in: International Journal of Speech Technology 4/2022

04.08.2022

Robust acoustic domain identification with its application to speaker diarization

Authors: A Kishore Kumar, Shefali Waldekar, Md Sahidullah, Goutam Saha


Abstract

With the rise in multimedia content over the years, audio recordings come from an increasingly diverse range of recording environments. An audio processing system can therefore benefit from a front-end module that identifies the acoustic domain. In this paper, we demonstrate the idea of acoustic domain identification (ADI) for speaker diarization. We first present a detailed study of the various domains of the third DIHARD challenge, highlighting the factors that differentiate them from each other. Our main contribution is a simple and efficient solution for ADI, for which we explore speaker embeddings. We then integrate the ADI module with the speaker diarization framework of the DIHARD III challenge. When the thresholds for agglomerative hierarchical clustering are optimized according to the respective domains, performance improves substantially over the baseline: we achieve a relative improvement in DER of more than 5% and 8% for the core and full conditions, respectively, on Track 1 of the DIHARD III evaluation set.
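The two-stage pipeline described in the abstract, identifying a recording's acoustic domain from an embedding and then cutting the agglomerative-hierarchical-clustering (AHC) dendrogram at a domain-specific threshold, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the nearest-centroid ADI rule, the function names, and the threshold values are all assumptions for the sake of the example (the paper tunes thresholds on the DIHARD III development set).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

# Illustrative per-domain AHC distance thresholds (hypothetical values;
# in the paper these are optimized per domain on the development set).
DOMAIN_THRESHOLDS = {
    "audiobooks": 0.4,
    "broadcast_interview": 0.6,
    "meeting": 0.7,
}

def identify_domain(recording_embedding, domain_centroids):
    """Toy ADI: assign the recording to the nearest domain centroid
    in cosine distance. `domain_centroids` maps domain name -> vector."""
    names = list(domain_centroids)
    centroids = np.stack([domain_centroids[n] for n in names])
    dists = cdist(recording_embedding[None, :], centroids, metric="cosine")[0]
    return names[int(np.argmin(dists))]

def diarize_segments(segment_embeddings, domain):
    """Cluster per-segment speaker embeddings with average-linkage AHC,
    cutting the dendrogram at the threshold tuned for the given domain."""
    t = DOMAIN_THRESHOLDS[domain]
    Z = linkage(segment_embeddings, method="average", metric="cosine")
    return fcluster(Z, t=t, criterion="distance")  # one speaker label per segment
```

The design point the abstract makes is precisely that `t` is looked up per domain rather than fixed globally; everything downstream of the threshold is standard AHC.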


Metadata
Title
Robust acoustic domain identification with its application to speaker diarization
Authors
A Kishore Kumar
Shefali Waldekar
Md Sahidullah
Goutam Saha
Publication date
04.08.2022
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 4/2022
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-022-09990-9
