Published in: International Journal of Speech Technology 4/2022

04.08.2022

Robust acoustic domain identification with its application to speaker diarization

Authors: A Kishore Kumar, Shefali Waldekar, Md Sahidullah, Goutam Saha


Abstract

With the rise in multimedia content over the years, audio recordings come from an increasingly diverse range of recording environments. An audio processing system can therefore benefit from a front-end module that identifies the acoustic domain. In this paper, we demonstrate the idea of acoustic domain identification (ADI) for speaker diarization. We first present a detailed study of the various domains of the third DIHARD challenge, highlighting the factors that differentiate them from each other. Our main contribution is a simple and efficient solution for ADI, for which we explore speaker embeddings. We then integrate the ADI module with the speaker diarization framework of the DIHARD III challenge. When the thresholds for agglomerative hierarchical clustering are optimized according to the respective domains, performance improves substantially over the baseline: we achieve a relative improvement in DER of more than 5% and 8% for the core and full conditions, respectively, on Track 1 of the DIHARD III evaluation set.
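The two-stage pipeline described in the abstract, identifying a recording's acoustic domain from an embedding and then cutting the agglomerative-hierarchical-clustering (AHC) dendrogram at a domain-specific threshold, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the nearest-centroid ADI rule, the function names, and the threshold values are all assumptions for the sake of the example (the paper tunes thresholds on the DIHARD III development set).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

# Illustrative per-domain AHC distance thresholds (hypothetical values;
# in the paper these are optimized per domain on the development set).
DOMAIN_THRESHOLDS = {
    "audiobooks": 0.4,
    "broadcast_interview": 0.6,
    "meeting": 0.7,
}

def identify_domain(recording_embedding, domain_centroids):
    """Toy ADI: assign the recording to the nearest domain centroid
    in cosine distance. `domain_centroids` maps domain name -> vector."""
    names = list(domain_centroids)
    centroids = np.stack([domain_centroids[n] for n in names])
    dists = cdist(recording_embedding[None, :], centroids, metric="cosine")[0]
    return names[int(np.argmin(dists))]

def diarize_segments(segment_embeddings, domain):
    """Cluster per-segment speaker embeddings with average-linkage AHC,
    cutting the dendrogram at the threshold tuned for the given domain."""
    t = DOMAIN_THRESHOLDS[domain]
    Z = linkage(segment_embeddings, method="average", metric="cosine")
    return fcluster(Z, t=t, criterion="distance")  # one speaker label per segment
```

The design point the abstract makes is precisely that `t` is looked up per domain rather than fixed globally; everything downstream of the threshold is standard AHC.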


Metadata
Title
Robust acoustic domain identification with its application to speaker diarization
Authors
A Kishore Kumar
Shefali Waldekar
Md Sahidullah
Goutam Saha
Publication date
04.08.2022
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 4/2022
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-022-09990-9
