International Journal of Speech Technology

20-10-2016

Speaker diarization system using MKMFCC parameterization and WLI-fuzzy clustering

Authors: V. Subba Ramaiah, R. Rajeswara Rao

Published in: International Journal of Speech Technology | Issue 4/2016

Abstract

Speaker diarization is the process of determining “who spoke when?”, that is, of assigning speaker labels to the time regions in which each speaker was active. In previous work, model-based speaker diarization was performed using tangential weighted Mel frequency cepstral coefficients (MFCCs) as the feature parameters for voice activity detection and the Lion optimization algorithm for clustering the audio stream into speaker groups. In this paper, a speaker diarization system is proposed that uses multiple kernel weighted Mel frequency cepstral coefficient (MKMFCC) parameterization and Wu-and-Li Index (WLI)-fuzzy clustering. First, MKMFCC, which weights the MFCCs with multiple kernels such as the tangential and exponential kernels, is proposed for feature parameterization. Second, a clustering algorithm called WLI-fuzzy clustering is proposed for grouping the segments that belong to the same speaker. The proposed system is evaluated on the publicly available ELSDSR corpus, using audio signals with seven different speakers. Its performance is analysed using diarization error rate, F-measure, and false alarm rate. The results show that the proposed system performs better at tracking the active speakers among multiple speakers, with improved tracking accuracy.
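The abstract names diarization error rate as an evaluation measure but does not give its formula. The following is a minimal sketch of the commonly used frame-level definition (missed speech, false alarm, and speaker confusion, each divided by the number of reference speech frames). It is not the authors' evaluation code. The function name `diarization_error_rate`, the per-frame label lists, and the toy data are assumptions made for illustration, and the sketch assumes hypothesis speaker labels have already been mapped onto reference labels.

```python
def diarization_error_rate(reference, hypothesis):
    """Frame-level DER sketch (illustrative, not the paper's implementation).

    reference, hypothesis: equal-length sequences holding one speaker label
    per frame, or None for non-speech frames. Hypothesis labels are assumed
    to be already mapped onto the reference label set.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
            if hyp is None:
                missed += 1          # speech in reference, silence in output
            elif hyp != ref:
                confusion += 1       # speech attributed to the wrong speaker
        elif hyp is not None:
            false_alarm += 1         # output claims speech where there is none

    if speech == 0:
        raise ValueError("reference contains no speech frames")

    # All components are normalised by the number of reference speech frames.
    return {
        "missed": missed / speech,
        "false_alarm": false_alarm / speech,
        "confusion": confusion / speech,
        "der": (missed + false_alarm + confusion) / speech,
    }


# Toy example with two hypothetical speakers, "A" and "B".
reference = ["A", "A", "A", None, "B", "B", "B", "B", None, "A"]
hypothesis = ["A", "A", "B", "B", "B", "B", None, "B", None, "A"]
scores = diarization_error_rate(reference, hypothesis)
print(scores)
```

In this toy example one frame is missed, one is a false alarm, and one is confused, out of eight reference speech frames. Time-based scoring tools additionally search for the best mapping between hypothesis and reference speakers and often apply a tolerance collar around segment boundaries; both are omitted here.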

Title: Speaker diarization system using MKMFCC parameterization and WLI-fuzzy clustering
Authors: V. Subba Ramaiah, R. Rajeswara Rao
Publication date: 20-10-2016
Publisher: Springer US
Published in: International Journal of Speech Technology / Issue 4/2016
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI: https://doi.org/10.1007/s10772-016-9384-y
