Published in: International Journal of Speech Technology 4/2016

20-10-2016

Speaker diarization system using MKMFCC parameterization and WLI-fuzzy clustering

Authors: V. Subba Ramaiah, R. Rajeswara Rao

Abstract

Speaker diarization is the process of determining “who spoke when?”, that is, assigning speaker labels to the time regions in which each speaker was active. In previous work, a model-based speaker diarization system was built using tangential weighted Mel frequency cepstral coefficients as the feature parameters for voice activity detection and the Lion optimization algorithm for clustering the audio stream into speaker groups. In this paper, a speaker diarization system is proposed using multiple kernel weighted Mel frequency cepstral coefficient (MKMFCC) parameterization and Wu-and-Li Index (WLI)-fuzzy clustering. First, MKMFCC, which uses multiple kernels such as the tangential and exponential kernels to weight the MFCCs, is proposed for feature parameterization. Second, a clustering algorithm called WLI-fuzzy clustering is proposed for grouping the segments that belong to the same speaker. The proposed system is evaluated on the publicly available ELSDSR corpus, using an audio signal containing seven different speakers. Performance is analysed using the diarization error rate, F-measure and false alarm rate. The results show that the proposed system tracks the active speakers among multiple speakers with improved tracking accuracy.
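
As a rough illustration of two of the evaluation measures named above, the sketch below computes the diarization error rate and the F-measure from their commonly used definitions. The abstract does not state the exact formulas used in the paper, so the definitions, the function names and all input values here are assumptions chosen for illustration only.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Commonly used DER definition (an assumption, not taken from the paper):
    the sum of missed speech, false alarm and speaker confusion time,
    divided by the total reference speech time."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


def f_measure(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical durations in seconds, not results from the paper.
der = diarization_error_rate(
    missed=2.0, false_alarm=1.0, confusion=3.0, total_speech=60.0
)
# Hypothetical precision and recall values.
f1 = f_measure(precision=0.8, recall=0.5)

print(f"DER = {der:.3f}, F-measure = {f1:.3f}")
```

A lower diarization error rate and a higher F-measure both indicate better speaker tracking under these definitions.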

Metadata
Title
Speaker diarization system using MKMFCC parameterization and WLI-fuzzy clustering
Authors
V. Subba Ramaiah
R. Rajeswara Rao
Publication date
20-10-2016
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 4/2016
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-016-9384-y
