2016 | OriginalPaper | Chapter

Bottleneck Based Front-End for Diarization Systems

Authors: Ignacio Viñals, Jesús Villalba, Alfonso Ortega, Antonio Miguel, Eduardo Lleida

Published in: Advances in Speech and Language Technologies for Iberian Languages

Publisher: Springer International Publishing

Abstract

This paper studies the use of deep learning in the speaker diarization task. We propose novel approaches at the feature extraction stage, replacing classical short-term features, such as MFCCs and PLPs, with deep-learning-based ones. These new features are the hidden states at the bottleneck layers of neural networks trained for ASR tasks.
The features are integrated into the University of Zaragoza ViVoLAB speaker diarization system, designed for the Multi-Genre Broadcast (MGB) challenge of the 2015 ASRU Workshop. This system, which follows the i-vector paradigm, uses the input features to segment the audio and to extract one i-vector per segment. The i-vectors are then clustered into speakers according to generative PLDA models.
The approach is evaluated on broadcast audio from the 2015 MGB Challenge.
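
To make the front-end idea concrete, the following is a minimal, dependency-free sketch of how bottleneck features could be read out of a feed-forward network and pooled per segment. It is not the authors' system. The layer sizes, the random weights standing in for a trained ASR network, the choice of linear (pre-activation) bottleneck outputs, and all function names are assumptions made for illustration. Segment averaging and cosine scoring are deliberately simple stand-ins for i-vector extraction and PLDA scoring, which the abstract names but does not detail.

```python
import math
import random


def dense(x, weights, biases):
    """Affine layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def random_layer(rng, n_out, n_in):
    """Hypothetical stand-in for the weights of a trained ASR network."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def bottleneck_features(frame, layers, bottleneck_index):
    """Forward one frame and stop at the bottleneck layer.

    Returns the linear activations of the bottleneck layer (an assumption);
    the layers after it, i.e. the ASR classifier, are never evaluated.
    """
    h = frame
    for i, (weights, biases) in enumerate(layers):
        z = dense(h, weights, biases)
        if i == bottleneck_index:
            return z
        h = sigmoid(z)
    raise ValueError("bottleneck_index is outside the network")


def segment_embedding(frames):
    """Average frame-level features over a segment.

    Simplified stand-in for an i-vector, not an i-vector extractor.
    """
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def cosine(a, b):
    """Cosine similarity, used here in place of PLDA scoring."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_network(rng, sizes):
    """Build layers for consecutive sizes, e.g. [39, 64, 20, 64, 100]."""
    return [random_layer(rng, n_out, n_in)
            for n_in, n_out in zip(sizes, sizes[1:])]
```

A possible usage, again with made-up dimensions (39-dimensional input frames, a 20-unit bottleneck as the second layer):

```python
rng = random.Random(0)
layers = build_network(rng, [39, 64, 20, 64, 100])
segment_a = [[rng.gauss(0.0, 1.0) for _ in range(39)] for _ in range(5)]
segment_b = [[rng.gauss(0.0, 1.0) for _ in range(39)] for _ in range(5)]
emb_a = segment_embedding([bottleneck_features(f, layers, 1) for f in segment_a])
emb_b = segment_embedding([bottleneck_features(f, layers, 1) for f in segment_b])
score = cosine(emb_a, emb_b)  # higher score suggests the same speaker
```

In a real diarization system, such pairwise scores would feed a clustering step that groups segments into speakers.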

Metadata
Copyright Year
2016
DOI
https://doi.org/10.1007/978-3-319-49169-1_27
