Published in: International Journal of Speech Technology 4/2015

06-08-2015

i-Vectors in speech processing applications: a survey

Authors: Pulkit Verma, Pradip K. Das

Abstract

Many methods have been proposed over time in the domain of speech recognition, such as Gaussian mixture models (GMM), GMM with a universal background model (the GMM-UBM framework), and joint factor analysis. i-Vector subspace modeling is one of the more recent methods and has become the state-of-the-art technique in this domain. Its main benefit is that it models both intra-domain and inter-domain variabilities in the same low-dimensional space. In this survey, we present a comprehensive collection of research work related to i-vectors since their inception. Some recent trends of using i-vectors in combination with other approaches are also discussed. The application of i-vectors in various fields of speech recognition, such as speaker, language, and accent recognition, is also presented. This paper should serve as a good starting point for anyone interested in working with i-vectors for speech processing in general. We conclude the paper with a brief discussion on the future of i-vectors.


Literature
go back to reference Adami, A., Mihaescu, R., Reynolds, D., & Godfrey. J. (2003). Modeling prosodic dynamics for speaker recognition. 2003 IEEE international conference on, acoustics, speech, and signal processing, 2003, proceedings, (ICASSP ’03). (Vol. 4), pp. IV-788-91. doi:10.1109/ICASSP.2003.1202761. Adami, A., Mihaescu, R., Reynolds, D., & Godfrey. J. (2003). Modeling prosodic dynamics for speaker recognition. 2003 IEEE international conference on, acoustics, speech, and signal processing, 2003, proceedings, (ICASSP ’03). (Vol. 4), pp. IV-788-91. doi:10.​1109/​ICASSP.​2003.​1202761.
go back to reference Alam, M. J., Ouellet, P., Kenny, P., & O’Shaughnessy, D. D. (2011) Comparative evaluation of feature normalization techniques for speaker verification. Advances in nonlinear speech processing—proceedings of 5th international conference on nonlinear speech processing, NOLISP 2011, Las Palmas de Gran Canaria. Retrieved November 7–9, 2011, pp. 246–253. doi:10.1007/978-3-642-25020-0_32. Alam, M. J., Ouellet, P., Kenny, P., & O’Shaughnessy, D. D. (2011) Comparative evaluation of feature normalization techniques for speaker verification. Advances in nonlinear speech processing—proceedings of 5th international conference on nonlinear speech processing, NOLISP 2011, Las Palmas de Gran Canaria. Retrieved November 7–9, 2011, pp. 246–253. doi:10.​1007/​978-3-642-25020-0_​32.
go back to reference Aronowitz, H. (2014). Inter dataset variability compensation for speaker recognition. IEEE international conference on acoustics, speech and signal processing (ICASSP), 2014, pp. 4002–4006. doi:10.1109/ICASSP.2014.6854353. Aronowitz, H. (2014). Inter dataset variability compensation for speaker recognition. IEEE international conference on acoustics, speech and signal processing (ICASSP), 2014, pp. 4002–4006. doi:10.​1109/​ICASSP.​2014.​6854353.
go back to reference Aronowitz, H., & Barkan, O. (2012). Efficient approximated i-vector extraction. 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4789–4792. doi:10.1109/ICASSP.2012.6288990. Aronowitz, H., & Barkan, O. (2012). Efficient approximated i-vector extraction. 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4789–4792. doi:10.​1109/​ICASSP.​2012.​6288990.
go back to reference Bahari, M., Saeidi, R., Van hamme, H., & Van Leeuwen, D. (2013). Accent recognition using i-vector, gaussian mean supervector and gaussian posterior probability supervector for spontaneous telephone speech. 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 7344–7348. doi:10.1109/ICASSP.2013.6639089. Bahari, M., Saeidi, R., Van hamme, H., & Van Leeuwen, D. (2013). Accent recognition using i-vector, gaussian mean supervector and gaussian posterior probability supervector for spontaneous telephone speech. 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 7344–7348. doi:10.​1109/​ICASSP.​2013.​6639089.
go back to reference Behravan, H., Hautamäki, V., & Kinnunen, T. (2015). Factors affecting i-vector based foreign accent recognition: A case study in spoken finnish. Speech Communication, 66, 118–129. doi:10.1016/j.specom.2014.CrossRef Behravan, H., Hautamäki, V., & Kinnunen, T. (2015). Factors affecting i-vector based foreign accent recognition: A case study in spoken finnish. Speech Communication, 66, 118–129. doi:10.​1016/​j.​specom.​2014.CrossRef
go back to reference Biswas, S., Rohdin, J., & Shinoda, K. (2014). i-Vector selection for effective PLDA modeling in speaker recognition. Proceedings Odyssey 2014—The speaker and language recognition workshop. pp. 100–105. Biswas, S., Rohdin, J., & Shinoda, K. (2014). i-Vector selection for effective PLDA modeling in speaker recognition. Proceedings Odyssey 2014—The speaker and language recognition workshop. pp. 100–105.
go back to reference Burget, L., Plchot, O., Cumani, S., Glembek, O., Matejka, P., & Brümmer, N. (2011). Discriminatively trained probabilistic linear discriminant analysis for speaker verification. Proceedings of the IEEE international conference on acoustics, speech, and signal processing, ICASSP 2011. Retrieved May 22–27, 2011. Prague Congress Center, Prague. pp. 4832–4835, doi10.1109/ICASSP.2011.5947437. Burget, L., Plchot, O., Cumani, S., Glembek, O., Matejka, P., & Brümmer, N. (2011). Discriminatively trained probabilistic linear discriminant analysis for speaker verification. Proceedings of the IEEE international conference on acoustics, speech, and signal processing, ICASSP 2011. Retrieved May 22–27, 2011. Prague Congress Center, Prague. pp. 4832–4835, doi10.​1109/​ICASSP.​2011.​5947437.
go back to reference Busso, C., Bulut, M., Lee, C. C., Kazemzadeh, A., Mower, E., Kim, S., et al. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4), 335–359. doi:10.1007/s10579-008-9076-6.CrossRef Busso, C., Bulut, M., Lee, C. C., Kazemzadeh, A., Mower, E., Kim, S., et al. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4), 335–359. doi:10.​1007/​s10579-008-9076-6.CrossRef
go back to reference Chen, L., & Yang, Y. (2011). Applying emotional factor analysis and i-Vector to emotional speaker recognition. In Z. Sun, J. Lai, X. Chen, & T. Tan (Eds.), Biometric recognition, lecture notes in computer science (Vol. 7098, pp. 174–179). Berlin: Springer. doi:10.1007/978-3-642-25449-9-22.CrossRef Chen, L., & Yang, Y. (2011). Applying emotional factor analysis and i-Vector to emotional speaker recognition. In Z. Sun, J. Lai, X. Chen, & T. Tan (Eds.), Biometric recognition, lecture notes in computer science (Vol. 7098, pp. 174–179). Berlin: Springer. doi:10.​1007/​978-3-642-25449-9-22.CrossRef
go back to reference Chen, L., & Yang, Y. (2013). Emotional speaker recognition based on i-vector through atom aligned sparse representation. 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 7760–7764. doi:10.1109/ICASSP.2013.6639174. Chen, L., & Yang, Y. (2013). Emotional speaker recognition based on i-vector through atom aligned sparse representation. 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 7760–7764. doi:10.​1109/​ICASSP.​2013.​6639174.
go back to reference Chen, N., Shen, W., & Campbell, J. (2010). A linguistically-informative approach to dialect recognition using dialect-discriminating context-dependent phonetic models. 2010 IEEE international conference on acoustics speech and signal processing (ICASSP), pp. 5014–5017. doi10.1109/ICASSP.2010.5495068. Chen, N., Shen, W., & Campbell, J. (2010). A linguistically-informative approach to dialect recognition using dialect-discriminating context-dependent phonetic models. 2010 IEEE international conference on acoustics speech and signal processing (ICASSP), pp. 5014–5017. doi10.​1109/​ICASSP.​2010.​5495068.
go back to reference Cheng, Y. C., Hautamaki, V., Huang, Z., Li, K., & Lee, C. H. (2014). An i-vector based descriptor for alphabetical gesture recognition. 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 6593–6597. doi10.1109/ICASSP.2014.6854875. Cheng, Y. C., Hautamaki, V., Huang, Z., Li, K., & Lee, C. H. (2014). An i-vector based descriptor for alphabetical gesture recognition. 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 6593–6597. doi10.​1109/​ICASSP.​2014.​6854875.
go back to reference Cumani, S., Glembek, O., Brümmer, N., de Villiers, E., & Laface, P. (2012). Gender independent discriminative speaker recognition in i-vector space. 2012 IEEE international conference on acoustics, speech and signal processing, ICASSP 2012, Kyoto. Retrieved March 25–30, 2012. pp. 4361–4364. doi:10.1109/ICASSP.2012.6288885. Cumani, S., Glembek, O., Brümmer, N., de Villiers, E., & Laface, P. (2012). Gender independent discriminative speaker recognition in i-vector space. 2012 IEEE international conference on acoustics, speech and signal processing, ICASSP 2012, Kyoto. Retrieved March 25–30, 2012. pp. 4361–4364. doi:10.​1109/​ICASSP.​2012.​6288885.
go back to reference Dehak, N. (2009). Discriminative and generative approaches for long- and short-term speaker characteristics modeling: Application to speaker verification. PhD thesis, Ecole de Technologie Superieure (Canada), aAINR50490. Dehak, N. (2009). Discriminative and generative approaches for long- and short-term speaker characteristics modeling: Application to speaker verification. PhD thesis, Ecole de Technologie Superieure (Canada), aAINR50490.
go back to reference Dehak, N., & Shum, S. (2011). Low-dimensional speech representation based on factor analysis and its applications. Johns Hopkins CLSP Lecture. Dehak, N., & Shum, S. (2011). Low-dimensional speech representation based on factor analysis and its applications. Johns Hopkins CLSP Lecture.
go back to reference Dehak, N., Dumouchel, P., & Kenny, P. (2007). Modeling prosodic features with joint factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 15(7), 2095–2103. doi:10.1109/TASL.2007.902758.CrossRef Dehak, N., Dumouchel, P., & Kenny, P. (2007). Modeling prosodic features with joint factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 15(7), 2095–2103. doi:10.​1109/​TASL.​2007.​902758.CrossRef
go back to reference Dehak, N., Dehak, R., Glass, J. R., Reynolds, D. A., & Kenny, P. (2010). Cosine similarity scoring without score normalization techniques. Proceedings Odyssey 2010—The speaker and language recognition workshop. pp. 15–19. Dehak, N., Dehak, R., Glass, J. R., Reynolds, D. A., & Kenny, P. (2010). Cosine similarity scoring without score normalization techniques. Proceedings Odyssey 2010—The speaker and language recognition workshop. pp. 15–19.
go back to reference Dehak, N., Karam, Z. N., Reynolds, D. A., Dehak, R., Campbell, W. M., & Glass, J. R. (2011a). A channel-blind system for speaker verification. Proceedings of the IEEE international conference on acoustics, speech, and signal processing, ICASSP 2011. Retrieved May 22–27, 2011, Prague Congress Center, Prague. pp. 4536–4539. doi:10.1109/ICASSP.2011.5947363. Dehak, N., Karam, Z. N., Reynolds, D. A., Dehak, R., Campbell, W. M., & Glass, J. R. (2011a). A channel-blind system for speaker verification. Proceedings of the IEEE international conference on acoustics, speech, and signal processing, ICASSP 2011. Retrieved May 22–27, 2011, Prague Congress Center, Prague. pp. 4536–4539. doi:10.​1109/​ICASSP.​2011.​5947363.
go back to reference Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., & Ouellet, P. (2011b). Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 788–798. doi:10.1109/TASL.2010.2064307.CrossRef Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., & Ouellet, P. (2011b). Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 788–798. doi:10.​1109/​TASL.​2010.​2064307.CrossRef
go back to reference DeMarco, A., & Cox, S. J. (2012). Iterative classification of regional British accents in i-vector space. 2012 Symposium on machine learning in speech and language processing, MLSLP 2012, Portland. Retrieved September 14, 2012, pp. 1–4. DeMarco, A., & Cox, S. J. (2012). Iterative classification of regional British accents in i-vector space. 2012 Symposium on machine learning in speech and language processing, MLSLP 2012, Portland. Retrieved September 14, 2012, pp. 1–4.
go back to reference Ferrer, L., Scheffer, N., & Shriberg, E. (2010). A comparison of approaches for modeling prosodic features in speaker recognition. 2010 IEEE international conference on acoustics speech and signal processing (ICASSP). pp. 4414–4417. doi:10.1109/ICASSP.2010.5495632. Ferrer, L., Scheffer, N., & Shriberg, E. (2010). A comparison of approaches for modeling prosodic features in speaker recognition. 2010 IEEE international conference on acoustics speech and signal processing (ICASSP). pp. 4414–4417. doi:10.​1109/​ICASSP.​2010.​5495632.
go back to reference Foil, J. (1986). Language identification using noisy speech. IEEE international conference on ICASSP ’86. acoustics, speech, and signal processing (Vol. 11) Foil, J. (1986). Language identification using noisy speech. IEEE international conference on ICASSP ’86. acoustics, speech, and signal processing (Vol. 11)
go back to reference Garcia-Romero, D., Zhou, X., & Espy-Wilson, C. Y. (2012). Multicondition training of Gaussian PLDA models in i-vector space for noise and reverberation robust speaker recognition. 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4257–4260. doi:10.1109/ICASSP.2012.6288859. Garcia-Romero, D., Zhou, X., & Espy-Wilson, C. Y. (2012). Multicondition training of Gaussian PLDA models in i-vector space for noise and reverberation robust speaker recognition. 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4257–4260. doi:10.​1109/​ICASSP.​2012.​6288859.
go back to reference Ghahabi, O., & Hernando, J. (2014a). Deep belief networks for i-vector based speaker recognition. IEEE International conference on acoustics, speech and signal processing, ICASSP 2014, Florence. Retrieved May 4–9, 2014. pp. 1700–1704. doi:10.1109/ICASSP.2014.6853888. Ghahabi, O., & Hernando, J. (2014a). Deep belief networks for i-vector based speaker recognition. IEEE International conference on acoustics, speech and signal processing, ICASSP 2014, Florence. Retrieved May 4–9, 2014. pp. 1700–1704. doi:10.​1109/​ICASSP.​2014.​6853888.
go back to reference Ghahabi, O., & Hernando, J. (2014b). Global impostor selection for dbns in multi-session i-vector speaker recognition. Proceedings of advances in speech and language technologies for Iberian languages—Second international conference, IberSPEECH 2014, Las Palmas de Gran Canaria. Retrieved November 19–21, 2014. pp. 89–98, doi:10.1007/978-3-319-13623-3. Ghahabi, O., & Hernando, J. (2014b). Global impostor selection for dbns in multi-session i-vector speaker recognition. Proceedings of advances in speech and language technologies for Iberian languages—Second international conference, IberSPEECH 2014, Las Palmas de Gran Canaria. Retrieved November 19–21, 2014. pp. 89–98, doi:10.​1007/​978-3-319-13623-3.
go back to reference Glembek, O., Burget, L., Dehak, N., Brummer, N., & Kenny, P. (2009). Comparison of scoring methods used in speaker recognition with joint factor analysis. IEEE International conference on acoustics, speech and signal processing, ICASSP 2009. pp. 4057–4060. doi:10.1109/ICASSP.2009.4960519. Glembek, O., Burget, L., Dehak, N., Brummer, N., & Kenny, P. (2009). Comparison of scoring methods used in speaker recognition with joint factor analysis. IEEE International conference on acoustics, speech and signal processing, ICASSP 2009. pp. 4057–4060. doi:10.​1109/​ICASSP.​2009.​4960519.
go back to reference Glembek, O., Burget, L., Matejka, P., Karafiat, M., & Kenny, P. (2011). Simplification and optimization of i-vector extraction. IEEE International conference on acoustics, speech and signal processing (ICASSP), 2011. pp. 4516–4519. doi:10.1109/ICASSP.2011.5947358. Glembek, O., Burget, L., Matejka, P., Karafiat, M., & Kenny, P. (2011). Simplification and optimization of i-vector extraction. IEEE International conference on acoustics, speech and signal processing (ICASSP), 2011. pp. 4516–4519. doi:10.​1109/​ICASSP.​2011.​5947358.
go back to reference Glembek, O., Ma, J., Matejka, P., Zhang, B., Plchot, O., Burget, L., & Matsoukas, S. (2014). Domain adaptation via within-class covariance correction in i-vector based speaker recognition systems. 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 4032–4036. doi10.1109/ICASSP.2014.6854359. Glembek, O., Ma, J., Matejka, P., Zhang, B., Plchot, O., Burget, L., & Matsoukas, S. (2014). Domain adaptation via within-class covariance correction in i-vector based speaker recognition systems. 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 4032–4036. doi10.​1109/​ICASSP.​2014.​6854359.
go back to reference Gupta, V., Kenny, P., Ouellet, P., & Stafylakis, T. (2014). i-Vector-based speaker adaptation of deep neural networks for French broadcast audio transcription. 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 6334–6338. doi:10.1109/ICASSP.2014.6854823. Gupta, V., Kenny, P., Ouellet, P., & Stafylakis, T. (2014). i-Vector-based speaker adaptation of deep neural networks for French broadcast audio transcription. 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 6334–6338. doi:10.​1109/​ICASSP.​2014.​6854823.
go back to reference Hasan, T., Saeidi, R., Hansen, J., & van Leeuwen, D. (2013). Duration mismatch compensation for i-vector based speaker recognition systems. 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 7663–7667. doi:10.1109/ICASSP.2013.6639154. Hasan, T., Saeidi, R., Hansen, J., & van Leeuwen, D. (2013). Duration mismatch compensation for i-vector based speaker recognition systems. 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 7663–7667. doi:10.​1109/​ICASSP.​2013.​6639154.
go back to reference Huggins-Daines, D., Kumar, M., Chan, A., Black, A., Ravishankar, M., & Rudnicky, A. (2006), Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices. 2006 IEEE international conference on acoustics, speech and signal processing, 2006, ICASSP 2006 proceedings (Vol. 1), pp. I-I. doi:10.1109/ICASSP.2006.1659988. Huggins-Daines, D., Kumar, M., Chan, A., Black, A., Ravishankar, M., & Rudnicky, A. (2006), Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices. 2006 IEEE international conference on acoustics, speech and signal processing, 2006, ICASSP 2006 proceedings (Vol. 1), pp. I-I. doi:10.​1109/​ICASSP.​2006.​1659988.
go back to reference Jancik, Z., Plchot, O., Brummer, N., Burget, L., Glembek, O., Hubeika, V., et al. (2010). Data selection and calibration issues in automatic language recognition—investigation with BUT-AGNITIO NIST LRE 2009 system. Proceedings Odyssey 2010—The speaker and language recognition workshop. pp. 215–221. Jancik, Z., Plchot, O., Brummer, N., Burget, L., Glembek, O., Hubeika, V., et al. (2010). Data selection and calibration issues in automatic language recognition—investigation with BUT-AGNITIO NIST LRE 2009 system. Proceedings Odyssey 2010—The speaker and language recognition workshop. pp. 215–221.
go back to reference Kanagasundaram, A., Dean, D., Vogt, R., McLaren, M., Subramanian, S.,& Mason, M. (2012a). Weighted LDA techniques for i-vector based speaker verification. 2012 IEEE international conference on acoustics, speech and signal processing, ICASSP 2012, Kyoto. Retrieved March 25–30, 2012. pp. 4781–4784. doi:10.1109/ICASSP.2012.6288988. Kanagasundaram, A., Dean, D., Vogt, R., McLaren, M., Subramanian, S.,& Mason, M. (2012a). Weighted LDA techniques for i-vector based speaker verification. 2012 IEEE international conference on acoustics, speech and signal processing, ICASSP 2012, Kyoto. Retrieved March 25–30, 2012. pp. 4781–4784. doi:10.​1109/​ICASSP.​2012.​6288988.
go back to reference Kanagasundaram, A., Vogt, R., Dean, D., & Sridharan, S. (2012b). PLDA based speaker recognition on short utterances. Odyssey 2012: The speaker and language recognition workshop, Singapore. Retrieved June 25–28, 2012. pp 28–33. Kanagasundaram, A., Vogt, R., Dean, D., & Sridharan, S. (2012b). PLDA based speaker recognition on short utterances. Odyssey 2012: The speaker and language recognition workshop, Singapore. Retrieved June 25–28, 2012. pp 28–33.
go back to reference Kanagasundaram, A., Dean, D., Gonzalez-Dominguez, J., Sridharan, S., Ramos, D., & Gonzalez-Rodriguez, J. (2013). Improving short utterance based i-vector speaker recognition using source and utterance-duration normalization techniques. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon, Retrieved August 25–29, 2013. pp. 2465–2469. http://www.isca-speech.org/archive/interspeech_2013/i13_2465.html. Kanagasundaram, A., Dean, D., Gonzalez-Dominguez, J., Sridharan, S., Ramos, D., & Gonzalez-Rodriguez, J. (2013). Improving short utterance based i-vector speaker recognition using source and utterance-duration normalization techniques. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon, Retrieved August 25–29, 2013. pp. 2465–2469. http://​www.​isca-speech.​org/​archive/​interspeech_​2013/​i13_​2465.​html.
go back to reference Kanagasundaram, A., Dean, D., Sridharan, S., Gonzalez-Dominguez, J., Gonzalez-Rodriguez, J., & Ramos, D. (2014). Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques. Speech Communication, 59, 69–82. doi:10.1016/j.specom.2014.01.004.CrossRef Kanagasundaram, A., Dean, D., Sridharan, S., Gonzalez-Dominguez, J., Gonzalez-Rodriguez, J., & Ramos, D. (2014). Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques. Speech Communication, 59, 69–82. doi:10.​1016/​j.​specom.​2014.​01.​004.CrossRef
go back to reference Kanagasundaram, A., Dean, D., Sridharan, S., Gonzalez-Dominguez, J., González-Rodríguez, J., & Ramos, D. (2014). Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques. Speech Communication, 59, 69–82. doi:10.1016/j.specom.2014.01.004.CrossRef Kanagasundaram, A., Dean, D., Sridharan, S., Gonzalez-Dominguez, J., González-Rodríguez, J., & Ramos, D. (2014). Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques. Speech Communication, 59, 69–82. doi:10.​1016/​j.​specom.​2014.​01.​004.CrossRef
go back to reference Karafiát, M., Burget, L., Matejka, P., Glembek, O., & Cernocký, J. (2011). iVector-based discriminative adaptation for automatic speech recognition. 2011 IEEE workshop on automatic speech recognition & understanding, ASRU 2011, Waikoloa. Retrieved December 11–15, 2011. pp. 152–157. doi:10.1109/ASRU.2011.6163922. Karafiát, M., Burget, L., Matejka, P., Glembek, O., & Cernocký, J. (2011). iVector-based discriminative adaptation for automatic speech recognition. 2011 IEEE workshop on automatic speech recognition & understanding, ASRU 2011, Waikoloa. Retrieved December 11–15, 2011. pp. 152–157. doi:10.​1109/​ASRU.​2011.​6163922.
go back to reference Kenny, P. (2005). Joint factor analysis of speaker and session variability: Theory and algorithms. CRIM, Montreal, (Report) CRIM-06/08-13. Kenny, P. (2005). Joint factor analysis of speaker and session variability: Theory and algorithms. CRIM, Montreal, (Report) CRIM-06/08-13.
go back to reference Kenny, P., Boulianne, G., Ouellet, P., & Dumouchel, P. (2007a). Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(4), 1435–1447. doi:10.1109/TASL.2006.881693.CrossRef Kenny, P., Boulianne, G., Ouellet, P., & Dumouchel, P. (2007a). Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(4), 1435–1447. doi:10.​1109/​TASL.​2006.​881693.CrossRef
go back to reference Kenny, P., Boulianne, G., Ouellet, P., & Dumouchel, P. (2007b). Speaker and session variability in GMM-based speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 15(4), 1448–1460. doi:10.1109/TASL.2007.894527.CrossRef Kenny, P., Boulianne, G., Ouellet, P., & Dumouchel, P. (2007b). Speaker and session variability in GMM-based speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 15(4), 1448–1460. doi:10.​1109/​TASL.​2007.​894527.CrossRef
go back to reference Kenny, P., Ouellet, P., Dehak, N., Gupta, V., & Dumouchel, P. (2008). A study of interspeaker variability in speaker verification. Audio, Speech, and Language Processing, IEEE Transactions on, 16(5), 980–988. doi:10.1109/TASL.2008.925147.CrossRef Kenny, P., Ouellet, P., Dehak, N., Gupta, V., & Dumouchel, P. (2008). A study of interspeaker variability in speaker verification. Audio, Speech, and Language Processing, IEEE Transactions on, 16(5), 980–988. doi:10.​1109/​TASL.​2008.​925147.CrossRef
go back to reference Kenny, P., Stafylakis, T., Ouellet, P., Alam, M., & Dumouchel, P. (2013). PLDA for speaker verification with utterances of arbitrary duration. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 7649–7653. doi:10.1109/ICASSP.2013.6639151. Kenny, P., Stafylakis, T., Ouellet, P., Alam, M., & Dumouchel, P. (2013). PLDA for speaker verification with utterances of arbitrary duration. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 7649–7653. doi:10.​1109/​ICASSP.​2013.​6639151.
go back to reference Lamere, P., Kwok, P., Gouvea, E., Raj, B., Singh, R., Walker, W., et al. (2003). The CMU SPHINX-4 speech recognition system. In: IEEE international conference on acoustics, speech and signal processing (ICASSP 2003). Hong Kong, 1, 2–5. Lamere, P., Kwok, P., Gouvea, E., Raj, B., Singh, R., Walker, W., et al. (2003). The CMU SPHINX-4 speech recognition system. In: IEEE international conference on acoustics, speech and signal processing (ICASSP 2003). Hong Kong, 1, 2–5.
go back to reference Larcher, A., Bousquet, P., Lee, K. A., Matrouf, D., Li, H., & Bonastre, J. F. (2012) i-Vectors in the context of phonetically-constrained short utterances for speaker verification. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 4773–4776. doi:10.1109/ICASSP.2012.6288986 Larcher, A., Bousquet, P., Lee, K. A., Matrouf, D., Li, H., & Bonastre, J. F. (2012) i-Vectors in the context of phonetically-constrained short utterances for speaker verification. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 4773–4776. doi:10.​1109/​ICASSP.​2012.​6288986
go back to reference Lei, Y., Burget, L., Ferrer, L., Graciarena, M., & Scheffer, N. (2012a). Towards noise-robust speaker recognition using probabilistic linear discriminant analysis. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4253–4256. 10.1109/ICASSP.2012.6288858 Lei, Y., Burget, L., Ferrer, L., Graciarena, M., & Scheffer, N. (2012a). Towards noise-robust speaker recognition using probabilistic linear discriminant analysis. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4253–4256. 10.​1109/​ICASSP.​2012.​6288858
go back to reference Lei, Y., Burget, L., & Scheffer, N. (2013). A noise robust i-vector extractor using vector taylor series for speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 6788–6791. doi:10.1109/ICASSP.2013.6638976 Lei, Y., Burget, L., & Scheffer, N. (2013). A noise robust i-vector extractor using vector taylor series for speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 6788–6791. doi:10.​1109/​ICASSP.​2013.​6638976
go back to reference Lei, Y., McLaren, M., Ferrer, L., & Scheffer, N. (2014a). Simplified VTS-based i-Vector extraction in noise-robust speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 4037–4041. doi:10.1109/ICASSP.2014.6854360. Lei, Y., McLaren, M., Ferrer, L., & Scheffer, N. (2014a). Simplified VTS-based i-Vector extraction in noise-robust speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 4037–4041. doi:10.​1109/​ICASSP.​2014.​6854360.
go back to reference Lei, Y., Scheffer, N., Ferrer, L., & McLaren, M. (2014b). A novel scheme for speaker recognition using a phonetically-aware deep neural network. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 1695–1699. doi:10.1109/ICASSP.2014.6853887 Lei, Y., Scheffer, N., Ferrer, L., & McLaren, M. (2014b). A novel scheme for speaker recognition using a phonetically-aware deep neural network. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 1695–1699. doi:10.​1109/​ICASSP.​2014.​6853887
go back to reference Mandasari, M., McLaren, M., & van Leeuwen, D. (2012). The effect of noise on modern automatic speaker recognition systems. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4249–4252. doi:10.1109/ICASSP.2012.6288857 Mandasari, M., McLaren, M., & van Leeuwen, D. (2012). The effect of noise on modern automatic speaker recognition systems. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4249–4252. doi:10.​1109/​ICASSP.​2012.​6288857
Martinez, D., Burget, L., Ferrer, L., & Scheffer, N. (2012). iVector-based prosodic system for language identification. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4861–4864. doi:10.1109/ICASSP.2012.6289008
Martinez, D., Lleida, E., Ortega, A., & Miguel, A. (2013). Prosodic features and formant modeling for an ivector-based language recognition system. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 6847–6851. doi:10.1109/ICASSP.2013.6638988
Martinez, D., Burget, L., Stafylakis, T., Lei, Y., Kenny, P., & Lleida, E. (2014). Unscented transform for ivector-based noisy speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4042–4046. doi:10.1109/ICASSP.2014.6854361
Matejka, P., Glembek, O., Castaldo, F., Alam, M., Plchot, O., Kenny, P., et al. (2011). Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4828–4831. doi:10.1109/ICASSP.2011.5947436
McLaren, M., & van Leeuwen, D. (2011a). Source-normalised-and-weighted LDA for robust speaker recognition using i-vectors. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5456–5459. doi:10.1109/ICASSP.2011.5947593
McLaren, M., & van Leeuwen, D. (2012a). Gender-independent speaker recognition using source normalisation. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4373–4376. doi:10.1109/ICASSP.2012.6288888
McLaren, M., & van Leeuwen, D. (2012b). Source-normalized LDA for robust speaker recognition using i-vectors from multiple speech sources. IEEE Transactions on Audio, Speech, and Language Processing, 20(3), 755–766. doi:10.1109/TASL.2011.2164533
Meignier, S., & Merlin, T. (2010). LIUM SpkDiarization: An open source toolkit for diarization. In: CMU SPUD workshop (Vol. 2010).
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., et al. (2011). The Kaldi speech recognition toolkit. In: IEEE 2011 workshop on automatic speech recognition and understanding, IEEE Signal Processing Society, IEEE Catalog No.: CFP11SRW-USB.
Reynolds, D. A. (1992). A Gaussian mixture modeling approach to text-independent speaker identification. Ph.D. dissertation, Georgia Institute of Technology.
Senior, A., & Lopez-Moreno, I. (2014). Improving DNN speaker independence with i-vector inputs. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 225–229. doi:10.1109/ICASSP.2014.6853591
Senoussaoui, M., Kenny, P., Dehak, N., & Dumouchel, P. (2010). An i-vector extractor suitable for speaker recognition with both microphone and telephone speech. In: Odyssey 2010: the speaker and language recognition workshop, Brno, June 28–July 1, 2010, p. 6.
Silovsky, J., & Prazak, J. (2012). Speaker diarization of broadcast streams using two-stage clustering based on i-vectors and cosine distance scoring. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4193–4196. doi:10.1109/ICASSP.2012.6288843
Sizov, A., el Khoury, E., Kinnunen, T., Wu, Z., & Marcel, S. (2015). Joint speaker verification and antispoofing in the i-vector space. IEEE Transactions on Information Forensics and Security, 10(4), 821–832. doi:10.1109/TIFS.2015.2407362
Variani, E., Lei, X., McDermott, E., Lopez Moreno, I., & Gonzalez-Dominguez, J. (2014). Deep neural networks for small footprint text-dependent speaker verification. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4052–4056. doi:10.1109/ICASSP.2014.6854363
Villalba, J., & Lleida, E. (2013). Handling i-vectors from different recording conditions using multi-channel simplified PLDA in speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 6763–6767. doi:10.1109/ICASSP.2013.6638971
Wu, T., Yang, Y., Wu, Z., & Li, D. (2006). MASC: A speech corpus in Mandarin for emotion analysis and affective speaker recognition. In: IEEE Odyssey 2006: The speaker and language recognition workshop, pp. 1–5. doi:10.1109/ODYSSEY.2006.248084
Yin, S. C., Kenny, P., & Rose, R. (2006). Experiments in speaker adaptation for factor analysis based speaker verification. In: IEEE Odyssey 2006: The speaker and language recognition workshop, pp. 1–6. doi:10.1109/ODYSSEY.2006.248130
Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X. A., et al. (2006). The HTK book (for HTK version 3.4).
Yu, C., Liu, G., Hahm, S., & Hansen, J. (2014). Uncertainty propagation in front end factor analysis for noise robust speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4017–4021. doi:10.1109/ICASSP.2014.6854356
Zheng, R., Zhang, C., Zhang, S., & Xu, B. (2014). Variational Bayes based i-vector for speaker diarization of telephone conversations. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), Florence, May 4–9, 2014, pp. 91–95. doi:10.1109/ICASSP.2014.6853564
Metadata
Title: i-Vectors in speech processing applications: a survey
Authors: Pulkit Verma, Pradip K. Das
Publication date: 06-08-2015
Publisher: Springer US
Published in: International Journal of Speech Technology / Issue 4/2015
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI: https://doi.org/10.1007/s10772-015-9295-3
