International Journal of Speech Technology

06-08-2015

i-Vectors in speech processing applications: a survey

Authors: Pulkit Verma, Pradip K. Das

Published in: International Journal of Speech Technology | Issue 4/2015

Abstract

Many methods have been proposed over time in the domain of speech recognition, such as Gaussian mixture models (GMM), GMM with a universal background model (the GMM-UBM framework), and joint factor analysis. i-Vector subspace modeling is one of the more recent methods and has become the state-of-the-art technique in this domain. Its main benefit is that it models both intra-domain and inter-domain variability in the same low-dimensional space. In this survey, we present a comprehensive collection of research work related to i-vectors since their inception. Some recent trends in using i-vectors in combination with other approaches are also discussed. The application of i-vectors in various fields of speech recognition, viz. speaker, language, and accent recognition, is also presented. This paper should serve as a good starting point for anyone interested in working with i-vectors for speech processing in general. We conclude the paper with a brief discussion on the future of i-vectors.

Adami, A., Mihaescu, R., Reynolds, D., & Godfrey. J. (2003). Modeling prosodic dynamics for speaker recognition. 2003 IEEE international conference on, acoustics, speech, and signal processing, 2003, proceedings, (ICASSP ’03). (Vol. 4), pp. IV-788-91. doi:10.1109/ICASSP.2003.1202761.

Adami, A. G. (2007). Modeling prosodic differences for speaker recognition. Speech Communications, 49(4), 277–291. doi:10.1016/j.specom.2007.02.005.CrossRef

Alam, M. J., Ouellet, P., Kenny, P., & O’Shaughnessy, D. D. (2011) Comparative evaluation of feature normalization techniques for speaker verification. Advances in nonlinear speech processing—proceedings of 5th international conference on nonlinear speech processing, NOLISP 2011, Las Palmas de Gran Canaria. Retrieved November 7–9, 2011, pp. 246–253. doi:10.1007/978-3-642-25020-0_32.

Aronowitz, H. (2014). Inter dataset variability compensation for speaker recognition. IEEE international conference on acoustics, speech and signal processing (ICASSP), 2014, pp. 4002–4006. doi:10.1109/ICASSP.2014.6854353.

Aronowitz, H., & Barkan, O. (2012). Efficient approximated i-vector extraction. 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4789–4792. doi:10.1109/ICASSP.2012.6288990.

Aronowitz, H., & Rendel, A. (2014). Domain adaptation for text dependent speaker verification. INTERSPEECH 2014, 15th annual conference of the international speech communication Association, Singapore. Retrieved September 14–18, 2014, pp. 1337–1341. http://www.isca-speech.org/archive/interspeech_2014/i14_1337.html.

Bahari, M., Saeidi, R., Van hamme, H., & Van Leeuwen, D. (2013). Accent recognition using i-vector, gaussian mean supervector and gaussian posterior probability supervector for spontaneous telephone speech. 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 7344–7348. doi:10.1109/ICASSP.2013.6639089.

Bahari, M. H., McLaren, M., Hamme, H. V., & van Leeuwen, D. A. (2014). Speaker age estimation using i-vectors. Engineering Applications of AI, 34, 99–108. doi:10.1016/j.engappai.2014.05.003.

Behravan, H., Hautamäki, V., & Kinnunen, T. (2013). Foreign accent detection from spoken Finnish using i-vectors. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. Retrieved August 25–29, 2013, pp. 79–83. http://www.isca-speech.org/archive/interspeech_2013/i13_0079.html.

Behravan, H., Hautamäki, V., & Kinnunen, T. (2015). Factors affecting i-vector based foreign accent recognition: A case study in spoken finnish. Speech Communication, 66, 118–129. doi:10.1016/j.specom.2014.CrossRef

Biswas, S., Rohdin, J., & Shinoda, K. (2014). i-Vector selection for effective PLDA modeling in speaker recognition. Proceedings Odyssey 2014—The speaker and language recognition workshop. pp. 100–105.

Bousquet, P., Matrouf, D., & Bonastre, J. (2011). Intersession compensation and scoring methods in the i-vectors space for speaker recognition. INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. Retrieved August 27–31, 2011. pp. 485–488. http://www.isca-speech.org/archive/interspeech_2011/i11_0485.html.

Brümmer, N., Strasheim, A., Hubeika, V., Matejka, P., Burget, L., & Glembek, O. (2009). Discriminative acoustic language recognition via channel-compensated GMM statistics. INTERSPEECH 2009, 10th annual conference of the international speech communication association, Brighton. Retrieved September 6–10, 2009. pp. 2187–2190. http://www.isca-speech.org/archive/interspeech_2009/i09_2187.html.

Burget, L., Plchot, O., Cumani, S., Glembek, O., Matejka, P., & Brümmer, N. (2011). Discriminatively trained probabilistic linear discriminant analysis for speaker verification. Proceedings of the IEEE international conference on acoustics, speech, and signal processing, ICASSP 2011. Retrieved May 22–27, 2011. Prague Congress Center, Prague. pp. 4832–4835, doi10.1109/ICASSP.2011.5947437.

Busso, C., Bulut, M., Lee, C. C., Kazemzadeh, A., Mower, E., Kim, S., et al. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4), 335–359. doi:10.1007/s10579-008-9076-6.CrossRef

Chen, L., & Yang, Y. (2011). Applying emotional factor analysis and i-Vector to emotional speaker recognition. In Z. Sun, J. Lai, X. Chen, & T. Tan (Eds.), Biometric recognition, lecture notes in computer science (Vol. 7098, pp. 174–179). Berlin: Springer. doi:10.1007/978-3-642-25449-9-22.CrossRef

Chen, L., & Yang, Y. (2013). Emotional speaker recognition based on i-vector through atom aligned sparse representation. 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 7760–7764. doi:10.1109/ICASSP.2013.6639174.

Chen, N., Shen, W., & Campbell, J. (2010). A linguistically-informative approach to dialect recognition using dialect-discriminating context-dependent phonetic models. 2010 IEEE international conference on acoustics speech and signal processing (ICASSP), pp. 5014–5017. doi10.1109/ICASSP.2010.5495068.

Cheng, Y. C., Hautamaki, V., Huang, Z., Li, K., & Lee, C. H. (2014). An i-vector based descriptor for alphabetical gesture recognition. 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 6593–6597. doi10.1109/ICASSP.2014.6854875.

Cumani, S., Glembek, O., Brümmer, N., de Villiers, E., & Laface, P. (2012). Gender independent discriminative speaker recognition in i-vector space. 2012 IEEE international conference on acoustics, speech and signal processing, ICASSP 2012, Kyoto. Retrieved March 25–30, 2012. pp. 4361–4364. doi:10.1109/ICASSP.2012.6288885.

Dehak, N. (2009). Discriminative and generative approaches for long- and short-term speaker characteristics modeling: Application to speaker verification. PhD thesis, Ecole de Technologie Superieure (Canada), aAINR50490.

Dehak, N., & Shum, S. (2011). Low-dimensional speech representation based on factor analysis and its applications. Johns Hopkins CLSP Lecture.

Dehak, N., Dumouchel, P., & Kenny, P. (2007). Modeling prosodic features with joint factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 15(7), 2095–2103. doi:10.1109/TASL.2007.902758.CrossRef

Dehak, N., Kenny, P., & Dumouchel, P. (2007b) Continuous prosodic features and formant modeling with joint factor analysis for speaker verification. INTERSPEECH 2007, 8th annual conference of the international speech communication association, Antwerp. Retrieved August 27–31, 2007. pp 1234–1237. http://www.isca-speech.org/archive/interspeech_2007/i07_1234.html.

Dehak, N., Dehak, R., Glass, J. R., Reynolds, D. A., & Kenny, P. (2010). Cosine similarity scoring without score normalization techniques. Proceedings Odyssey 2010—The speaker and language recognition workshop. pp. 15–19.

Dehak, N., Karam, Z. N., Reynolds, D. A., Dehak, R., Campbell, W. M., & Glass, J. R. (2011a). A channel-blind system for speaker verification. Proceedings of the IEEE international conference on acoustics, speech, and signal processing, ICASSP 2011. Retrieved May 22–27, 2011, Prague Congress Center, Prague. pp. 4536–4539. doi:10.1109/ICASSP.2011.5947363.

Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., & Ouellet, P. (2011b). Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 788–798. doi:10.1109/TASL.2010.2064307.CrossRef

Dehak, N., Torres-Carrasquillo, P. A., Reynolds, D. A., & Dehak, R. (2011c). Language recognition via i-vectors and dimensionality reduction. INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. Retrieved August 27–31, 2011. pp. 857–860. http://www.isca-speech.org/archive/interspeech_2011/i11_0857.html.

DeMarco, A., & Cox, S. J. (2012). Iterative classification of regional British accents in i-vector space. 2012 Symposium on machine learning in speech and language processing, MLSLP 2012, Portland. Retrieved September 14, 2012, pp. 1–4.

DeMarco, A., & Cox, S. J. (2013). Native accent classification via I-vectors and speaker compensation fusion. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. Retrieved August 25–29, 2013. pp. 1472–1476. http://www.isca-speech.org/archive/interspeech_2013/i13_1472.html.

Dupuy, G., Rouvier, M., Meignier, S., & Estève, Y. (2012). i-Vectors and ILP clustering adapted to cross-show speaker diarization. INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. Retrieved September 9–13, 2012. pp. 2174–2177. http://www.isca-speech.org/archive/interspeech_2012/i12_2174.html.

Ferrer, L., Scheffer, N., & Shriberg, E. (2010). A comparison of approaches for modeling prosodic features in speaker recognition. 2010 IEEE international conference on acoustics speech and signal processing (ICASSP). pp. 4414–4417. doi:10.1109/ICASSP.2010.5495632.

Foil, J. (1986). Language identification using noisy speech. IEEE international conference on ICASSP ’86. acoustics, speech, and signal processing (Vol. 11)

Gaida, C., Lange, P., Petrick, R., Proba, P., Malatawy, A., & Suendermann-Oeft, D. (2014). Comparing open-source speech recognition toolkits. http://suendermann.com/su/pdf/oasis2014.pdf.

Garcia-Romero, D., & Espy-Wilson, C. Y. (2011). Analysis of i-vector length nNormalization in speaker recognition systems. INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence Retrieved August 27–31, 2011. pp. 249–252. http://www.isca-speech.org/archive/interspeech_2011/i11_0249.html.

Garcia-Romero, D., Zhou, X., & Espy-Wilson, C. Y. (2012). Multicondition training of Gaussian PLDA models in i-vector space for noise and reverberation robust speaker recognition. 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4257–4260. doi:10.1109/ICASSP.2012.6288859.

Ghahabi, O., & Hernando, J. (2014a). Deep belief networks for i-vector based speaker recognition. IEEE International conference on acoustics, speech and signal processing, ICASSP 2014, Florence. Retrieved May 4–9, 2014. pp. 1700–1704. doi:10.1109/ICASSP.2014.6853888.

Ghahabi, O., & Hernando, J. (2014b). Global impostor selection for dbns in multi-session i-vector speaker recognition. Proceedings of advances in speech and language technologies for Iberian languages—Second international conference, IberSPEECH 2014, Las Palmas de Gran Canaria. Retrieved November 19–21, 2014. pp. 89–98, doi:10.1007/978-3-319-13623-3.

Glembek, O., Burget, L., Dehak, N., Brummer, N., & Kenny, P. (2009). Comparison of scoring methods used in speaker recognition with joint factor analysis. IEEE International conference on acoustics, speech and signal processing, ICASSP 2009. pp. 4057–4060. doi:10.1109/ICASSP.2009.4960519.

Glembek, O., Burget, L., Matejka, P., Karafiat, M., & Kenny, P. (2011). Simplification and optimization of i-vector extraction. IEEE International conference on acoustics, speech and signal processing (ICASSP), 2011. pp. 4516–4519. doi:10.1109/ICASSP.2011.5947358.

Glembek, O., Ma, J., Matejka, P., Zhang, B., Plchot, O., Burget, L., & Matsoukas, S. (2014). Domain adaptation via within-class covariance correction in i-vector based speaker recognition systems. 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 4032–4036. doi10.1109/ICASSP.2014.6854359.

González, D. M., Plchot, O., Burget, L., Glembek, O., & Matejka, P. (2011). Language recognition in iVectors space. INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. Retrieved August 27–31, 2011. pp. 861–864. http://www.isca-speech.org/archive/interspeech_2011/i11_0861.html.

Gupta, V., Kenny, P., Ouellet, P., & Stafylakis, T. (2014). i-Vector-based speaker adaptation of deep neural networks for French broadcast audio transcription. 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 6334–6338. doi:10.1109/ICASSP.2014.6854823.

Hasan, T., Saeidi, R., Hansen, J., & van Leeuwen, D. (2013). Duration mismatch compensation for i-vector based speaker recognition systems. 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 7663–7667. doi:10.1109/ICASSP.2013.6639154.

Hautamäki, V., Cheng, Y., Rajan, P., & Lee, C. (2013). Minimax i-vector extractor for short duration speaker verification. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. Retrieved August 25–29, 2013. pp. 3708–3712. http://www.isca-speech.org/archive/interspeech_2013/i13_3708.html.

Huang, Z., Cheng, Y., Li, K., Hautamäki, V., & Lee, C. (2013). A blind segmentation approach to acoustic event detection based on i-vector. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. Retrieved August 25–29, 2013. pp. 2282–2286. http://www.isca-speech.org/archive/interspeech_2013/i13_2282.html.

Huggins-Daines, D., Kumar, M., Chan, A., Black, A., Ravishankar, M., & Rudnicky, A. (2006), Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices. 2006 IEEE international conference on acoustics, speech and signal processing, 2006, ICASSP 2006 proceedings (Vol. 1), pp. I-I. doi:10.1109/ICASSP.2006.1659988.

Jancik, Z., Plchot, O., Brummer, N., Burget, L., Glembek, O., Hubeika, V., et al. (2010). Data selection and calibration issues in automatic language recognition—investigation with BUT-AGNITIO NIST LRE 2009 system. Proceedings Odyssey 2010—The speaker and language recognition workshop. pp. 215–221.

Jiang, Y., Lee, K., Tang, Z., Ma, B., Larcher, A., & Li, H. (2012), PLDA modeling in i-vector and supervector space for speaker verification. INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. Retrieved September 9–13, 2012. pp. 1680–1683. http://www.isca-speech.org/archive/interspeech_2012/i12_1680.html.

Kanagasundaram, A., Vogt, R., Dean, D., Sridharan, S., & Mason, M. (2011). i-Vector based speaker recognition on short utterances. INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. Retrieved August 27–31, 2011. pp. 2341–2344. http://www.isca-speech.org/archive/interspeech_2011/i11_2341.html.

Kanagasundaram, A., Dean, D., Vogt, R., McLaren, M., Subramanian, S.,& Mason, M. (2012a). Weighted LDA techniques for i-vector based speaker verification. 2012 IEEE international conference on acoustics, speech and signal processing, ICASSP 2012, Kyoto. Retrieved March 25–30, 2012. pp. 4781–4784. doi:10.1109/ICASSP.2012.6288988.

Kanagasundaram, A., Vogt, R., Dean, D., & Sridharan, S. (2012b). PLDA based speaker recognition on short utterances. Odyssey 2012: The speaker and language recognition workshop, Singapore. Retrieved June 25–28, 2012. pp 28–33.

Kanagasundaram, A., Dean, D., Gonzalez-Dominguez, J., Sridharan, S., Ramos, D., & Gonzalez-Rodriguez, J. (2013). Improving short utterance based i-vector speaker recognition using source and utterance-duration normalization techniques. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon, Retrieved August 25–29, 2013. pp. 2465–2469. http://www.isca-speech.org/archive/interspeech_2013/i13_2465.html.

Kanagasundaram, A., Dean, D., Sridharan, S., Gonzalez-Dominguez, J., Gonzalez-Rodriguez, J., & Ramos, D. (2014). Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques. Speech Communication, 59, 69–82. doi:10.1016/j.specom.2014.01.004.CrossRef

Kanagasundaram, A., Dean, D., Sridharan, S., Gonzalez-Dominguez, J., González-Rodríguez, J., & Ramos, D. (2014). Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques. Speech Communication, 59, 69–82. doi:10.1016/j.specom.2014.01.004.CrossRef

Karafiát, M., Burget, L., Matejka, P., Glembek, O., & Cernocký, J. (2011). iVector-based discriminative adaptation for automatic speech recognition. 2011 IEEE workshop on automatic speech recognition & understanding, ASRU 2011, Waikoloa. Retrieved December 11–15, 2011. pp. 152–157. doi:10.1109/ASRU.2011.6163922.

Karanasou, P., Wang, Y., Gales, M. J. F., & Woodland, P. C. (2014). Adaptation of deep neural network acoustic models using factorised i-Vectors. INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. Retrieved September 14–18, 2014. pp. 2180–2184. http://www.isca-speech.org/archive/interspeech_2014/i14_2180.html.

Kenny, P. (2005). Joint factor analysis of speaker and session variability: Theory and algorithms. CRIM, Montreal, (Report) CRIM-06/08-13.

Kenny, P., Boulianne, G., Ouellet, P., & Dumouchel, P. (2007a). Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(4), 1435–1447. doi:10.1109/TASL.2006.881693.CrossRef

Kenny, P., Boulianne, G., Ouellet, P., & Dumouchel, P. (2007b). Speaker and session variability in GMM-based speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 15(4), 1448–1460. doi:10.1109/TASL.2007.894527.CrossRef

Kenny, P., Ouellet, P., Dehak, N., Gupta, V., & Dumouchel, P. (2008). A study of interspeaker variability in speaker verification. Audio, Speech, and Language Processing, IEEE Transactions on, 16(5), 980–988. doi:10.1109/TASL.2008.925147.CrossRef

Kenny, P., Stafylakis, T., Ouellet, P., Alam, M., & Dumouchel, P. (2013). PLDA for speaker verification with utterances of arbitrary duration. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 7649–7653. doi:10.1109/ICASSP.2013.6639151.

Kockmann, M., Burget, L., & Cernocký, J. (2010). Brno University of Technology System for Interspeech 2010 Paralinguistic Challenge. In: INTERSPEECH 2010, 11th annual conference of the international speech communication association, Makuhari, Chiba. September 26–30, 2010, pp 2822–2825. http://www.isca-speech.org/archive/interspeech_2010/i10_2822.html

Kockmann, M., Ferrer, L., Burget, L., & Cernocký, J. (2011). iVector fusion of prosodic and cepstral features for speaker verification. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence,. August 27–31, 2011, pp 265–268. http://www.isca-speech.org/archive/interspeech_2011/i11_0265.html

Lamere, P., Kwok, P., Gouvea, E., Raj, B., Singh, R., Walker, W., et al. (2003). The CMU SPHINX-4 speech recognition system. In: IEEE international conference on acoustics, speech and signal processing (ICASSP 2003). Hong Kong, 1, 2–5.

Larcher, A., Bousquet, P., Lee, K. A., Matrouf, D., Li, H., & Bonastre, J. F. (2012) i-Vectors in the context of phonetically-constrained short utterances for speaker verification. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 4773–4776. doi:10.1109/ICASSP.2012.6288986

Larcher, A., Bonastre, J., Fauve, B. G. B., Lee, K., Lévy, C., Li, H., et al. (2013). ALIZE 3.0: Open source toolkit for state-of-the-art speaker recognition. In: INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. August 25–29, 2013, pp 2768–2772, http://www.isca-speech.org/archive/interspeech_2013/i13_2768.html

Le,V. B., Mella, O., & Fohr, D. (2007). Speaker diarization using normalized cross likelihood ratio. In: INTERSPEECH 2007, 8th annual conference of the international speech communication association, Antwerp. August 27–31, 2007, pp. 1869–1872, http://www.isca-speech.org/archive/interspeech_2007/i07_1869.html

Lei, Y., Burget, L., Ferrer, L., Graciarena, M., & Scheffer, N. (2012a). Towards noise-robust speaker recognition using probabilistic linear discriminant analysis. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4253–4256. 10.1109/ICASSP.2012.6288858

Lei, Y., Burget, L., & Scheffer, N. (2012b). Bilinear factor analysis for ivector based speaker verification. In: INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. September 9–13, 2012, pp 1588–1591, http://www.isca-speech.org/archive/interspeech_2012/i12_1588.html

Lei, Y., Burget, L., & Scheffer, N. (2013). A noise robust i-vector extractor using vector taylor series for speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 6788–6791. doi:10.1109/ICASSP.2013.6638976

Lei, Y., McLaren, M., Ferrer, L., & Scheffer, N. (2014a). Simplified VTS-based i-Vector extraction in noise-robust speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 4037–4041. doi:10.1109/ICASSP.2014.6854360.

Lei, Y., Scheffer, N., Ferrer, L., & McLaren, M. (2014b). A novel scheme for speaker recognition using a phonetically-aware deep neural network. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 1695–1699. doi:10.1109/ICASSP.2014.6853887

Li, M., & Liu, W. (2014). Speaker verification and spoken language identification using a generalized i-vector framework with phonetic tokenizations and tandem features. In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. September 14–18, 2014, pp 1120–1124, http://www.isca-speech.org/archive/interspeech_2014/i14_1120.html

Li, M., Zhang, X., Yan, Y., & Narayanan, S. S. (2011). Speaker verification using sparse representations on total variability i-vectors. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp 2729–2732, http://www.isca-speech.org/archive/interspeech_2011/i11_2729.html

Mandasari, M., McLaren, M., & van Leeuwen, D. (2012). The effect of noise on modern automatic speaker recognition systems. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4249–4252. doi:10.1109/ICASSP.2012.6288857

Mandasari, M. I., McLaren, M., & van Leeuwen, D. A. (2011). Evaluation of i-vector speaker recognition systems for forensic application. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp. 21–24, http://www.isca-speech.org/archive/interspeech_2011/i11_0021.html

Mariooryad, S., & Busso, C. (2014). Compensating for speaker or lexical variabilities in speech for emotion recognition. Speech Communication, 57(0):1–12. doi:10.1016/j.specom.2013.07.011, http://www.sciencedirect.com/science/article/pii/S0167639313001015

Martinez, D., Burget, L., Ferrer, L., & Scheffer, N. (2012). iVector-based prosodic system for language identification. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4861–4864. doi:10.1109/ICASSP.2012.6289008

Martinez, D., Lleida, E., Ortega, A., & Miguel, A. (2013). Prosodic features and formant modeling for an ivector-based language recognition system. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 6847–6851. doi:10.1109/ICASSP.2013.6638988

Martinez, D., Burget, L., Stafylakis, T., Lei, Y., Kenny, P., & Lleida, E. (2014). Unscented transform for ivector-based noisy speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4042–4046. doi:10.1109/ICASSP.2014.6854361

Matejka, P., Glembek, O., Castaldo, F., Alam, M., Plchot, O., Kenny, P., et al. (2011). Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4828–4831. doi:10.1109/ICASSP.2011.5947436

McLaren, M., & van Leeuwen, D. (2011a). Source-normalised-and-weighted LDA for robust speaker recognition using i-vectors. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5456–5459, DOI 10.1109/ICASSP.2011.5947593.

McLaren, M., & van Leeuwen, D. (2012a). Gender-independent speaker recognition using source normalisation. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4373–4376. doi:10.1109/ICASSP.2012.6288888

McLaren, M., & van Leeuwen, D. (2012b). Source-normalized LDA for Robust speaker recognition using i-vectors from multiple speech sources. IEEE Transactions on Audio, Speech, and Language Processing, 20(3), 755–766. doi:10.1109/TASL.2011.2164533.CrossRef

McLaren, M., & van Leeuwen, D. A. (2011b). To weight or not to weight: source-normalised LDA for speaker recognition using i-vectors. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp. 2709–2712, http://www.isca-speech.org/archive/interspeech_2011/i11_2709.html

Meignier, S., & Merlin, T. (2010). LIUM SpkDiarization: An open source toolkit for diarization. In: CMU SPUD workshop (Vol. 2010)

Novoselov, S., Pekhovsky, T., Simonchik, K., & Shulipa, A. (2014). RBM-PLDA subsystem for the NIST i-vector challenge. In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore, September 14–18, 2014, pp 378–382, http://www.isca-speech.org/archive/interspeech_2014/i14_0378.html

Picone, J. (1993). Signal modeling techniques in speech recognition. Proceedings of the IEEE, 81(9), 1215–1247. doi:10.1109/5.237532.CrossRef

Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., et al. (2011). The kaldi speech recognition toolkit. In: IEEE 2011 workshop on automatic speech recognition and understanding, IEEE Signal Processing Society, iEEE Catalog No.: CFP11SRW-USB.

Reynolds, D. A. (1992). A Gaussian mixture modeling approach to text-independent speaker identification. Ph.D. dissertation, Georgia Institute of Technology.

Reynolds, D. A., Quatieri, T. F., & Dunn, R. B. (2000). Speaker verification using adapted gaussian mixture models. Digital Signal Processing, 10(13), 19–41. doi:10.1006/dspr.1999.0361.CrossRef

Rouvier, M., & Favre, B. (2014). Speaker adaptation of DNN-based ASR with i-vectors: Does it actually adapt models to speakers? In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore, September 14–18, 2014, pp. 3007–3011, http://www.isca-speech.org/archive/interspeech_2014/i14_3007.html

Rouvier, M., Dupuy, G., Gay, P., el Khoury, E., Merlin, T., & Meignier, S. (2013). An open-source state-of-the-art toolbox for broadcast news diarization. In: INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. August 25–29, 2013, pp. 1477–1481, http://www.isca-speech.org/archive/interspeech_2013/i13_1477.html

Sadjadi, S. O., Slaney, M., & Heck, L. (2013). Msr identity toolbox v1.0: A matlab toolbox for speaker-recognition research. Speech and Language Processing Technical Committee Newsletter http://research.microsoft.com/apps/pubs/default.aspx?id=205119

Sarkar, A. K., Matrouf, D., Bousquet, P., & Bonastre, J. (2012). Study of the effect of i-vector modeling on short and mismatch utterance duration for speaker verification. In: INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. September 9–13, 2012, pp. 2662–2665, http://www.isca-speech.org/archive/interspeech_2012/i12_2662.html

Sarkar, S., & Rao, K. S. (2014). A novel boosting algorithm for improved i-vector based speaker verification in noisy environments. In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. September 14–18, 2014, pp. 671–675, http://www.isca-speech.org/archive/interspeech_2014/i14_0671.html

Segbroeck, M. V., Travadi, R., & Narayanan, S. S. (2014a) UBM fused total variability modeling for language identification. In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. September 14–18, 2014, pp. 3027–3031, http://www.isca-speech.org/archive/interspeech_2014/i14_3027.html

Segbroeck, M. V., Travadi, R., Vaz, C., Kim, J., Black, M. P., Potamianos, A., et al. (2014b). Classification of cognitive load from speech using an i-vector framework. in: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. September 14–18, 2014, pp. 751–755, http://www.isca-speech.org/archive/interspeech_2014/i14_0751.html

Senior, A., & Lopez-Moreno, I. (2014). Improving DNN speaker independence with i-vector inputs. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 225–229. doi:10.1109/ICASSP.2014.6853591

Senoussaoui, M., Kenny, P., Dehak, N., & Dumouchel, P. (2010). An i-vector extractor suitable for speaker recognition with both microphone and telephone speech. In: Odyssey 2010: the speaker and language recognition workshop, Brno, June 28–July 1, 2010, p. 6

Senoussaoui, M., Kenny, P., Brümmer, N., de Villiers, E., & Dumouchel, P. (2011). Mixture of PLDA models in i-vector space for gender-independent speaker recognition. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp. 25–28, http://www.isca-speech.org/archive/interspeech_2011/i11_0025.html

Shum, S., Dehak, N., Chuangsuwanich, E., Reynolds, D. A., & Glass, J. R. (2011). Exploiting intra-conversation variability for speaker diarization. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp. 945–948, http://www.isca-speech.org/archive/interspeech_2011/i11_0945.html

Silovsky, J., & Prazak, J. (2012). Speaker diarization of broadcast streams using two-stage clustering based on i-vectors and cosine distance scoring. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4193–4196. doi:10.1109/ICASSP.2012.6288843

Simonchik, K., Pekhovsky, T., Shulipa, A., & Afanasyev, A. (2012). Supervized mixture of PLDA models for cross-channel speaker verification. In: INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. September 9–13, 2012, pp. 1684–1687, http://www.isca-speech.org/archive/interspeech_2012/i12_1684.html

Sizov, A., el Khoury, E., Kinnunen, T., Wu, Z., & Marcel, S. (2015). Joint speaker verification and antispoofing in the i-vector space. IEEE Transactions on Information Forensics and Security, 10(4), 821–832. doi:10.1109/TIFS.2015.2407362

Soufifar, M., Kockmann, M., Burget, L., Plchot, O., Glembek, O., & Svendsen, T. (2011). iVector approach to phonotactic language recognition. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp. 2913–2916, http://www.isca-speech.org/archive/interspeech_2011/i11_2913.html

Travadi, R., Segbroeck, M. V., & Narayanan, S. S. (2014). Modified-prior i-vector estimation for language identification of short duration utterances. In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. September 14–18, 2014, pp. 3037–3041, http://www.isca-speech.org/archive/interspeech_2014/i14_3037.html

Varga, A., & Steeneken, H. J. (1993). Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3), 247–251. doi:10.1016/0167-6393(93)90095-3, http://www.sciencedirect.com/science/article/pii/0167639393900953

Variani, E., Lei, X., McDermott, E., Lopez Moreno, I., & Gonzalez-Dominguez, J. (2014). Deep neural networks for small footprint text-dependent speaker verification. In: Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 4052–4056. doi:10.1109/ICASSP.2014.6854363

Villalba, J., & Lleida, E. (2013). Handling i-vectors from different recording conditions using multi-channel simplified PLDA in speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 6763–6767. doi:10.1109/ICASSP.2013.6638971

Wolf, J. J. (1972). Efficient acoustic parameters for speaker recognition. The Journal of the Acoustical Society of America, 51(6B), 2044–2056. doi:10.1121/1.1913065, http://scitation.aip.org/content/asa/journal/jasa/51/6B/10.1121/1.1913065

Wu, T., Yang, Y., Wu, Z., & Li, D. (2006). MASC: A speech corpus in Mandarin for emotion analysis and affective speaker recognition. In: Speaker and language recognition workshop, 2006. IEEE Odyssey 2006: The, pp. 1–5. doi:10.1109/ODYSSEY.2006.248084

Xia, R., & Liu, Y. (2012). Using i-vector space model for emotion recognition. In: INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. September 9–13, 2012, pp. 2230–2233, http://www.isca-speech.org/archive/interspeech_2012/i12_2230.html

Yin, S. C., Kenny, P., & Rose, R. (2006). Experiments in speaker adaptation for factor analysis based speaker verification. In: Speaker and language recognition workshop, 2006. IEEE Odyssey 2006: The, pp. 1–6. doi:10.1109/ODYSSEY.2006.248130

Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X. A., et al. (2006). The HTK book (for HTK version 3.4).

Yu, C., Liu, G., Hahm, S., & Hansen, J. (2014). Uncertainty propagation in front end factor analysis for noise robust speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4017–4021. doi:10.1109/ICASSP.2014.6854356

Zheng, R., Zhang, C., Zhang, S., & Xu, B. (2014). Variational Bayes based i-vector for speaker diarization of telephone conversations. In: IEEE international conference on acoustics, speech and signal processing, ICASSP, Florence. May 4–9, 2014, pp. 91–95. doi:10.1109/ICASSP.2014.6853564

Zhuang, X., Tsakalidis, S., Wu, S., Natarajan, P., Prasad, R., & Natarajan, P. (2012). Compact audio representation for event detection in consumer media. In: INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. September 9–13, 2012, pp. 2089–2092, http://www.isca-speech.org/archive/interspeech_2012/i12_2089.html

Title: i-Vectors in speech processing applications: a survey
Authors: Pulkit Verma, Pradip K. Das
Publication date: 06-08-2015
Publisher: Springer US
Published in: International Journal of Speech Technology / Issue 4/2015
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI: https://doi.org/10.1007/s10772-015-9295-3

