Published in: International Journal of Speech Technology 3/2018

30 November 2017

Improved i-vector extraction technique for speaker verification with short utterances

Authors: Arnab Poddar, Md Sahidullah, Goutam Saha


Abstract

A major challenge in automatic speaker verification (ASV) is to improve performance with short speech segments for end-user convenience in real-world applications. In this paper, we present a detailed analysis of how duration variability affects state-of-the-art i-vector based ASV systems and classical Gaussian mixture model-universal background model (GMM-UBM) based ASV systems. We observe that the uncertainty of model parameter estimation in i-vector based ASV increases as the speech becomes shorter. To compensate for the effect of duration variability in short utterances, we propose an adaptation technique for estimating the Baum-Welch statistics used in i-vector extraction. The adaptation method uses information from pre-estimated background model parameters. The ASV performance of the proposed approach is considerably superior to that of the conventional i-vector based system. Furthermore, fusing the proposed i-vector based system with GMM-UBM improves ASV performance further, especially for short speech segments. Experiments conducted on two speech corpora, NIST SRE 2008 and 2010, show a relative improvement in equal error rate (EER) in the range of 12–20%.
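To make the quantities in the abstract concrete, the following minimal sketch computes the standard zeroth-order and first-order Baum-Welch statistics of an utterance against a diagonal-covariance UBM. It also shows one generic way to pull those statistics toward pre-estimated UBM means when few frames are available. The abstract does not give the authors' adaptation formula, so the function `smooth_toward_ubm`, its relevance factor `r`, and the toy UBM and frames are all illustrative assumptions. The interpolation is the familiar MAP-style mean update, not necessarily the method proposed in the paper.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def baum_welch_stats(frames, weights, means, variances):
    """Zeroth-order (N_c) and first-order (F_c) statistics per UBM component."""
    n_comp, dim = len(weights), len(means[0])
    N = [0.0] * n_comp
    F = [[0.0] * dim for _ in range(n_comp)]
    for x in frames:
        log_p = [math.log(w) + log_gauss_diag(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        # Log-sum-exp keeps the posterior normalisation numerically stable.
        top = max(log_p)
        log_norm = top + math.log(sum(math.exp(lp - top) for lp in log_p))
        for c in range(n_comp):
            gamma = math.exp(log_p[c] - log_norm)  # posterior of component c
            N[c] += gamma
            for d in range(dim):
                F[c][d] += gamma * x[d]
    return N, F

def smooth_toward_ubm(N, F, means, r=4.0):
    """ASSUMPTION: generic MAP-style interpolation, not the paper's formula.

    With few frames (small N_c) the estimate stays close to the UBM mean;
    with many frames it approaches the data mean F_c / N_c.
    """
    return [[(f + r * m) / (n_c + r) for f, m in zip(f_c, m_c)]
            for n_c, f_c, m_c in zip(N, F, means)]

# Hypothetical two-component, two-dimensional UBM and a three-frame utterance.
ubm_weights = [0.5, 0.5]
ubm_means = [[0.0, 0.0], [4.0, 4.0]]
ubm_vars = [[1.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.2], [3.9, 4.1], [0.3, 0.0]]

N, F = baum_welch_stats(frames, ubm_weights, ubm_means, ubm_vars)
adapted = smooth_toward_ubm(N, F, ubm_means)
print("zeroth-order:", N)
print("first-order:", F)
print("smoothed means:", adapted)
```

In this toy setting each component sees at most two frames, so the smoothed means remain near the UBM means. This is the kind of behaviour a short-utterance compensation scheme would aim for when the raw statistics are unreliable.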


Metadata
Title
Improved i-vector extraction technique for speaker verification with short utterances
Authors
Arnab Poddar
Md Sahidullah
Goutam Saha
Publication date
30 November 2017
Publisher
Springer US
Published in
International Journal of Speech Technology, Issue 3/2018
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-017-9477-2
