International Journal of Speech Technology

Published: 30 November 2017

Improved i-vector extraction technique for speaker verification with short utterances

Authors: Arnab Poddar, Md Sahidullah, Goutam Saha

Published in: International Journal of Speech Technology | Issue 3/2018

Abstract

A major challenge in automatic speaker verification (ASV) is to improve performance with short speech segments for end-user convenience in real-world applications. In this paper, we present a detailed analysis of how duration variability affects state-of-the-art i-vector and classical Gaussian mixture model-universal background model (GMM-UBM) based ASV systems. We observe that the uncertainty of model parameter estimation in i-vector based ASV increases as the speech duration decreases. To compensate for the effect of duration variability in short utterances, we propose an adaptation technique for estimating the Baum-Welch statistics used in i-vector extraction. The adaptation uses information from pre-estimated background model parameters. ASV performance with the proposed approach is considerably better than that of the conventional i-vector based system. Furthermore, fusing the proposed i-vector based system with GMM-UBM improves ASV performance further, especially for short speech segments. Experiments on two speech corpora, NIST SRE 2008 and 2010, show relative improvements in equal error rate (EER) in the range of 12–20%.
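To make the role of the Baum-Welch statistics concrete, the following is a minimal sketch of the conventional zeroth- and first-order statistics computed against a diagonal-covariance UBM. It is not the adaptation proposed in the paper, whose details the abstract does not give. The function name `baum_welch_stats`, the toy UBM parameters, and the frame counts are all assumptions made for illustration.

```python
import numpy as np

def baum_welch_stats(frames, weights, means, variances):
    """Zeroth- and first-order Baum-Welch statistics for a diagonal GMM.

    frames:    (T, D) feature vectors, e.g. MFCCs
    weights:   (C,)   mixture weights of the background model
    means:     (C, D) component means
    variances: (C, D) diagonal covariances
    """
    # Log-density of every frame under every component: shape (T, C)
    diff = frames[:, None, :] - means[None, :, :]
    log_norm = -0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)
    log_lik = log_norm[None, :] - 0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)
    log_post = np.log(weights)[None, :] + log_lik

    # Normalise to per-frame component posteriors (max-shift for stability)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    n = post.sum(axis=0)   # zeroth-order: soft frame count per component, (C,)
    f = post.T @ frames    # first-order: posterior-weighted frame sums, (C, D)
    return n, f

# Toy UBM (hypothetical values, not taken from the paper)
rng = np.random.default_rng(0)
C, D = 4, 3
weights = np.full(C, 1.0 / C)
means = rng.normal(size=(C, D))
variances = np.ones((C, D))

# Compare a short and a long "utterance" of random frames
for T in (20, 2000):
    frames = rng.normal(size=(T, D))
    n, f = baum_welch_stats(frames, weights, means, variances)
    print(T, n.round(1))
```

The zeroth-order counts always sum to the number of frames, so a short utterance spreads only a few frames across the components. This is one way to see why estimates built from these statistics become less certain as the duration shrinks.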

Title: Improved i-vector extraction technique for speaker verification with short utterances
Authors: Arnab Poddar, Md Sahidullah, Goutam Saha
Publication date: 30 November 2017
Publisher: Springer US
Published in: International Journal of Speech Technology / Issue 3/2018
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI: https://doi.org/10.1007/s10772-017-9477-2
