Published in: International Journal of Speech Technology 2/2021

22.01.2021

Convolutional neural network vectors for speaker recognition

Authors: Soufiane Hourri, Nikola S. Nikolov, Jamal Kharroubi


Abstract

Deep learning models are now considered state of the art in many areas of pattern recognition. In speaker recognition, several architectures have been studied, such as deep neural networks (DNNs), deep belief networks (DBNs), and restricted Boltzmann machines (RBMs), while convolutional neural networks (CNNs) are the most widely used models in computer vision. However, CNNs have largely been confined to computer vision, because their structure is designed for two-dimensional data. To overcome this limitation, we develop a customized CNN for speaker recognition. The goal of this paper is to propose a new approach to extracting speaker characteristics by constructing CNN filters linked to the speaker. In addition, we propose new vectors for identifying speakers, which we call convVectors. Experiments were performed with a gender-dependent corpus (THUYG-20 SRE) under three noise conditions: clean, 9 dB, and 0 dB. We compared the proposed method with our baseline system and with state-of-the-art methods. The results show that the convVectors method is the most robust, improving the baseline system by an average of 43% and achieving an equal error rate (EER) of 1.05%. This is an important finding for understanding how deep learning models can be adapted to the problem of speaker recognition.
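The abstract reports performance as an equal error rate (EER), the operating point at which the false-acceptance rate (impostors wrongly accepted) equals the false-rejection rate (genuine speakers wrongly rejected). As a minimal illustration only, not the authors' evaluation code, the following sketch estimates the EER from two hypothetical arrays of verification scores by scanning candidate thresholds:

```python
import numpy as np

def compute_eer(genuine_scores, impostor_scores):
    """Estimate the equal error rate: the threshold where the
    false-acceptance rate (FAR) and false-rejection rate (FRR) meet."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        far = np.mean(impostor_scores >= t)  # impostors accepted at threshold t
        frr = np.mean(genuine_scores < t)    # genuine trials rejected at t
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Perfectly separable scores give an EER of 0.
print(compute_eer(np.array([0.9, 0.8]), np.array([0.2, 0.1])))  # → 0.0
```

A finer threshold grid (or interpolation of the ROC curve) gives a smoother estimate on large trial lists; the discrete scan above is enough to show the definition.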


Metadata
Title
Convolutional neural network vectors for speaker recognition
Authors
Soufiane Hourri
Nikola S. Nikolov
Jamal Kharroubi
Publication date
22.01.2021
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 2/2021
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-021-09795-2
