Published in: International Journal of Speech Technology 2/2021

22.01.2021

Convolutional neural network vectors for speaker recognition

Authors: Soufiane Hourri, Nikola S. Nikolov, Jamal Kharroubi


Abstract

Deep learning models are now considered state of the art in many areas of pattern recognition. In speaker recognition, several architectures have been studied, such as deep neural networks (DNNs), deep belief networks (DBNs), and restricted Boltzmann machines (RBMs), while convolutional neural networks (CNNs) are the most widely used models in computer vision. However, CNNs have largely been confined to computer vision, because their structure is designed for two-dimensional data. To overcome this limitation, we develop a customized CNN for speaker recognition. The goal of this paper is to propose a new approach to extracting speaker characteristics by constructing CNN filters linked to the speaker. In addition, we propose new vectors for identifying speakers, which we call convVectors. Experiments were performed with a gender-dependent corpus (THUYG-20 SRE) under three noise conditions: clean, 9 dB, and 0 dB. We compared the proposed method with our baseline system and with state-of-the-art methods. The results show that the convVectors method is the most robust, improving the baseline system by an average of 43% and achieving an equal error rate (EER) of 1.05%. This is an important finding for understanding how deep learning models can be adapted to the problem of speaker recognition.
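The abstract reports performance as an equal error rate (EER), the operating point at which the false-acceptance rate (impostors wrongly accepted) equals the false-rejection rate (genuine speakers wrongly rejected). As a minimal illustration only, not the authors' evaluation code, the following sketch estimates the EER from two hypothetical arrays of verification scores by scanning candidate thresholds:

```python
import numpy as np

def compute_eer(genuine_scores, impostor_scores):
    """Estimate the equal error rate: the threshold where the
    false-acceptance rate (FAR) and false-rejection rate (FRR) meet."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        far = np.mean(impostor_scores >= t)  # impostors accepted at threshold t
        frr = np.mean(genuine_scores < t)    # genuine trials rejected at t
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Perfectly separable scores give an EER of 0.
print(compute_eer(np.array([0.9, 0.8]), np.array([0.2, 0.1])))  # → 0.0
```

A finer threshold grid (or interpolation of the ROC curve) gives a smoother estimate on large trial lists; the discrete scan above is enough to show the definition.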


Metadata
Title
Convolutional neural network vectors for speaker recognition
Authors
Soufiane Hourri
Nikola S. Nikolov
Jamal Kharroubi
Publication date
22.01.2021
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 2/2021
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-021-09795-2
