Published in: Pattern Analysis and Applications 2/2024

01.06.2024 | Theoretical Advances

A stacked convolutional neural network framework with multi-scale attention mechanism for text-independent voiceprint recognition

Authors: V. Karthikeyan, S. Suja Priyadharsini


Abstract

Short-utterance speaker identification is a challenging problem in natural language processing (NLP). Most state-of-the-art approaches to speech processing rely on convolutional neural networks (CNNs) and deep neural networks, and analyse the data as a unidirectional stream in time. Earlier CNN-based speaker-identification methods often used very dense or very large layers, resulting in a large number of parameters and significant computational cost. In this article, we present a novel multi-scale attention-focused one-dimensional convolutional neural network (MSA-CNN) for speaker recognition that combines L1 and L2 norms. The multi-scale convolutional training architecture autonomously extracts multi-scale characteristics from raw audio data using a variety of filter banks. A novel attention mechanism enables the multi-scale system to focus on salient speaker characteristics under varying conditions. Finally, the attention output is fused and fed into the proposed multi-layered convolutional neural network framework to predict speaker labels. The proposed network was evaluated on several standard voice databases and a real-time recorded corpus. Experimental results show that our method outperforms a baseline CNN (without an attention mechanism) as well as conventional feature-engineering-based speaker identification techniques, achieving an accuracy of 97.94% across multiple databases and distortion conditions.
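The core idea described in the abstract, extracting features at several temporal scales with different filter banks and then weighting the scales with an attention mechanism, can be sketched as follows. This is an illustrative NumPy toy, not the authors' implementation: the filter banks are random, the attention scores are a simple per-scale summary statistic, and the kernel sizes are assumptions chosen for the example.

```python
import numpy as np

def conv1d(x, kernel):
    """Single-channel 1-D convolution with zero "same" padding."""
    pad = len(kernel) // 2
    xp = np.pad(x, pad, mode="constant")
    return np.array([np.dot(xp[i:i + len(kernel)], kernel)
                     for i in range(len(x))])

def multi_scale_attention(x, kernel_sizes=(3, 5, 9), rng=None):
    """Extract features at several scales and fuse them with a
    softmax attention over the scales (illustrative sketch only)."""
    if rng is None:
        rng = np.random.default_rng(0)
    feats = []
    for k in kernel_sizes:
        w = rng.standard_normal(k) / np.sqrt(k)    # stand-in filter bank
        feats.append(np.maximum(conv1d(x, w), 0))  # ReLU activation
    feats = np.stack(feats)                        # shape: (n_scales, T)
    scores = feats.mean(axis=1)                    # per-scale summary score
    attn = np.exp(scores) / np.exp(scores).sum()   # softmax over scales
    fused = (attn[:, None] * feats).sum(axis=0)    # attention-weighted fusion
    return fused, attn

# Example: fuse multi-scale features of a short synthetic waveform
x = np.sin(np.linspace(0.0, 6.28, 64))
fused, attn = multi_scale_attention(x)
```

In the paper's framework, the fused representation would then feed a stacked convolutional classifier over speaker labels; here the fusion step alone is shown to make the multi-scale attention idea concrete.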


Metadata
Publisher: Springer London
Print ISSN: 1433-7541
Electronic ISSN: 1433-755X
DOI: https://doi.org/10.1007/s10044-024-01278-9
