Published in: Pattern Analysis and Applications 2/2024

01.06.2024 | Theoretical Advances

A stacked convolutional neural network framework with multi-scale attention mechanism for text-independent voiceprint recognition

Authors: V. Karthikeyan, S. Suja Priyadharsini


Abstract

Short-utterance speaker identification is a challenging problem in natural language processing (NLP). Most state-of-the-art approaches to speech processing rely on convolutional neural networks (CNNs) and deep neural networks, and analyse the data as a unidirectional stream in time. Earlier CNN-based speaker-identification methods often used very dense or very large layers, resulting in a large number of parameters and significant computational cost. In this article, we present a novel multi-scale attention-focused one-dimensional convolutional neural network (MSA-CNN) for speaker recognition that combines L1 and L2 norms. The multi-scale convolutional training architecture autonomously extracts multi-scale characteristics from raw audio data using a variety of filter banks. A novel attention mechanism enables the multi-scale system to focus on salient speaker characteristics under varying conditions. Finally, the attention output is fused and fed into the proposed multi-layered convolutional neural network framework to predict speaker labels. The proposed network was evaluated on several standard voice databases and a real-time recorded corpus. Experimental results show that our method outperforms a baseline CNN (without an attention mechanism) as well as conventional feature-engineering-based speaker identification techniques, achieving an accuracy of 97.94% across multiple databases and distortion conditions.
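The core idea described in the abstract, extracting features at several temporal scales with different filter banks and then weighting the scales with an attention mechanism, can be sketched as follows. This is an illustrative NumPy toy, not the authors' implementation: the filter banks are random, the attention scores are a simple per-scale summary statistic, and the kernel sizes are assumptions chosen for the example.

```python
import numpy as np

def conv1d(x, kernel):
    """Single-channel 1-D convolution with zero "same" padding."""
    pad = len(kernel) // 2
    xp = np.pad(x, pad, mode="constant")
    return np.array([np.dot(xp[i:i + len(kernel)], kernel)
                     for i in range(len(x))])

def multi_scale_attention(x, kernel_sizes=(3, 5, 9), rng=None):
    """Extract features at several scales and fuse them with a
    softmax attention over the scales (illustrative sketch only)."""
    if rng is None:
        rng = np.random.default_rng(0)
    feats = []
    for k in kernel_sizes:
        w = rng.standard_normal(k) / np.sqrt(k)    # stand-in filter bank
        feats.append(np.maximum(conv1d(x, w), 0))  # ReLU activation
    feats = np.stack(feats)                        # shape: (n_scales, T)
    scores = feats.mean(axis=1)                    # per-scale summary score
    attn = np.exp(scores) / np.exp(scores).sum()   # softmax over scales
    fused = (attn[:, None] * feats).sum(axis=0)    # attention-weighted fusion
    return fused, attn

# Example: fuse multi-scale features of a short synthetic waveform
x = np.sin(np.linspace(0.0, 6.28, 64))
fused, attn = multi_scale_attention(x)
```

In the paper's framework, the fused representation would then feed a stacked convolutional classifier over speaker labels; here the fusion step alone is shown to make the multi-scale attention idea concrete.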


Metadata
Publisher: Springer London
Print ISSN: 1433-7541
Electronic ISSN: 1433-755X
DOI: https://doi.org/10.1007/s10044-024-01278-9
