International Journal of Speech Technology

14 September 2022

Mining speech signal patterns for robust speaker variability classification

Authors: Moses Effiong Ekpenyong, Odudu-Obong Uwem Udocox

Published in: International Journal of Speech Technology | Issue 2/2023

Abstract

This paper proposes a speaker identification framework that combines high- and low-level features for state-of-the-art variability analysis and classification. The framework targets robust speaker variability classification using speech samples recorded in suboptimal conditions. The study used a translated Ibibio (New Benue-Congo, Nigeria) version of “The Tiger and the Mouse”, a prosodically balanced corpus that demonstrates the prosody of read-aloud English, with speech samples obtained from 50 participants (25 males and 25 females). Identity vectors (i-vectors), that is, low-dimensional signal patterns, were extracted and used as baselines for investigating speakers’ variability patterns across various classifiers (Decision Tree: DT, Support Vector Machine: SVM, k-Nearest Neighbour: k-NN, and Deep Neural Network: DNN) and kernels. The baselines were also treated with high-level features (speech duration, F0, intensity) for word, syllable, and phoneme units. The results showed that DTs and some variants of SVM gave high classification accuracies (above 70%). Hence, the hypothesis of universal Gaussianity appears inexact, as the linear predictor that is optimal in the mean square error sense may not hold for Ibibio. Further treatment of the baselines with Linear Discriminant Analysis (LDA) and Cosine Distance Scoring (CDS) yielded very poor classification results, except for the k-NN and Gaussian SVM classifiers, which performed well on the LDA-treated baselines.
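To make the cosine scoring step mentioned in the abstract concrete, the following is a minimal sketch of speaker identification by cosine scoring over fixed-length vectors. It is not the authors' implementation: the abstract does not give the i-vector extractor, dimensionality, or scoring configuration, so the three-dimensional vectors, the speaker labels, and the helper names (`cosine_score`, `mean_vector`, `identify`) are all hypothetical stand-ins chosen for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine score is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_vector(vectors: List[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def identify(test_vec: Sequence[float],
             enrolled: Dict[str, List[Sequence[float]]]) -> str:
    """Return the enrolled speaker whose mean vector scores highest."""
    models = {spk: mean_vector(vecs) for spk, vecs in enrolled.items()}
    return max(models, key=lambda spk: cosine_score(test_vec, models[spk]))


# Hypothetical toy data standing in for extracted i-vectors (assumption:
# real i-vectors come from a trained extractor and are much longer).
enrolled = {
    "speaker_A": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "speaker_B": [[0.0, 1.0, 0.2], [0.1, 0.8, 0.3]],
}
predicted = identify([0.8, 0.3, 0.0], enrolled)
print(predicted)
```

Because the score depends only on the angle between vectors, it ignores vector magnitude. That keeps the scoring cheap, but it also discards any speaker information carried by the vector length, which is one possible reason a more flexible classifier could do better on the same features.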

Title: Mining speech signal patterns for robust speaker variability classification
Authors: Moses Effiong Ekpenyong, Odudu-Obong Uwem Udocox
Publication date: 14 September 2022
Publisher: Springer US
Published in: International Journal of Speech Technology, Issue 2/2023
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI: https://doi.org/10.1007/s10772-022-09984-7
