Published in: International Journal of Speech Technology 2/2023

14 September 2022

Mining speech signal patterns for robust speaker variability classification

Authors: Moses Effiong Ekpenyong, Odudu-Obong Uwem Udocox

Abstract

This paper proposes a speaker identification framework that combines high- and low-level features for state-of-the-art variability analysis and classification. The framework targets robust speaker variability classification from speech samples recorded in suboptimal conditions. The study used a translated Ibibio (New Benue-Congo, Nigeria) version of “The Tiger and the Mouse”, a prosodically balanced corpus that demonstrates the prosody of read-aloud English, with speech samples obtained from 50 participants (25 males and 25 females). Identity vectors (i-vectors), or low-dimensional signal patterns, were extracted and used as baselines for investigating speakers’ variability patterns across several classifiers (Decision Tree, DT; Support Vector Machine, SVM; k-Nearest Neighbour, k-NN; and Deep Neural Network, DNN) and kernels. Treating the baselines with high-level features (speech duration, F0, intensity) was also tested at the word, syllable, and phoneme levels. DTs and some SVM variants gave high classification accuracies (above 70%). Hence, the hypothesis of universal Gaussianity appears inexact, as the linear predictor that is optimal in the mean square error sense may not hold for Ibibio. Further treatment of the baselines with Linear Discriminant Analysis (LDA) and Cosine Distance Scoring (CDS) yielded very poor classification results, except for the k-NN and Gaussian SVM classifiers, which performed well on the LDA-treated baselines.
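To make the scoring step named in the abstract concrete, the following is a minimal sketch of cosine scoring between fixed-length speaker vectors. It is not the authors' implementation. The function names (`cosine_score`, `mean_vector`, `identify`), the speaker labels, and the toy 3-dimensional vectors are all hypothetical. Using the mean enrolment vector as the speaker model is an assumption made here for illustration; the abstract does not describe how the scoring was configured.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine score is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def identify(test_vector, models):
    """Return the label of the speaker model with the highest cosine score."""
    return max(models, key=lambda spk: cosine_score(test_vector, models[spk]))


# Hypothetical toy data: real i-vectors typically have hundreds of dimensions.
enrolment = {
    "speaker_a": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "speaker_b": [[0.0, 1.0, 0.2], [0.1, 0.8, 0.3]],
}

# Assumption: each speaker is modelled by the mean of their enrolment vectors.
models = {spk: mean_vector(vecs) for spk, vecs in enrolment.items()}

predicted = identify([0.8, 0.3, 0.0], models)
print(predicted)
```

Because the score depends only on the angle between vectors, it ignores vector length. This is one reason cosine scoring is often paired with a projection such as LDA; the abstract reports that the combination performed poorly for most classifiers on this data.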

Metadata
Title
Mining speech signal patterns for robust speaker variability classification
Authors
Moses Effiong Ekpenyong
Odudu-Obong Uwem Udocox
Publication date
14 September 2022
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 2/2023
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-022-09984-7
