International Journal of Speech Technology

Published: 10 September 2015

Automatic prominent syllable detection with machine learning classifiers

Authors: David O. Johnson, Okim Kang

Published in: International Journal of Speech Technology | Issue 4/2015

Abstract

In this paper, we examine how well five machine learning classifiers automatically detect Brazil’s prominent syllables, using seven feature sets built from three features (pitch, intensity, and duration) taken one at a time, two at a time, and all three together. Prominent syllables are the foundation of Brazil’s prosodic intonation model. We found that using pitch, intensity, and duration as features produces the best results. Our findings also revealed that, in terms of accuracy, F-measure, and Cohen’s kappa coefficient, bagging an ensemble of decision tree learners performed best (accuracy = 95.9 ± 0.2 %; F-measure = 93.7 ± 0.4; κ = 0.907 ± 0.005). The performance of our current model is significantly better than that of any other existing automatic detection software and that of human expert transcribers of prosody.
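The experimental setup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: it only shows how three features yield seven feature sets and how accuracy, F-measure, and Cohen's kappa can be computed for binary prominent/non-prominent labels. The function names and the toy labels are hypothetical, and the sketch reports F-measure on a 0 to 1 scale, whereas the abstract appears to use a 0 to 100 scale.

```python
from itertools import combinations

FEATURES = ("pitch", "intensity", "duration")


def feature_sets(features=FEATURES):
    """Return every non-empty subset of the features: 3 singles, 3 pairs, 1 triple."""
    return [
        subset
        for size in range(1, len(features) + 1)
        for subset in combinations(features, size)
    ]


def binary_metrics(y_true, y_pred):
    """Accuracy, F-measure, and Cohen's kappa for 0/1 labels (1 = prominent)."""
    pairs = list(zip(y_true, y_pred))
    n = len(pairs)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / n
    # F-measure: harmonic mean of precision and recall, i.e. 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    f_measure = 2 * tp / denom if denom else 0.0
    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0
    return accuracy, f_measure, kappa


if __name__ == "__main__":
    for subset in feature_sets():
        print(subset)

    # Hypothetical labels for ten syllables, for illustration only.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
    acc, f1, kappa = binary_metrics(y_true, y_pred)
    print(f"accuracy={acc:.3f} F-measure={f1:.3f} kappa={kappa:.3f}")
```

In a full experiment, each of the seven feature sets would be paired with each classifier and scored with these metrics. Kappa is useful alongside accuracy because non-prominent syllables typically outnumber prominent ones, so raw accuracy can look high by chance alone.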

Title: Automatic prominent syllable detection with machine learning classifiers
Authors: David O. Johnson, Okim Kang
Publication date: 10 September 2015
Publisher: Springer US
Published in: International Journal of Speech Technology / Issue 4/2015
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI: https://doi.org/10.1007/s10772-015-9299-z
