Published in: International Journal of Speech Technology 4/2013

01.12.2013

Identification of Indian languages using multi-level spectral and prosodic features

Authors: V. Ramu Reddy, Sudhamay Maity, K. Sreenivasa Rao


Abstract

In this paper, spectral and prosodic features extracted at different levels are explored for analyzing the language-specific information present in speech. Spectral features extracted from 20 ms frames (block processing), individual pitch cycles (pitch synchronous analysis) and glottal closure regions are used for discriminating the languages. Prosodic features extracted at the syllable, tri-syllable and multi-word (phrase) levels are proposed in addition to the spectral features for capturing language-specific information. Language-specific prosody is represented by intonation, rhythm and stress features at the syllable and tri-syllable (word) levels, whereas temporal variations in fundamental frequency (F0 contour), durations of syllables and temporal variations in intensity (energy contour) represent the prosody at the multi-word (phrase) level. For analyzing the language-specific information in the proposed features, the Indian language speech database IITKGP-MLILSC is used. Gaussian mixture models are used to capture the language-specific information from the proposed features. The evaluation results indicate that language identification performance improves when the features are combined. The performance of the proposed features is also analyzed on the standard Oregon Graduate Institute Multi-Language Telephone-based Speech (OGI-MLTS) database.
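The sketch below illustrates the block-processing branch of such a system: spectral (MFCC) features computed from 20 ms frames, one Gaussian mixture model trained per language, and identification by maximum average log-likelihood. It is a minimal illustration only; the librosa and scikit-learn calls, the 8 kHz sampling rate and the 64-component mixtures are assumptions for the example, not the authors' toolchain or configuration.

# Minimal sketch: frame-level spectral features + per-language GMMs.
# librosa / scikit-learn are illustrative choices, not the paper's implementation.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def spectral_features(wav_path, sr=8000, frame_ms=20, shift_ms=10, n_mfcc=13):
    """MFCCs from 20 ms frames (block processing), shifted every 10 ms."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(sr * frame_ms / 1000)
    hop = int(sr * shift_ms / 1000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)
    return mfcc.T  # shape: (frames, n_mfcc)

def train_language_models(train_files, n_components=64):
    """Fit one GMM per language on pooled frame-level features."""
    models = {}
    for lang, files in train_files.items():
        X = np.vstack([spectral_features(f) for f in files])
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag',
                              max_iter=200, random_state=0)
        gmm.fit(X)
        models[lang] = gmm
    return models

def identify_language(wav_path, models):
    """Pick the language whose GMM gives the highest average log-likelihood."""
    X = spectral_features(wav_path)
    scores = {lang: gmm.score(X) for lang, gmm in models.items()}
    return max(scores, key=scores.get)

Pitch synchronous, glottal-closure-region and syllable/tri-syllable/phrase-level prosodic features would plug into the same scheme by replacing or augmenting spectral_features() with their own extractors; the per-language GMM scoring step is unchanged.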


Metadata
Title
Identification of Indian languages using multi-level spectral and prosodic features
Authors
V. Ramu Reddy
Sudhamay Maity
K. Sreenivasa Rao
Publication date
01.12.2013
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 4/2013
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-013-9198-0
