
Role of neural network models for developing speech systems

Published in Sadhana.

Abstract

This paper discusses the application of neural networks to the development of various speech systems. Prosodic parameters of speech at the syllable level depend on the positional, contextual and phonological features of the syllables. In this paper, neural networks are explored for modelling the prosodic parameters of syllables from their positional, contextual and phonological features. The prosodic parameters considered in this work are the duration and the sequence of pitch (F0) values of the syllables. These prosody models are further examined for applications such as text-to-speech synthesis, speech recognition, speaker recognition and language identification. Neural network models are explored in a voice conversion system for capturing the mapping functions between source and target speakers at the source, system and prosodic levels. Neural network models are also used for characterizing the emotions present in speech. For identification of dialects in Hindi, neural network models are used to capture dialect-specific information from the spectral and prosodic features of speech.
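The syllable-level duration modelling described in the abstract can be sketched as a small feedforward network trained to map a syllable's positional, contextual and phonological feature vector to its duration. The sketch below is illustrative only: the feature dimensionality, hidden-layer size, and synthetic training data are assumptions, not the paper's actual configuration or corpus.

```python
import numpy as np

# Illustrative sketch (not the paper's exact architecture): a one-hidden-layer
# MLP mapping an assumed 25-dimensional syllable feature vector (positional,
# contextual, phonological) to a scalar duration, trained on toy data.
rng = np.random.default_rng(0)

n_feat, n_hidden, n_syl = 25, 16, 200    # assumed sizes, for illustration
X = rng.normal(size=(n_syl, n_feat))     # toy syllable feature vectors
w_true = rng.normal(size=n_feat)
y = 0.01 * (X @ w_true) + 0.15           # toy syllable durations (seconds)

# tanh hidden units, linear output unit (a common MLP choice for regression)
W1 = 0.1 * rng.normal(size=(n_feat, n_hidden))
b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.normal(size=n_hidden)
b2 = 0.0

def forward(X):
    h = np.tanh(X @ W1 + b1)             # hidden activations
    return h @ W2 + b2, h                # predicted durations

lr, losses = 0.05, []
for _ in range(500):                     # full-batch gradient descent on MSE
    pred, h = forward(X)
    err = pred - y
    losses.append(float(np.mean(err ** 2)))
    dh = np.outer(err, W2) * (1.0 - h ** 2)   # backprop through tanh
    W2 -= lr * (h.T @ err) / n_syl
    b2 -= lr * err.mean()
    W1 -= lr * (X.T @ dh) / n_syl
    b1 -= lr * dh.mean(axis=0)

pred, _ = forward(X)
```

In the work summarized above, the same kind of feature-to-prosody mapping is also applied to the sequence of F0 values of a syllable; there the network output would be a vector of pitch values rather than a single duration.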



Author information


Correspondence to K Sreenivasa Rao.


Cite this article

RAO, K.S. Role of neural network models for developing speech systems. Sadhana 36, 783–836 (2011). https://doi.org/10.1007/s12046-011-0047-z
