
Role of neural network models for developing speech systems

Published in Sadhana.

Abstract

This paper discusses the application of neural networks to the development of various speech systems. Prosodic parameters of speech at the syllable level depend on the positional, contextual and phonological features of the syllables. In this paper, neural networks are explored for modelling the prosodic parameters of syllables from their positional, contextual and phonological features. The prosodic parameters considered in this work are the duration and the sequence of pitch (F0) values of the syllables. These prosody models are further examined for applications such as text-to-speech synthesis, speech recognition, speaker recognition and language identification. Neural network models are explored in a voice conversion system for capturing the mapping functions between source and target speakers at the source, system and prosodic levels. Neural network models are also used for characterizing the emotions present in speech. For identification of dialects in Hindi, neural network models are used to capture dialect-specific information from the spectral and prosodic features of speech.
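The syllable-level duration modelling described in the abstract can be sketched as a small feedforward network trained to map a syllable's positional, contextual and phonological feature vector to its duration. The sketch below is illustrative only: the feature dimensionality, hidden-layer size, and synthetic training data are assumptions, not the paper's actual configuration or corpus.

```python
import numpy as np

# Illustrative sketch (not the paper's exact architecture): a one-hidden-layer
# MLP mapping an assumed 25-dimensional syllable feature vector (positional,
# contextual, phonological) to a scalar duration, trained on toy data.
rng = np.random.default_rng(0)

n_feat, n_hidden, n_syl = 25, 16, 200    # assumed sizes, for illustration
X = rng.normal(size=(n_syl, n_feat))     # toy syllable feature vectors
w_true = rng.normal(size=n_feat)
y = 0.01 * (X @ w_true) + 0.15           # toy syllable durations (seconds)

# tanh hidden units, linear output unit (a common MLP choice for regression)
W1 = 0.1 * rng.normal(size=(n_feat, n_hidden))
b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.normal(size=n_hidden)
b2 = 0.0

def forward(X):
    h = np.tanh(X @ W1 + b1)             # hidden activations
    return h @ W2 + b2, h                # predicted durations

lr, losses = 0.05, []
for _ in range(500):                     # full-batch gradient descent on MSE
    pred, h = forward(X)
    err = pred - y
    losses.append(float(np.mean(err ** 2)))
    dh = np.outer(err, W2) * (1.0 - h ** 2)   # backprop through tanh
    W2 -= lr * (h.T @ err) / n_syl
    b2 -= lr * err.mean()
    W1 -= lr * (X.T @ dh) / n_syl
    b1 -= lr * dh.mean(axis=0)

pred, _ = forward(X)
```

In the work summarized above, the same kind of feature-to-prosody mapping is also applied to the sequence of F0 values of a syllable; there the network output would be a vector of pitch values rather than a single duration.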



Author information


Correspondence to K Sreenivasa Rao.


Cite this article

RAO, K.S. Role of neural network models for developing speech systems. Sadhana 36, 783–836 (2011). https://doi.org/10.1007/s12046-011-0047-z
