Published in: International Journal of Speech Technology 4/2017

7 October 2017

A waveform concatenation technique for text-to-speech synthesis

Authors: Soumya Priyadarsini Panda, Ajit Kumar Nayak

Abstract

Designing text-to-speech systems that produce natural-sounding speech in different Indian languages is a challenging and ongoing problem. Because Indian languages allow a large number of possible pronunciations, a concatenative speech synthesis technique must store a large number of speech segments in its speech database to achieve highly natural speech. However, the resulting database size makes such systems unusable on small handheld devices or human-computer interactive systems with limited storage resources. In this paper, we propose a fraction-based waveform concatenation technique that produces intelligible speech segments from a small-footprint speech database. The results of all the experiments performed show the effectiveness of the proposed technique in producing intelligible speech segments in different Indian languages, even with far less storage and computation overhead than the existing syllable-based technique.

Metadata
Title
A waveform concatenation technique for text-to-speech synthesis
Authors
Soumya Priyadarsini Panda
Ajit Kumar Nayak
Publication date
7 October 2017
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 4/2017
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-017-9463-8
