
2022 | Original Paper | Book Chapter

A Review on Speech Synthesis Based on Machine Learning

Authors: Ruchika Kumari, Amita Dev, Ashwni Kumar

Published in: Artificial Intelligence and Speech Technology

Publisher: Springer International Publishing


Abstract

Speech synthesis is a growing research area in which text is taken as input and an acoustic waveform is produced as output. Speech synthesis systems are particularly beneficial to physically impaired people. During synthesis, complications arise from surrounding noise and variations in speaking style, and various machine learning techniques are employed to suppress such unwanted noise. In this paper, we describe techniques adopted to improve the naturalness and quality of synthesized speech. The main contribution of this paper is to elaborate and compare the characteristics of techniques utilized in speech synthesis for different languages. Techniques such as the support vector machine, artificial neural network, Gaussian mixture model, generative adversarial network, deep neural network, and hidden Markov model are examined with respect to the naturalness and quality of the synthesized speech signals.
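To make one of the surveyed techniques concrete, the sketch below (not taken from the paper; the two-state model and all probabilities are illustrative toy values) shows the forward algorithm, the core likelihood computation underlying HMM-based statistical parametric synthesis:

```python
# Minimal sketch of the HMM forward algorithm, as used at the core of
# HMM-based parametric speech synthesis. The model here is a toy
# discrete-observation HMM; real synthesizers use continuous acoustic
# features (e.g. mel-cepstra) with Gaussian emission densities.

def forward_likelihood(obs, pi, A, B):
    """Return P(obs | model) for a discrete-observation HMM.

    obs: sequence of observation symbol indices
    pi:  initial state probabilities, pi[i]
    A:   transition probabilities, A[i][j] = P(state j | state i)
    B:   emission probabilities, B[i][o] = P(symbol o | state i)
    """
    n_states = len(pi)
    # alpha[i] = P(o_1..o_t, state_t = i); initialize at t = 1
    alpha = [pi[i] * B[i][obs[0]] for i in range(n_states)]
    # Recursively extend alpha over the remaining observations
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][o]
            for j in range(n_states)
        ]
    # Total likelihood: sum over all possible final states
    return sum(alpha)

# Toy model: two hidden states (say, "voiced"/"unvoiced"), two symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(forward_likelihood([0, 1, 0], pi, A, B))  # ≈ 0.10893
```

In an HMM-based synthesizer this likelihood drives both training (via Baum-Welch re-estimation) and, at synthesis time, the selection of state-duration and spectral parameters from which the waveform is generated.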


Metadata
Copyright year: 2022
DOI: https://doi.org/10.1007/978-3-030-95711-7_3