
2023 | OriginalPaper | Chapter

Literature Review of Research on Common Methods of Grapheme-To-Phoneme

Authors : Yating Zhang, Han Zhang, Shaozhong Cao

Published in: IEIS 2022

Publisher: Springer Nature Singapore


Abstract

Grapheme-to-phoneme (G2P) conversion techniques are used in many fields, most notably speech synthesis (text-to-speech, TTS). The continuous improvement of G2P conversion techniques has facilitated the development of speech synthesis. The purpose of this paper is to provide a review of grapheme-to-phoneme conversion methods. First, grapheme-to-phoneme conversion methods of recent years are surveyed; then the relevant datasets and evaluation metrics are listed; finally, the open problems and development trends of grapheme-to-phoneme conversion are discussed.
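The evaluation metrics mentioned in the abstract are typically phoneme error rate (PER) and word error rate (WER), both built on Levenshtein edit distance [48]. The sketch below is an illustration added here (not taken from the chapter) of how these metrics could be computed; the ARPAbet-style pronunciations in the example are hypothetical.

```python
# Illustrative sketch: PER and WER for G2P evaluation via Levenshtein distance [48].
# The example pronunciations are hypothetical CMUdict-style phoneme sequences.

def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn the sequence `ref` into `hyp`."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))          # distances for the empty ref prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return prev[n]

def phoneme_error_rate(ref_prons, hyp_prons):
    """PER = total phoneme edit distance / total number of reference phonemes."""
    errors = sum(levenshtein(r, h) for r, h in zip(ref_prons, hyp_prons))
    total = sum(len(r) for r in ref_prons)
    return errors / total

def word_error_rate(ref_prons, hyp_prons):
    """WER here = fraction of words whose predicted pronunciation is not an exact match."""
    wrong = sum(1 for r, h in zip(ref_prons, hyp_prons) if r != h)
    return wrong / len(ref_prons)

if __name__ == "__main__":
    refs = [["K", "AE", "T"], ["F", "OW", "N", "IY", "M"]]
    hyps = [["K", "AE", "T"], ["F", "OW", "N", "EH", "M"]]
    print(f"PER: {phoneme_error_rate(refs, hyps):.3f}")  # 0.125 (1 error / 8 phonemes)
    print(f"WER: {word_error_rate(refs, hyps):.3f}")     # 0.500 (1 of 2 words wrong)
```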


Literature
1. Kaplan, R.M., Kay, M.: Regular models of phonological rule systems. Comput. Linguist. 20(3), 331–378 (1994)
2. Kominek, J., Black, A.W.: Learning pronunciation dictionaries: language complexity and word selection strategies. In: Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pp. 232–239 (2006)
3. Galescu, L., Allen, J.F.: Bi-directional conversion between graphemes and phonemes using a joint n-gram model. In: 4th ISCA Tutorial and Research Workshop (ITRW) on Speech Synthesis (2001)
4. Bisani, M., Ney, H.: Joint-sequence models for grapheme-to-phoneme conversion. Speech Commun. 50(5), 434–451 (2008)
5. Yang, D., Dixon, P., Sadahi, F.: Rapid development of a grapheme-to-phoneme system based on weighted finite state transducer (WFST) framework. In: The Japanese Acoustical Society 2009 Fall Lecture Proceedings, no. 3, pp. 111–112 (2009)
6. Novak, J.R., Dixon, P.R., Minematsu, N., et al.: Improving WFST-based G2P conversion with alignment constraints and RNNLM N-best rescoring. In: Thirteenth Annual Conference of the International Speech Communication Association (2012)
7. Novak, J.R., Minematsu, N., Hirose, K.: Failure transitions for joint n-gram models and G2P conversion. In: Interspeech, pp. 1821–1825 (2013)
8. Novak, J.R., Minematsu, N., Hirose, K.: WFST-based grapheme-to-phoneme conversion: open source tools for alignment, model-building and decoding. In: Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing, pp. 45–49 (2012)
9. Mikolov, T., Karafiát, M., Burget, L., et al.: Recurrent neural network based language model. In: Interspeech, vol. 2, no. 3, pp. 1045–1048 (2010)
10. Chen, S.F.: Conditional and joint models for grapheme-to-phoneme conversion. In: INTERSPEECH (2003)
11. Lehnen, P., Hahn, S., Guta, V.A., et al.: Hidden conditional random fields with m-to-n alignments for grapheme-to-phoneme conversion. In: Thirteenth Annual Conference of the International Speech Communication Association (2012)
12. Lehnen, P., Allauzen, A., Lavergne, T., et al.: Structure learning in hidden conditional random fields for grapheme-to-phoneme conversion. In: Annual Conference of the International Speech Communication Association (2013)
13. Taylor, P.: Hidden Markov models for grapheme to phoneme conversion. In: Ninth European Conference on Speech Communication and Technology (2005)
14. Ogbureke, K.U., Cahill, P., Carson-Berndsen, J.: Hidden Markov models with context-sensitive observations for grapheme-to-phoneme conversion. In: Eleventh Annual Conference of the International Speech Communication Association (2010)
15. Jiampojamarn, S., Kondrak, G., Sherif, T.: Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion. In: Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pp. 372–379 (2007)
16. Hahn, S., Vozila, P., Bisani, M.: Comparison of grapheme-to-phoneme methods on large pronunciation dictionaries and LVCSR tasks. In: Thirteenth Annual Conference of the International Speech Communication Association (2012)
17. Wu, K., Allauzen, C., Hall, K., et al.: Encoding linear models as weighted finite-state transducers (2014)
18. Mnih, V., Heess, N., Graves, A.: Recurrent models of visual attention. In: Advances in Neural Information Processing Systems, vol. 27 (2014)
19. Rao, K., Peng, F., Sak, H., et al.: Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4225–4229. IEEE (2015)
20. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
21.
22. Graves, A., Fernández, S., Gomez, F., et al.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376 (2006)
23. Mousa, A.E.D., Schuller, B.: Deep bidirectional long short-term memory recurrent neural networks for grapheme-to-phoneme conversion utilizing complex many-to-many alignments (2016)
24. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, vol. 27 (2014)
25. Robinson, A.J., Fallside, F.: The Utility Driven Dynamic Error Propagation Network. University of Cambridge Department of Engineering, Cambridge (1987)
26. Williams, R.J., Peng, J.: An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Comput. 2(4), 490–501 (1990)
27. Yolchuyeva, S., Németh, G., Gyires-Tóth, B.: Grapheme-to-phoneme conversion with convolutional neural networks. Appl. Sci. 9(6), 1143 (2019)
28. Chae, M.J., Park, K., Bang, J., et al.: Convolutional sequence to sequence model with non-sequential greedy decoding for grapheme to phoneme conversion. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2486–2490. IEEE (2018)
29. Toshniwal, S., Livescu, K.: Jointly learning to align and convert graphemes to phonemes with neural attention models. In: 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 76–82. IEEE (2016)
30. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
31. Sun, H., Tan, X., Gan, J.W., et al.: Token-level ensemble distillation for grapheme-to-phoneme conversion. arXiv preprint arXiv:1904.03446 (2019)
32. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
33. Wu, Y., Schuster, M., Chen, Z., et al.: Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)
34. Gehring, J., Auli, M., Grangier, D., et al.: Convolutional sequence to sequence learning. In: International Conference on Machine Learning, pp. 1243–1252. PMLR (2017)
35. Yolchuyeva, S., Németh, G., Gyires-Tóth, B.: Transformer based grapheme-to-phoneme conversion. arXiv preprint arXiv:2004.06338 (2020)
36. Řezáčková, M., Švec, J., Tihelka, D.: T5G2P: using text-to-text transfer transformer for grapheme-to-phoneme conversion (2021)
37. Dong, L., Guo, Z.Q., Tan, C.H., et al.: Neural grapheme-to-phoneme conversion with pre-trained grapheme models. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6202–6206. IEEE (2022)
38.
39. Peters, B., Dehdari, J., van Genabith, J.: Massively multilingual neural grapheme-to-phoneme conversion. arXiv preprint arXiv:1708.01464 (2017)
40. Sokolov, A., Rohlin, T., Rastrow, A.: Neural machine translation for multilingual grapheme-to-phoneme conversion. arXiv preprint arXiv:2006.14194 (2020)
41. Yu, M., Nguyen, H.D., Sokolov, A., et al.: Multilingual grapheme-to-phoneme conversion with byte representation. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8234–8238. IEEE (2020)
42. Zhu, J., Zhang, C., Jurgens, D.: ByT5 model for massively multilingual grapheme-to-phoneme conversion. arXiv preprint arXiv:2204.03067 (2022)
43. Kim, H.Y., Kim, J.H., Kim, J.M.: Fast bilingual grapheme-to-phoneme conversion. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track, pp. 289–296 (2022)
45. CALLHOME American English Lexicon (PRONLEX). Linguistic Data Consortium
47. Sejnowski, T.J.: The NetTalk Corpus: Phonetic Transcription of 20,008 English Words (1988)
48. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8), 707–710 (1966)
49. Zhu, R., Huang, Y.: Efficient privacy-preserving general edit distance and beyond. Cryptology ePrint Archive (2017)
Metadata
Title
Literature Review of Research on Common Methods of Grapheme-To-Phoneme
Authors
Yating Zhang
Han Zhang
Shaozhong Cao
Copyright Year
2023
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-99-3618-2_16
