Published in: International Journal of Speech Technology 1/2019

08.11.2018

Long short-term memory recurrent neural network architectures for Urdu acoustic modeling

Authors: Tehseen Zia, Usman Zahid


Abstract

Recurrent neural networks (RNNs) have recently achieved remarkable improvements in acoustic modeling. However, the potential of RNNs has not yet been exploited for modeling Urdu acoustics. Connectionist temporal classification and attention-based RNNs suffer from the unavailability of a lexicon and from the computational cost of training, respectively. We therefore explored contemporary long short-term memory (LSTM) and gated recurrent networks for Urdu acoustic modeling. The efficacies of plain, deep, bidirectional, and deep bidirectional network architectures were evaluated empirically. The results indicate that the deep bidirectional architecture has an advantage over the others: a word error rate of 20% was achieved on a hundred-word dataset of twenty speakers, a 15% improvement over the baseline single-layer LSTM. We also observed that two-layer architectures improve performance over single-layer ones, but that performance degrades with additional layers. Comparing LSTM architectures with gated recurrent unit (GRU) based architectures, we found that LSTM has an advantage over GRU.
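To make the LSTM/GRU comparison in the abstract concrete, the following is a minimal numpy sketch of one time step of each cell, written from the standard formulations (Hochreiter & Schmidhuber 1997; Chung et al. 2014), not from the authors' code. Biases are omitted, matching footnote 2; the weight matrices `W` (input projection) and `U` (recurrent projection) and their stacked-gate layout are illustrative conventions, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U):
    """One LSTM time step (biases omitted for simplicity).

    W has shape (4n, d) and U has shape (4n, n); the four row blocks are
    the pre-activations of the input, forget, candidate, and output gates.
    """
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev
    i = sigmoid(z[0 * n:1 * n])        # input gate
    f = sigmoid(z[1 * n:2 * n])        # forget gate
    g = np.tanh(z[2 * n:3 * n])        # candidate cell state
    o = sigmoid(z[3 * n:4 * n])        # output gate
    c = f * c_prev + i * g             # new cell state
    h = o * np.tanh(c)                 # new hidden state
    return h, c

def gru_step(x, h_prev, W, U):
    """One GRU time step: two gates and no separate cell state,
    so fewer parameters (3n rows in W/U instead of 4n)."""
    n = h_prev.shape[0]
    zr = W[:2 * n] @ x + U[:2 * n] @ h_prev
    z = sigmoid(zr[:n])                # update gate
    r = sigmoid(zr[n:])                # reset gate
    h_tilde = np.tanh(W[2 * n:] @ x + U[2 * n:] @ (r * h_prev))
    return (1.0 - z) * h_prev + z * h_tilde
```

In an acoustic model these steps would be applied over a sequence of MFCC feature frames; a bidirectional layer runs one such recurrence forward and another backward over the sequence and concatenates the two hidden states, and stacking such layers gives the deep bidirectional architecture evaluated in the paper.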


Footnotes
1
Sequence processing with neural networks is usually performed by operating over a context window at the first layer. We do not consider the context window in this section, for notational convenience.
 
2
Biases are omitted throughout the paper for simplicity.
 
3
“Center for Language Engineering” [Online]. Available: http://www.cle.org.pk.
 
4
“Python_speech_features toolkit” [Online]. Available: https://python-speech-features.readthedocs.io/en/latest/.
 
Metadata
Title
Long short-term memory recurrent neural network architectures for Urdu acoustic modeling
Authors
Tehseen Zia
Usman Zahid
Publication date
08.11.2018
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 1/2019
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-018-09573-7
