Published in: Arabian Journal for Science and Engineering 8/2022

31.03.2022 | Research Article-Computer Engineering and Computer Science

Spoken Utterance Classification Task of Arabic Numerals and Selected Isolated Words

Authors: Karim Dabbabi, Abdelkarim Mars



Abstract

Nowadays, human sensory channels are becoming essential means of controlling modern machines that require human intervention. Among these channels, the voice can be used to control and monitor modern interfaces. In this regard, Automatic Speech Recognition (ASR) is widely explored to accomplish tasks such as converting natural speech into text and executing actions based on spoken commands. In this paper, a system for recognizing spoken Arabic numerals and words based on two classification methods is proposed. The first approach combines a Convolutional Neural Network (CNN) with Long Short-Term Memory (LSTM) and a Fully Connected (FC) network (CNN-LSTM-FC), while the second is based on the densely connected convolutional network (DenseNet). Both classifiers are integrated into the proposed Arabic speech recognition system and operate on uniform-length sequences of Mel-Frequency Cepstral Coefficients (MFCCs) extracted from the speech utterances. The CNN-LSTM-FC approach is designed to learn high-level features that capture long-term contextual dependencies as well as local information; because these features are more compact than the raw data, training time is reduced. The CNN-LSTM-FC method also captures global contextual information and local correlations from the MFCC coefficients. The DenseNet model is explored to benefit from the \(\frac{{L\left( {L + 1} \right)}}{2}\) direct connections between its \(L\) layers (each layer receives the feature maps of all preceding layers, so the connections sum to \(\sum_{l=1}^{L} l = \frac{L(L+1)}{2}\)), its ability to alleviate the vanishing-gradient problem, and its reduced number of parameters, which together shorten training. Our models were evaluated on two databases: the first contains English voice commands, and the second contains spoken Arabic numerals and words. Experimental tests showed that the CNN-LSTM-FC model with MFCC features performed best on the database of spoken Arabic numerals and words (accuracy = 88.04%, precision = 88.56%, recall = 87.78%, F1 = 88.17, and error = 1.10%) compared to the DenseNet model. On the English voice command database, the best precision (87.15%), F1 (85.66), and error (0.58%) were obtained by the CNN-LSTM-FC model, while the best accuracy (85.40%) and recall (85.40%) were achieved by the DenseNet model. Both proposed models achieved acceptable results on the two databases while requiring relatively little computation to reach high performance.
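
As a minimal sketch of the pipeline described above, the Python snippet below extracts uniform-length MFCC sequences with librosa and stacks a small CNN, an LSTM, and a fully connected classifier in Keras. The 13-coefficient MFCC setting, the 100-frame padding length, the layer sizes, and the class count are all illustrative assumptions; the abstract does not state the authors' exact hyperparameters.

```python
# Hypothetical sketch of the CNN-LSTM-FC classifier on uniform-length
# MFCC sequences; all hyperparameters are illustrative assumptions,
# not the authors' exact configuration.
import librosa
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

N_MFCC = 13       # assumed number of MFCC coefficients per frame
MAX_FRAMES = 100  # assumed uniform sequence length (pad/truncate)
N_CLASSES = 20    # assumed vocabulary size (numerals + isolated words)

def mfcc_sequence(path: str) -> np.ndarray:
    """Load one utterance and return a (MAX_FRAMES, N_MFCC) MFCC matrix."""
    signal, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=N_MFCC).T  # (frames, N_MFCC)
    # Pad or truncate so every utterance yields a uniform-length sequence.
    if mfcc.shape[0] < MAX_FRAMES:
        mfcc = np.pad(mfcc, ((0, MAX_FRAMES - mfcc.shape[0]), (0, 0)))
    return mfcc[:MAX_FRAMES]

def build_cnn_lstm_fc() -> tf.keras.Model:
    """CNN for local correlations -> LSTM for long-term context -> FC classifier."""
    inputs = layers.Input(shape=(MAX_FRAMES, N_MFCC, 1))
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    # Collapse the frequency axis so the LSTM sees one vector per time step.
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    x = layers.LSTM(128)(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(N_CLASSES, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

In a hypothetical training loop, mfcc_sequence would be applied to every utterance in the corpus and a channel axis added before fitting, e.g. model.fit(X[..., None], y, ...). For the DenseNet branch, the dense-block sketch below (growth rate and depth assumed) shows where the \(\frac{L(L+1)}{2}\) direct connections come from: each layer's output is concatenated with all earlier feature maps.

```python
# Hypothetical dense block (reuses `layers` from the Keras import above).
# Each layer receives the concatenated feature maps of every preceding
# layer, yielding L(L+1)/2 direct connections for L layers.
def dense_block(x, num_layers: int = 4, growth_rate: int = 12):
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(growth_rate, (3, 3), padding="same")(y)
        x = layers.Concatenate()([x, y])  # reuse all earlier feature maps
    return x
```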


Metadata
Title
Spoken Utterance Classification Task of Arabic Numerals and Selected Isolated Words
Authors
Karim Dabbabi
Abdelkarim Mars
Publication date
31.03.2022
Publisher
Springer Berlin Heidelberg
Published in
Arabian Journal for Science and Engineering / Issue 8/2022
Print ISSN: 2193-567X
Electronic ISSN: 2191-4281
DOI
https://doi.org/10.1007/s13369-022-06649-0
