Published in:

24.12.2023

RAttSR: A Novel Low-Cost Reconstructed Attention-Based End-to-End Speech Recognizer

Authors: Bachchu Paul, Santanu Phadikar

Published in: Circuits, Systems, and Signal Processing | Issue 4/2024

Abstract

Voice commands are a focus of interest for the next generation of interaction and are expected to play a dominant role in communicating with smart devices in the future. However, language remains a significant barrier to the widespread use of these devices. Even existing models for widely used languages require a large number of parameters, resulting in high computational cost. The greatest drawback of the latest advanced models is that they cannot run on devices with constrained resources. This paper proposes a novel end-to-end speech recognizer based on a low-cost Bidirectional Long Short-Term Memory (BiLSTM) attention model. Mel-spectrograms of the speech signals are generated and fed into the proposed neural attention model to classify isolated words. The model consists of three convolution layers followed by two BiLSTM layers that encode a vector of length 64, which is used to compute attention over the input sequence. The convolution layers characterize the relationships among the energy bins of the spectrogram, the BiLSTM network handles long-range dependencies in the input sequence, and the attention block finds the most significant region of the input sequence, reducing the computational cost of classification. The vector encoded by the attention head is fed to a three-layer fully connected network for recognition. The model uses only 133K parameters, fewer than several current state-of-the-art models for isolated word recognition. Two datasets are used in this study: the Speech Command Dataset (SCD) and a self-made dataset of fifteen spoken color names in Bengali. With the proposed technique, validation and test accuracy on the Bengali color dataset reach 98.82% and 98.95%, respectively, outperforming current state-of-the-art models in both accuracy and model size. When the SCD is trained with the same network model, the average test accuracy is 96.95%. To further support the proposed model, the outcomes are compared with recent state-of-the-art models, and the results show its superiority.
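The pipeline described in the abstract (mel-spectrogram input, three convolution layers, a two-layer BiLSTM producing a 64-dimensional attention representation, and a three-layer fully connected classifier) can be outlined roughly as follows. This is a minimal PyTorch sketch for illustration only: the kernel sizes, channel counts, hidden sizes, and the choice of the middle frame as the attention query are assumptions, not the authors' exact configuration, and the resulting parameter count will differ from the reported 133K.

```python
# Minimal sketch of the described architecture (assumed hyperparameters).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RAttSRSketch(nn.Module):
    def __init__(self, n_mels=40, n_classes=15, att_dim=64):
        super().__init__()
        # Three convolution layers over the (mel x time) spectrogram
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Two-layer bidirectional LSTM over the time axis
        self.bilstm = nn.LSTM(input_size=32 * n_mels, hidden_size=att_dim,
                              num_layers=2, batch_first=True,
                              bidirectional=True)
        # Project BiLSTM outputs into a 64-dimensional attention space
        self.query_proj = nn.Linear(2 * att_dim, att_dim)
        self.key_proj = nn.Linear(2 * att_dim, att_dim)
        # Three fully connected layers for recognition
        self.fc = nn.Sequential(
            nn.Linear(2 * att_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, n_classes),
        )

    def forward(self, mel):                       # mel: (B, 1, n_mels, T)
        h = self.conv(mel)                        # (B, 32, n_mels, T)
        b, c, f, t = h.shape
        seq = h.permute(0, 3, 1, 2).reshape(b, t, c * f)   # (B, T, 32*n_mels)
        out, _ = self.bilstm(seq)                 # (B, T, 2*att_dim)
        # Attention: the middle frame serves as the query (assumption)
        query = self.query_proj(out[:, t // 2])   # (B, att_dim)
        keys = self.key_proj(out)                 # (B, T, att_dim)
        scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)    # (B, T)
        weights = F.softmax(scores, dim=1)
        context = torch.bmm(weights.unsqueeze(1), out).squeeze(1)  # (B, 2*att_dim)
        return self.fc(context)                   # class logits

# Example: a batch of 40-bin mel-spectrograms with 98 frames (about 1 s of audio)
if __name__ == "__main__":
    model = RAttSRSketch(n_mels=40, n_classes=15)
    logits = model(torch.randn(8, 1, 40, 98))
    print(logits.shape)  # torch.Size([8, 15])
```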

Metadata
Title
RAttSR: A Novel Low-Cost Reconstructed Attention-Based End-to-End Speech Recognizer
Authors
Bachchu Paul
Santanu Phadikar
Publication date
24.12.2023
Publisher
Springer US
Published in
Circuits, Systems, and Signal Processing / Issue 4/2024
Print ISSN: 0278-081X
Electronic ISSN: 1531-5878
DOI
https://doi.org/10.1007/s00034-023-02570-5