
2020 | OriginalPaper | Chapter

Set-Top Box Automated Lip-Reading Controller Based on Convolutional Neural Network

Authors: Yuanyao Lu, Haiyang Jiang

Published in: Advances in Human Factors and Systems Interaction

Publisher: Springer International Publishing


Abstract

Automated lip reading (ALR) is a new mode of human-computer interaction that can recognize speech content directly from image sequences of the speaker's lip motion. Using ALR based on a convolutional neural network to control a set-top box, i.e. to power it on/off, change the channel, and adjust the volume, is a challenging and useful task. This paper uses a coupled three-dimensional convolutional neural network (3D-CNN) architecture that, after training, computes the similarity between the captured lip-command feature images and standard reference images. We identify the control commands for powering on/off, changing the channel, and adjusting the volume, achieving a new control mode for set-top boxes. The experimental accuracy of lip-reading recognition is clearly improved compared to other similar methods. This is a meaningful exploration toward the practical use of ALR, and our results and application experience can be easily extended to other machines and smart home systems.
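The matching step described above, comparing a captured lip-command feature representation against stored reference representations, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the single fixed 3D-convolution kernel, the random "clips", the command names, and the cosine-similarity decision rule are all assumptions chosen to make the idea concrete.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3d_valid(clip, kernel):
    """Single-channel 3D convolution (valid padding) over a (T, H, W) clip."""
    windows = sliding_window_view(clip, kernel.shape)  # (T', H', W', kt, kh, kw)
    return np.tensordot(windows, kernel, axes=([3, 4, 5], [0, 1, 2]))

def embed(clip, kernel):
    """Flatten the 3D feature map into an L2-normalised embedding vector."""
    feat = conv3d_valid(clip, kernel).ravel()
    return feat / (np.linalg.norm(feat) + 1e-12)

def recognise(query_clip, reference_embeddings, kernel):
    """Return the command whose stored embedding is most similar (cosine)."""
    q = embed(query_clip, kernel)
    return max(reference_embeddings, key=lambda cmd: float(q @ reference_embeddings[cmd]))

# Hypothetical setup: random stand-ins for lip-motion clips of each command.
rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 3, 3))
commands = ["power_on", "power_off", "channel_up", "channel_down",
            "volume_up", "volume_down"]
refs = {cmd: rng.standard_normal((16, 24, 32)) for cmd in commands}
ref_emb = {cmd: embed(clip, kernel) for cmd, clip in refs.items()}

# A noisy re-capture of the "volume_up" gesture should match its reference.
query = refs["volume_up"] + 0.01 * rng.standard_normal((16, 24, 32))
print(recognise(query, ref_emb, kernel))
```

In the paper's system the embeddings would come from a trained coupled 3D-CNN rather than a fixed kernel, but the decision rule, picking the stored command whose representation is most similar to the captured one, has the same shape.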


Metadata
Copyright Year: 2020
DOI: https://doi.org/10.1007/978-3-030-20040-4_38
