01-02-2024 | Regular Paper

Generalizing sentence-level lipreading to unseen speakers: a two-stream end-to-end approach

Authors: Yu Li, Feng Xue, Lin Wu, Yincen Xie, Shujie Li

Published in: Multimedia Systems | Issue 1/2024

Abstract

Lipreading translates the lip motion of a speaker in a video into the corresponding text. Existing lipreading methods typically describe lip motion through visual appearance variations alone. However, relying solely on visual variations is prone to producing inaccurate text, since different words can share similar lip shapes. Moreover, visual features generalize poorly to unseen speakers, especially when training data is limited. In this paper, we leverage both lip visual motion and facial landmarks and propose an effective sentence-level, end-to-end approach to lipreading. The facial landmarks are introduced to suppress irrelevant visual features that are sensitive to the specific lip appearance of individual speakers, enabling the model to adapt to different lip shapes and to generalize to unseen speakers. Specifically, the proposed framework consists of two branches, one for visual features and one for facial landmarks. The visual branch extracts high-level visual features from the lip movement, and the landmark branch learns both the spatial and the temporal patterns described by the landmarks. For each frame, the feature embeddings from the two streams are fused into a latent vector that can be decoded into text. A sequence-to-sequence model takes the feature embeddings of all frames as input and decodes them to generate the text. The proposed method is shown to generalize well to unseen speakers on benchmark data sets.
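
The abstract describes the two-stream design only at a high level. The PyTorch sketch below illustrates one plausible reading of it: a visual branch over raw lip crops, a landmark branch over 2D lip coordinates, per-frame fusion, and a GRU-based sequence-to-sequence decoder. All layer choices, dimensions, and class names (VisualBranch, LandmarkBranch, TwoStreamLipreader) are illustrative assumptions, not the authors' published architecture.

```python
# Minimal sketch of the two-stream idea described in the abstract.
# Layer sizes, kernel choices, and names are illustrative assumptions,
# not the architecture published in the paper.
import torch
import torch.nn as nn


class VisualBranch(nn.Module):
    """Maps grayscale lip crops to one embedding per frame."""

    def __init__(self, dim=256):
        super().__init__()
        # Spatio-temporal front end; padding preserves the time axis length T.
        self.conv = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool out space, keep time
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, frames):  # frames: (B, 1, T, H, W)
        x = self.conv(frames)                    # (B, 64, T, 1, 1)
        x = x.flatten(2).transpose(1, 2)         # (B, T, 64)
        return self.proj(x)                      # (B, T, dim)


class LandmarkBranch(nn.Module):
    """Encodes per-frame 2D lip landmarks and their temporal dynamics."""

    def __init__(self, n_points=20, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_points * 2, 128), nn.ReLU(), nn.Linear(128, dim))
        # 1D conv over time captures how the landmark layout moves.
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, lm):  # lm: (B, T, n_points, 2)
        b, t = lm.shape[:2]
        x = self.mlp(lm.reshape(b, t, -1))       # (B, T, dim)
        return self.temporal(x.transpose(1, 2)).transpose(1, 2)


class TwoStreamLipreader(nn.Module):
    """Fuses both streams per frame and decodes text with a GRU seq2seq."""

    def __init__(self, vocab_size=30, dim=256):
        super().__init__()
        self.visual = VisualBranch(dim)
        self.landmark = LandmarkBranch(dim=dim)
        self.fuse = nn.Linear(2 * dim, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, frames, lm, tokens):
        # Per-frame fusion of the two feature streams.
        fused = self.fuse(torch.cat(
            [self.visual(frames), self.landmark(lm)], dim=-1))
        _, h = self.encoder(fused)                  # summary of the video
        y, _ = self.decoder(self.embed(tokens), h)  # teacher forcing
        return self.out(y)                          # (B, L, vocab) logits


# Dummy forward pass: 75 frames of 64x128 lip crops plus 20 lip landmarks.
model = TwoStreamLipreader()
logits = model(torch.randn(2, 1, 75, 64, 128),
               torch.randn(2, 75, 20, 2),
               torch.randint(0, 30, (2, 10)))
print(logits.shape)  # torch.Size([2, 10, 30])
```

In such a setup, training would drive the decoder with teacher forcing under a cross-entropy loss over characters, while inference would decode autoregressively. Again, this is a sketch of the general two-stream pattern under the stated assumptions, not the paper's exact model.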

Metadata
Title
Generalizing sentence-level lipreading to unseen speakers: a two-stream end-to-end approach
Authors
Yu Li
Feng Xue
Lin Wu
Yincen Xie
Shujie Li
Publication date
01-02-2024
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 1/2024
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-023-01226-3
