01-02-2024 | Regular Paper

Generalizing sentence-level lipreading to unseen speakers: a two-stream end-to-end approach

Authors: Yu Li, Feng Xue, Lin Wu, Yincen Xie, Shujie Li

Published in: Multimedia Systems | Issue 1/2024

Abstract

Lipreading translates the lip motion of a speaker in a video into the corresponding text. Existing lipreading methods typically describe lip motion through visual appearance variations alone. However, relying solely on visual variations is prone to producing inaccurate text, since different words can share similar lip shapes. Moreover, visual features generalize poorly to unseen speakers, especially when training data is limited. In this paper, we leverage both lip visual motion and facial landmarks and propose an effective sentence-level, end-to-end approach to lipreading. The facial landmarks are introduced to suppress irrelevant visual features that are sensitive to the specific lip appearance of individual speakers, enabling the model to adapt to different lip shapes and to generalize to unseen speakers. Specifically, the proposed framework consists of two branches, one for visual features and one for facial landmarks. The visual branch extracts high-level visual features from the lip movement, and the landmark branch learns both the spatial and the temporal patterns described by the landmarks. For each frame, the feature embeddings from the two streams are fused into a latent vector that can be decoded into text. A sequence-to-sequence model takes the feature embeddings of all frames as input and decodes them to generate the text. The proposed method is shown to generalize well to unseen speakers on benchmark data sets.
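
The abstract describes the two-stream design only at a high level. The PyTorch sketch below illustrates one plausible reading of it: a visual branch over raw lip crops, a landmark branch over 2D lip coordinates, per-frame fusion, and a GRU-based sequence-to-sequence decoder. All layer choices, dimensions, and class names (VisualBranch, LandmarkBranch, TwoStreamLipreader) are illustrative assumptions, not the authors' published architecture.

```python
# Minimal sketch of the two-stream idea described in the abstract.
# Layer sizes, kernel choices, and names are illustrative assumptions,
# not the architecture published in the paper.
import torch
import torch.nn as nn


class VisualBranch(nn.Module):
    """Maps grayscale lip crops to one embedding per frame."""

    def __init__(self, dim=256):
        super().__init__()
        # Spatio-temporal front end; padding preserves the time axis length T.
        self.conv = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool out space, keep time
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, frames):  # frames: (B, 1, T, H, W)
        x = self.conv(frames)                    # (B, 64, T, 1, 1)
        x = x.flatten(2).transpose(1, 2)         # (B, T, 64)
        return self.proj(x)                      # (B, T, dim)


class LandmarkBranch(nn.Module):
    """Encodes per-frame 2D lip landmarks and their temporal dynamics."""

    def __init__(self, n_points=20, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_points * 2, 128), nn.ReLU(), nn.Linear(128, dim))
        # 1D conv over time captures how the landmark layout moves.
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, lm):  # lm: (B, T, n_points, 2)
        b, t = lm.shape[:2]
        x = self.mlp(lm.reshape(b, t, -1))       # (B, T, dim)
        return self.temporal(x.transpose(1, 2)).transpose(1, 2)


class TwoStreamLipreader(nn.Module):
    """Fuses both streams per frame and decodes text with a GRU seq2seq."""

    def __init__(self, vocab_size=30, dim=256):
        super().__init__()
        self.visual = VisualBranch(dim)
        self.landmark = LandmarkBranch(dim=dim)
        self.fuse = nn.Linear(2 * dim, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, frames, lm, tokens):
        # Per-frame fusion of the two feature streams.
        fused = self.fuse(torch.cat(
            [self.visual(frames), self.landmark(lm)], dim=-1))
        _, h = self.encoder(fused)                  # summary of the video
        y, _ = self.decoder(self.embed(tokens), h)  # teacher forcing
        return self.out(y)                          # (B, L, vocab) logits


# Dummy forward pass: 75 frames of 64x128 lip crops plus 20 lip landmarks.
model = TwoStreamLipreader()
logits = model(torch.randn(2, 1, 75, 64, 128),
               torch.randn(2, 75, 20, 2),
               torch.randint(0, 30, (2, 10)))
print(logits.shape)  # torch.Size([2, 10, 30])
```

In such a setup, training would drive the decoder with teacher forcing under a cross-entropy loss over characters, while inference would decode autoregressively. Again, this is a sketch of the general two-stream pattern under the stated assumptions, not the paper's exact model.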

Metadata
Title
Generalizing sentence-level lipreading to unseen speakers: a two-stream end-to-end approach
Authors
Yu Li
Feng Xue
Lin Wu
Yincen Xie
Shujie Li
Publication date
01-02-2024
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 1/2024
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-023-01226-3
