
01.02.2024 | Regular Paper

Generalizing sentence-level lipreading to unseen speakers: a two-stream end-to-end approach

Authors: Yu Li, Feng Xue, Lin Wu, Yincen Xie, Shujie Li

Published in: Multimedia Systems | Issue 1/2024


Abstract

Lipreading refers to translating a speaker's lip motion in a video into the corresponding text. Existing lipreading methods typically describe the lip motion using visual appearance variations. However, relying solely on visual variations is prone to producing inaccurate text, because different words can share similar lip shapes. Moreover, visual features generalize poorly to unseen speakers, especially when training data is limited. In this paper, we leverage both the visual lip motion and facial landmarks, and propose an effective sentence-level end-to-end approach for lipreading. The facial landmarks are introduced to suppress irrelevant visual features that are sensitive to the specific lip appearance of individual speakers. This enables the model to adapt to different lip shapes and to generalize to unseen speakers. Specifically, the proposed framework consists of two branches corresponding to the visual features and the facial landmarks. The visual branch extracts high-level visual features from the lip movement, while the landmark branch learns both the spatial and the temporal patterns described by the landmarks. For each frame, the feature embeddings from the two streams are fused into a latent vector. A sequence-to-sequence model takes the fused embeddings of all frames as input and decodes them to generate the text. The proposed method is demonstrated to generalize well to unseen speakers on benchmark data sets.
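To make the two-stream pipeline concrete, the following is a minimal PyTorch sketch of the design the abstract describes: a visual branch over lip-region frames, a landmark branch over facial-landmark trajectories, per-frame fusion, and a sequence-to-sequence decoder. All module names (VisualBranch, LandmarkBranch, TwoStreamLipreader), layer choices, and dimensions are illustrative assumptions, not the authors' exact architecture.

```python
# Hedged sketch of a two-stream lipreading model; all hyperparameters
# and layer choices are assumptions for illustration only.
import torch
import torch.nn as nn

class VisualBranch(nn.Module):
    """Extracts per-frame visual features from lip crops of shape (B, T, 3, H, W)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # A 3D convolution captures short-range spatiotemporal appearance cues.
        self.conv3d = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),  # keep the time axis, pool space
        )
        self.proj = nn.Linear(32 * 4 * 4, feat_dim)

    def forward(self, frames):                    # (B, T, 3, H, W)
        x = self.conv3d(frames.transpose(1, 2))   # (B, 32, T, 4, 4)
        x = x.permute(0, 2, 1, 3, 4).flatten(2)   # (B, T, 32*4*4)
        return self.proj(x)                       # (B, T, feat_dim)

class LandmarkBranch(nn.Module):
    """Encodes per-frame facial landmarks (B, T, L, 2) into feature vectors."""
    def __init__(self, n_landmarks=68, feat_dim=256):
        super().__init__()
        # Spatial pattern per frame, then temporal modeling over trajectories.
        self.mlp = nn.Sequential(nn.Linear(n_landmarks * 2, feat_dim), nn.ReLU())
        self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)

    def forward(self, lm):                        # (B, T, L, 2)
        x = self.mlp(lm.flatten(2))               # (B, T, feat_dim)
        out, _ = self.gru(x)
        return out                                # (B, T, feat_dim)

class TwoStreamLipreader(nn.Module):
    """Fuses the two streams per frame and decodes text with a seq2seq model."""
    def __init__(self, vocab_size, feat_dim=256):
        super().__init__()
        self.visual = VisualBranch(feat_dim)
        self.landmark = LandmarkBranch(feat_dim=feat_dim)
        self.encoder = nn.GRU(2 * feat_dim, feat_dim, batch_first=True)
        self.decoder = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.out = nn.Linear(feat_dim, vocab_size)

    def forward(self, frames, landmarks, tgt_tokens):
        # Per-frame fusion of the two feature streams by concatenation.
        fused = torch.cat([self.visual(frames),
                           self.landmark(landmarks)], dim=-1)  # (B, T, 2*feat_dim)
        _, h = self.encoder(fused)                # encoder summary state
        dec_in = self.embed(tgt_tokens)           # teacher forcing at train time
        dec_out, _ = self.decoder(dec_in, h)
        return self.out(dec_out)                  # (B, T_txt, vocab_size)
```

At inference time, the decoder would be unrolled step by step (greedy or beam search) from a start token rather than using teacher forcing; the concatenation-based fusion above is only one plausible choice for combining the two streams.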


Metadata
Title
Generalizing sentence-level lipreading to unseen speakers: a two-stream end-to-end approach
Authors
Yu Li
Feng Xue
Lin Wu
Yincen Xie
Shujie Li
Publication date
01.02.2024
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 1/2024
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-023-01226-3
