
2023 | OriginalPaper | Chapter

Literature Review of Audio-Driven 2D Avatar Video Generation Algorithms

Authors: Yuxuan Li, Han Zhang, Shaozhong Cao, Dan Jiang, Meng Wang, Weiqi Wang

Published in: IEIS 2022

Publisher: Springer Nature Singapore


Abstract

Audio-driven 2D avatar video generation algorithms have a wide range of applications in the media field. The ability to generate a 2D avatar video from nothing more than suitable audio and images has strongly supported the development of online media and related fields. Within these algorithms, accurately coupling speech audio with fine-grained appearance changes, such as facial and gestural movements, has been a continuing focus of improvement: appearance modelling has progressed from matching only the spoken content to also incorporating the emotions conveyed by the speech. Fidelity and synchronization have improved markedly over early experimental results, and the behavior of the generated 2D avatars is increasingly close to that of real humans. This paper surveys existing audio-driven 2D avatar generation algorithms and divides their tasks into two categories: talking face generation and co-speech gesture generation. First, we define each task and describe its application areas. Second, we analyze the core algorithms in the order of their technological development and briefly summarize the performance of each method or model. Third, we present common datasets and evaluation metrics for both tasks and compare the reported performance of several recently proposed algorithms. Finally, we discuss the opportunities and challenges facing the field and outline directions for future research.
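To make the audio-plus-image input and video output relationship concrete, the following is a minimal, illustrative PyTorch sketch of the encoder-decoder structure that many talking face generators share: an identity encoder for the reference image, an audio encoder for a short speech window, and a decoder that renders one frame per window. All module names, tensor sizes, and layer choices here are assumptions for illustration only, not the architecture of any specific method covered in the chapter.

```python
# Illustrative sketch (not any specific paper's method): one generated frame
# per audio window; real systems add temporal modelling and adversarial or
# lip-sync losses on top of this skeleton.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Encodes an assumed 80 x 16 mel-spectrogram window into a motion code."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, mel):          # mel: (B, 1, 80, 16)
        return self.net(mel)         # (B, dim)

class IdentityEncoder(nn.Module):
    """Encodes an assumed 3 x 128 x 128 reference face into an identity code."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, img):          # img: (B, 3, 128, 128)
        return self.net(img)

class FrameDecoder(nn.Module):
    """Decodes the concatenated identity + motion codes into one 128 x 128 frame."""
    def __init__(self, dim=512):
        super().__init__()
        self.fc = nn.Linear(dim, 64 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 64, 8, 8)
        return self.net(x)            # (B, 3, 128, 128)

audio_enc, id_enc, dec = AudioEncoder(), IdentityEncoder(), FrameDecoder()
mel = torch.randn(2, 1, 80, 16)       # dummy audio windows
ref = torch.randn(2, 3, 128, 128)     # dummy reference images
frame = dec(torch.cat([id_enc(ref), audio_enc(mel)], dim=1))
print(frame.shape)                    # torch.Size([2, 3, 128, 128])
```

In this framing, talking face generation and co-speech gesture generation differ mainly in what the decoder renders (a face crop versus a pose sequence or full upper body), while the audio-conditioned encoder-decoder pattern is common to both task families.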


Metadata
Title: Literature Review of Audio-Driven 2D Avatar Video Generation Algorithms
Authors: Yuxuan Li, Han Zhang, Shaozhong Cao, Dan Jiang, Meng Wang, Weiqi Wang
Copyright Year: 2023
Publisher: Springer Nature Singapore
DOI: https://doi.org/10.1007/978-981-99-3618-2_9
