
2023 | Original Paper | Book Chapter

Literature Review of Audio-Driven 2D Avatar Video Generation Algorithms

Authors: Yuxuan Li, Han Zhang, Shaozhong Cao, Dan Jiang, Meng Wang, Weiqi Wang

Published in: IEIS 2022

Publisher: Springer Nature Singapore


Abstract

Audio-driven 2D avatar video generation algorithms have a wide range of applications in the media field. The ability to generate 2D avatar videos from nothing more than suitable audio and images has given a strong boost to online media and related fields. Within these algorithms, accurately coupling speech audio with subtle changes in appearance, such as facial and gestural movements, has been a point of continuous improvement: appearance modeling has moved from an early focus on matching speech content alone toward also incorporating the emotions expressed in speech. Fidelity and synchronization have improved significantly over early experimental results, and the behavior of the 2D avatars in generated videos is approaching that of humans. This paper provides an overview of existing audio-driven 2D avatar generation algorithms and classifies their tasks into two categories: talking face generation and co-speech gesture generation. First, we describe each task in detail and its application areas. Second, we analyze the core algorithms in order of technological advancement and briefly describe the performance of each method or model. Third, we present common datasets and evaluation metrics for both types of task and compare the reported performance of several recently proposed algorithms. Finally, we discuss the opportunities and challenges faced by the field and outline future research directions.
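To make the task concrete, the sketch below illustrates, in highly simplified form, the encoder-decoder pipeline that audio-driven talking face methods broadly share: an audio encoder maps a short speech window to a content embedding, an identity encoder maps the reference image to an appearance embedding, and a decoder renders one frame per audio window. All module names, feature sizes, and the mel-spectrogram front end are hypothetical placeholders introduced for illustration; this is a minimal sketch, not the architecture of any specific method covered in the paper.

```python
# Minimal, illustrative sketch of an audio-driven talking face pipeline:
# audio embedding + identity embedding -> one generated frame per audio window.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Maps a short mel-spectrogram window to a speech-content embedding."""
    def __init__(self, n_mels=80, frames=16, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                              # (B, n_mels, frames) -> (B, n_mels*frames)
            nn.Linear(n_mels * frames, 512), nn.ReLU(),
            nn.Linear(512, dim),
        )
    def forward(self, mel):
        return self.net(mel)

class IdentityEncoder(nn.Module):
    """Maps a reference face image to an appearance/identity embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 128 -> 64
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 64 -> 32
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )
    def forward(self, img):
        return self.net(img)

class FrameDecoder(nn.Module):
    """Renders one 128x128 frame from the fused audio + identity embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.fc = nn.Linear(2 * dim, 64 * 16 * 16)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 64
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Tanh(),   # 64 -> 128
        )
    def forward(self, audio_emb, id_emb):
        z = self.fc(torch.cat([audio_emb, id_emb], dim=1)).view(-1, 64, 16, 16)
        return self.net(z)

def generate_video(mel_windows, reference_image):
    """One frame per audio window; training would add reconstruction/adversarial losses."""
    audio_enc, id_enc, dec = AudioEncoder(), IdentityEncoder(), FrameDecoder()
    id_emb = id_enc(reference_image)                       # (1, 256), fixed for the whole clip
    frames = [dec(audio_enc(w.unsqueeze(0)), id_emb) for w in mel_windows]
    return torch.stack(frames, dim=1)                      # (1, T, 3, 128, 128)

if __name__ == "__main__":
    mel = torch.randn(25, 80, 16)           # 25 audio windows, e.g. one second at 25 fps
    ref = torch.randn(1, 3, 128, 128)       # one reference face image
    print(generate_video(mel, ref).shape)   # torch.Size([1, 25, 3, 128, 128])
```

In practice the surveyed methods replace these toy modules with much deeper convolutional or recurrent networks and train them with adversarial, reconstruction, and lip-sync losses, but the audio-plus-image-to-frames interface remains the same.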


Metadata
Title
Literature Review of Audio-Driven 2D Avatar Video Generation Algorithms
Authors
Yuxuan Li
Han Zhang
Shaozhong Cao
Dan Jiang
Meng Wang
Weiqi Wang
Copyright Year
2023
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-99-3618-2_9
