Published in: International Journal of Computer Vision 10/2023

18.06.2023 | Manuscript

Conditional Temporal Variational AutoEncoder for Action Video Prediction

Authors: Xiaogang Xu, Yi Wang, Liwei Wang, Bei Yu, Jiaya Jia


Abstract

To synthesize a realistic action sequence from a single human image, it is crucial to model both the motion patterns and the diversity present in action videos. This paper proposes an Action Conditional Temporal Variational AutoEncoder (ACT-VAE) to improve motion prediction accuracy and capture movement diversity. ACT-VAE predicts the pose sequence of an action clip from a single input image. It is implemented as a deep generative model that maintains temporal coherence according to the action category, using novel temporal modeling of the latent space. Further, ACT-VAE is a general action sequence prediction framework: when connected with a plug-and-play pose-to-image network, it can synthesize image sequences. Extensive experiments show that our approach predicts accurate poses and synthesizes realistic image sequences, surpassing state-of-the-art approaches. Compared to existing methods, ACT-VAE improves accuracy while preserving diversity.
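To make the abstract's key idea concrete, the sketch below illustrates the temporal latent structure it describes: the latent code at each time step is sampled from a prior conditioned on the previous latent and the action category, then decoded into a pose. This is a toy sampler with random placeholder weights and invented names (`sample_pose_sequence`, `W_prior`, `W_dec`), not the authors' trained model or architecture.

```python
import numpy as np

def sample_pose_sequence(pose0, action_id, T=8, z_dim=16, n_actions=5, seed=0):
    """Toy illustration of an action-conditional temporal VAE's generative
    process: z_1 ~ N(0, I); z_t ~ N(f(z_{t-1}, a), I); pose_t = g(z_t, pose0, a).
    All weights are random placeholders, not learned parameters."""
    rng = np.random.default_rng(seed)
    pose_dim = pose0.shape[0]
    a = np.eye(n_actions)[action_id]                # one-hot action condition
    # Linear stand-ins for the prior network f and the pose decoder g.
    W_prior = rng.normal(0, 0.1, (z_dim, z_dim + n_actions))
    W_dec = rng.normal(0, 0.1, (pose_dim, z_dim + pose_dim + n_actions))

    z = rng.normal(size=z_dim)                      # z_1 ~ N(0, I)
    poses = []
    for _ in range(T):
        # Decode the current pose from the latent, the input pose, and action.
        pose = W_dec @ np.concatenate([z, pose0, a])
        poses.append(pose)
        # Prior mean for the next latent depends on the previous latent
        # and the action category -- this is the temporal coherence link.
        mu = W_prior @ np.concatenate([z, a])
        z = mu + rng.normal(size=z_dim)             # z_{t+1} ~ N(mu, I)
    return np.stack(poses)                          # shape (T, pose_dim)

# Sample an 8-step pose sequence for a 17-joint 2D skeleton (34 coordinates).
seq = sample_pose_sequence(np.zeros(34), action_id=2)
```

Because each `z_t` is resampled around an action-conditioned mean rather than fixed, repeated sampling yields different but temporally coherent sequences, which is the diversity property the abstract emphasizes.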


Zurück zum Zitat Unterthiner, T., van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., & Gelly, S. (2018). Towards accurate generative models of video: A new metric & challenges. arXiv:1812.01717. Unterthiner, T., van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., & Gelly, S. (2018). Towards accurate generative models of video: A new metric & challenges. arXiv:​1812.​01717.
Zurück zum Zitat Villegas, R., Yang, J., Ceylan, D., & Lee, H. (2018). Neural kinematic networks for unsupervised motion retargetting. In IEEE conference on computer vision and pattern recognition. Villegas, R., Yang, J., Ceylan, D., & Lee, H. (2018). Neural kinematic networks for unsupervised motion retargetting. In IEEE conference on computer vision and pattern recognition.
Zurück zum Zitat Villegas, R., Yang, J., Hong, S., Lin, X., & Lee, H. (2017). Decomposing motion and content for natural video sequence prediction. arXiv:1706.08033. Villegas, R., Yang, J., Hong, S., Lin, X., & Lee, H. (2017). Decomposing motion and content for natural video sequence prediction. arXiv:​1706.​08033.
Zurück zum Zitat Villegas, R., Yang, J., Zou, Y., Sohn, S., Lin, X., & Lee, H. (2017). Learning to generate long-term future via hierarchical prediction. In ICML. Villegas, R., Yang, J., Zou, Y., Sohn, S., Lin, X., & Lee, H. (2017). Learning to generate long-term future via hierarchical prediction. In ICML.
Zurück zum Zitat Walker, J., Marino, K., Gupta, A., & Hebert, M. (2017). The pose knows: Video forecasting by generating pose futures. In International Conference on Computer Vision. Walker, J., Marino, K., Gupta, A., & Hebert, M. (2017). The pose knows: Video forecasting by generating pose futures. In International Conference on Computer Vision.
Zurück zum Zitat Wandt, B., Rudolph, M., Zell, P., Rhodin, H., & Rosenhahn, B. (2021). Canonpose: Self-supervised monocular 3d human pose estimation in the wild. In IEEE conference on computer vision and pattern recognition. Wandt, B., Rudolph, M., Zell, P., Rhodin, H., & Rosenhahn, B. (2021). Canonpose: Self-supervised monocular 3d human pose estimation in the wild. In IEEE conference on computer vision and pattern recognition.
Zurück zum Zitat Wang, B., Adeli, E., Chiu, H. K., Huang, D. A., & Niebles, J. C. (2019). Imitation learning for human pose prediction. In International Conference on Computer Vision. Wang, B., Adeli, E., Chiu, H. K., Huang, D. A., & Niebles, J. C. (2019). Imitation learning for human pose prediction. In International Conference on Computer Vision.
Zurück zum Zitat Wang, T. C., Liu, M. Y., Zhu, J. Y., Liu, G., Tao, A., Kautz, J., & Catanzaro, B. (2018). Video-to-video synthesis. arXiv:1808.06601. Wang, T. C., Liu, M. Y., Zhu, J. Y., Liu, G., Tao, A., Kautz, J., & Catanzaro, B. (2018). Video-to-video synthesis. arXiv:​1808.​06601.
Zurück zum Zitat Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., & Catanzaro, B. (2018). High-resolution image synthesis and semantic manipulation with conditional gans. In IEEE conference on computer vision and pattern recognition. Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., & Catanzaro, B. (2018). High-resolution image synthesis and semantic manipulation with conditional gans. In IEEE conference on computer vision and pattern recognition.
Zurück zum Zitat Wang, W., Alameda-Pineda, X., Xu, D., Fua, P., Ricci, E., & Sebe, N. (2018). Every smile is unique: Landmark-guided diverse smile generation. In IEEE conference on computer vision and pattern recognition. Wang, W., Alameda-Pineda, X., Xu, D., Fua, P., Ricci, E., & Sebe, N. (2018). Every smile is unique: Landmark-guided diverse smile generation. In IEEE conference on computer vision and pattern recognition.
Zurück zum Zitat Wang, Y., Li, M., Cai, H., Chen, W.M., & Han, S. (2022). Lite pose: Efficient architecture design for 2d human pose estimation. In IEEE conference on computer vision and pattern recognition. Wang, Y., Li, M., Cai, H., Chen, W.M., & Han, S. (2022). Lite pose: Efficient architecture design for 2d human pose estimation. In IEEE conference on computer vision and pattern recognition.
Zurück zum Zitat Wang, Y., Zhang, J., Zhu, H., Long, M., Wang, J., & Yu, P. S. (2019). Memory in memory: A predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics. In IEEE conference on computer vision and pattern recognition. Wang, Y., Zhang, J., Zhu, H., Long, M., Wang, J., & Yu, P. S. (2019). Memory in memory: A predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics. In IEEE conference on computer vision and pattern recognition.
Zurück zum Zitat Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Process. Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Process.
Zurück zum Zitat Wichers, N., Villegas, R., Erhan, D., & Lee, H. (2018). Hierarchical long-term video prediction without supervision. arXiv:1806.04768. Wichers, N., Villegas, R., Erhan, D., & Lee, H. (2018). Hierarchical long-term video prediction without supervision. arXiv:​1806.​04768.
Zurück zum Zitat Wu, Q., Chen, X., Huang, Z., & Wang, W. (2020). Generating future frames with mask-guided prediction. In The IEEE International Conference on Multimedia and Expo. Wu, Q., Chen, X., Huang, Z., & Wang, W. (2020). Generating future frames with mask-guided prediction. In The IEEE International Conference on Multimedia and Expo.
Zurück zum Zitat Xu, J., Ni, B., Li, Z., Cheng, S., & Yang, X. (2018). Structure preserving video prediction. In IEEE conference on computer vision and pattern recognition. Xu, J., Ni, B., Li, Z., Cheng, S., & Yang, X. (2018). Structure preserving video prediction. In IEEE conference on computer vision and pattern recognition.
Zurück zum Zitat Yan, X., Rastogi, A., Villegas, R., Sunkavalli, K., Shechtman, E., Hadap, S., Yumer, E., & Lee, H. (2018). Mt-vae: Learning motion transformations to generate multimodal human dynamics. In The European Conference on Computer Vision. Yan, X., Rastogi, A., Villegas, R., Sunkavalli, K., Shechtman, E., Hadap, S., Yumer, E., & Lee, H. (2018). Mt-vae: Learning motion transformations to generate multimodal human dynamics. In The European Conference on Computer Vision.
Zurück zum Zitat Yang, C., Wang, Z., Zhu, X., Huang, C., Shi, J., & Lin, D. (2018). Pose guided human video generation. In The European Conference on Computer Vision. Yang, C., Wang, Z., Zhu, X., Huang, C., Shi, J., & Lin, D. (2018). Pose guided human video generation. In The European Conference on Computer Vision.
Zurück zum Zitat Yang, Z., Zhu, W., Wu, W., Qian, C., Zhou, Q., Zhou, B., & Loy, C.C. (2020). Transmomo: Invariance-driven unsupervised video motion retargeting. In IEEE conference on computer vision and pattern recognition. Yang, Z., Zhu, W., Wu, W., Qian, C., Zhou, Q., Zhou, B., & Loy, C.C. (2020). Transmomo: Invariance-driven unsupervised video motion retargeting. In IEEE conference on computer vision and pattern recognition.
Zurück zum Zitat Yoo, Y., Yun, S., Jin Chang, H., Demiris, Y., & Young Choi, J. (2017). Variational autoencoded regression: high dimensional regression of visual data on complex manifold. In IEEE conference on computer vision and pattern recognition. Yoo, Y., Yun, S., Jin Chang, H., Demiris, Y., & Young Choi, J. (2017). Variational autoencoded regression: high dimensional regression of visual data on complex manifold. In IEEE conference on computer vision and pattern recognition.
Zurück zum Zitat Yoon, J.S., Liu, L., Golyanik, V., Sarkar, K., Park, H.S., & Theobalt, C. (2021). Pose-guided human animation from a single image in the wild. In IEEE conference on computer vision and pattern recognition. Yoon, J.S., Liu, L., Golyanik, V., Sarkar, K., Park, H.S., & Theobalt, C. (2021). Pose-guided human animation from a single image in the wild. In IEEE conference on computer vision and pattern recognition.
Zurück zum Zitat Yuan, Y., & Kitani, K. (2020). Dlow: Diversifying latent flows for diverse human motion prediction. In The European Conference on Computer Vision. Yuan, Y., & Kitani, K. (2020). Dlow: Diversifying latent flows for diverse human motion prediction. In The European Conference on Computer Vision.
Zurück zum Zitat Zhang, R., Isola, P., Efros, A.A., Shechtman, E., & Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. In IEEE conference on computer vision and pattern recognition. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., & Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. In IEEE conference on computer vision and pattern recognition.
Zurück zum Zitat Zhang, W., Zhu, M., & Derpanis, K.G. (2013). From actemes to action: A strongly-supervised representation for detailed action understanding. In International Conference on Computer Vision. Zhang, W., Zhu, M., & Derpanis, K.G. (2013). From actemes to action: A strongly-supervised representation for detailed action understanding. In International Conference on Computer Vision.
Zurück zum Zitat Zhao, L., Peng, X., Tian, Y., Kapadia, M., & Metaxas, D. (2018). Learning to forecast and refine residual motion for image-to-video generation. In The European Conference on Computer Vision. Zhao, L., Peng, X., Tian, Y., Kapadia, M., & Metaxas, D. (2018). Learning to forecast and refine residual motion for image-to-video generation. In The European Conference on Computer Vision.
Zurück zum Zitat Zhou, X., Huang, S., Li, B., Li, Y., Li, J., & Zhang, Z. (2019). Text guided person image synthesis. In IEEE conference on computer vision and pattern recognition. Zhou, X., Huang, S., Li, B., Li, Y., Li, J., & Zhang, Z. (2019). Text guided person image synthesis. In IEEE conference on computer vision and pattern recognition.
Zurück zum Zitat Zhu, J.Y., Park, T., Isola, P., & Efros, A.A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In International Conference on Computer Vision. Zhu, J.Y., Park, T., Isola, P., & Efros, A.A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In International Conference on Computer Vision.
Zurück zum Zitat Zhu, W., Yang, Z., Di, Z., Wu, W., Wang, Y., & Loy, C.C. (2022). Mocanet: Motion retargeting in-the-wild via canonicalization networks. In AAAI. Zhu, W., Yang, Z., Di, Z., Wu, W., Wang, Y., & Loy, C.C. (2022). Mocanet: Motion retargeting in-the-wild via canonicalization networks. In AAAI.
Zurück zum Zitat Zhu, Z., Huang, T., Shi, B., Yu, M., Wang, B., & Bai, X. (2019). Progressive pose attention transfer for person image generation. In IEEE conference on computer vision and pattern recognition. Zhu, Z., Huang, T., Shi, B., Yu, M., Wang, B., & Bai, X. (2019). Progressive pose attention transfer for person image generation. In IEEE conference on computer vision and pattern recognition.
Zurück zum Zitat Zhuang, W., Wang, C., Chai, J., Wang, Y., Shao, M., & Xia, S. (2022). Music2dance: Dancenet for music-driven dance generation. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) Zhuang, W., Wang, C., Chai, J., Wang, Y., Shao, M., & Xia, S. (2022). Music2dance: Dancenet for music-driven dance generation. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)
Metadata
Title
Conditional Temporal Variational AutoEncoder for Action Video Prediction
Authors
Xiaogang Xu
Yi Wang
Liwei Wang
Bei Yu
Jiaya Jia
Publication date
18.06.2023
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 10/2023
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-023-01832-8
