Published in: International Journal of Computer Vision 3/2021

22-10-2020

Progressive Multi-granularity Analysis for Video Prediction

Authors: Jingwei Xu, Bingbing Ni, Xiaokang Yang

Abstract

Video prediction is challenging because real-world motion dynamics are typically multi-modally distributed. Existing stochastic methods commonly model the random noise input with a simple prior distribution, which is insufficient to capture highly complex motion dynamics. This work proposes a progressive multi-granularity analysis framework to tackle this difficulty. First, to achieve coarse alignment, the input sequence is matched to prototype motion dynamics in the training set, based on self-supervised auto-encoder learning with motion/appearance disentanglement. Second, motion dynamics are transferred from the matched prototype sequence to the input sequence via an adaptively learned kernel, and the predicted frames are further refined by a motion-aware prediction model. Extensive qualitative and quantitative experiments on three widely used video prediction datasets demonstrate that (1) the proposed framework decomposes the hard task into a series of more approachable sub-tasks, each of which admits an easier solution, and (2) the proposed method performs favorably against state-of-the-art prediction methods.
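The two-stage idea in the abstract — coarse alignment to a prototype motion sequence, then motion transfer through a learned kernel — can be illustrated with a toy sketch. Everything below is an illustrative assumption, not the paper's implementation: the frame-difference motion encoding, the nearest-neighbor prototype matching, and the fixed (rather than adaptively learned) transfer kernel are stand-ins for the learned components described in the paper.

```python
import numpy as np

def encode_motion(frames):
    # Toy motion code: average frame difference, flattened.
    # (The paper learns this via a disentangling auto-encoder.)
    diffs = np.diff(frames, axis=0)
    return diffs.mean(axis=0).ravel()

def match_prototype(query_code, prototype_codes):
    # Coarse alignment: pick the nearest prototype by L2 distance.
    dists = np.linalg.norm(prototype_codes - query_code, axis=1)
    return int(np.argmin(dists))

def transfer_motion(frame, kernel):
    # Apply a 2D kernel to the frame ('same' padding) — a stand-in
    # for the adaptively learned kernel that transfers prototype motion.
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(frame, pad, mode="edge")
    out = np.zeros_like(frame)
    h, w = frame.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out
```

In this sketch, a query sequence with uniform unit motion matches the "ones" prototype, and an identity kernel leaves a frame unchanged; the paper's contribution is precisely in learning these components rather than hand-coding them.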


Metadata
Title
Progressive Multi-granularity Analysis for Video Prediction
Authors
Jingwei Xu
Bingbing Ni
Xiaokang Yang
Publication date
22-10-2020
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 3/2021
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-020-01389-w
