Published in: International Journal of Computer Vision 3/2021

22-10-2020

Progressive Multi-granularity Analysis for Video Prediction

Authors: Jingwei Xu, Bingbing Ni, Xiaokang Yang

Abstract

Video prediction is challenging because real-world motion dynamics are typically multi-modally distributed. Existing stochastic methods commonly model the random noise input with a simple prior distribution, which is insufficient to capture highly complex motion dynamics. This work proposes a progressive multi-granularity analysis framework to tackle this difficulty. First, to achieve coarse alignment, the input sequence is matched to prototype motion dynamics in the training set, based on self-supervised auto-encoder learning with motion/appearance disentanglement. Second, motion dynamics are transferred from the matched prototype sequence to the input sequence via an adaptively learned kernel, and the predicted frames are further refined by a motion-aware prediction model. Extensive qualitative and quantitative experiments on three widely used video prediction datasets demonstrate that (1) the proposed framework decomposes the hard task into a series of more approachable sub-tasks, each of which admits an easier solution, and (2) the proposed method performs favorably against state-of-the-art prediction methods.
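The two-stage idea in the abstract — coarse alignment to a prototype motion sequence, then motion transfer through a learned kernel — can be illustrated with a toy sketch. Everything below is an illustrative assumption, not the paper's implementation: the frame-difference motion encoding, the nearest-neighbor prototype matching, and the fixed (rather than adaptively learned) transfer kernel are stand-ins for the learned components described in the paper.

```python
import numpy as np

def encode_motion(frames):
    # Toy motion code: average frame difference, flattened.
    # (The paper learns this via a disentangling auto-encoder.)
    diffs = np.diff(frames, axis=0)
    return diffs.mean(axis=0).ravel()

def match_prototype(query_code, prototype_codes):
    # Coarse alignment: pick the nearest prototype by L2 distance.
    dists = np.linalg.norm(prototype_codes - query_code, axis=1)
    return int(np.argmin(dists))

def transfer_motion(frame, kernel):
    # Apply a 2D kernel to the frame ('same' padding) — a stand-in
    # for the adaptively learned kernel that transfers prototype motion.
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(frame, pad, mode="edge")
    out = np.zeros_like(frame)
    h, w = frame.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out
```

In this sketch, a query sequence with uniform unit motion matches the "ones" prototype, and an identity kernel leaves a frame unchanged; the paper's contribution is precisely in learning these components rather than hand-coding them.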


Metadata
Title
Progressive Multi-granularity Analysis for Video Prediction
Authors
Jingwei Xu
Bingbing Ni
Xiaokang Yang
Publication date
22-10-2020
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 3/2021
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-020-01389-w
