
Published: 29.05.2020

Train Sparsely, Generate Densely: Memory-Efficient Unsupervised Training of High-Resolution Temporal GAN

Authors: Masaki Saito, Shunta Saito, Masanori Koyama, Sosuke Kobayashi

Published in: International Journal of Computer Vision | Issue 10-11/2020


Abstract

Training a generative adversarial network (GAN) on a video dataset is challenging because of the sheer size of the dataset and the complexity of each observation. In general, the computational cost of training a GAN scales exponentially with the resolution. In this study, we present a novel memory-efficient method for unsupervised learning on a high-resolution video dataset whose computational cost scales only linearly with the resolution. We achieve this by designing the generator as a stack of small sub-generators and training the model in a specific way: each sub-generator is trained against its own discriminator, and at training time we introduce between each pair of consecutive sub-generators an auxiliary subsampling layer that reduces the frame rate by a certain ratio. This procedure allows each sub-generator to learn the distribution of the video at a different level of resolution. With this method, only a few GPUs are needed to train a highly complex generator that far outperforms its predecessor in terms of inception score.
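The stacked training scheme described in the abstract can be sketched compactly in code. The following is a minimal, illustrative NumPy sketch of the training-time forward pass; the sub-generator, rendering head, and auxiliary frame-subsampling functions, along with all shapes, upsampling factors, and the subsampling ratio, are toy stand-ins chosen for illustration and do not reproduce the paper's actual architecture.

```python
import numpy as np

def sub_generator(x, scale=2):
    # Toy sub-generator: doubles spatial resolution by nearest-neighbour repeat.
    # x has shape (frames, channels, height, width).
    return x.repeat(scale, axis=2).repeat(scale, axis=3)

def render(x):
    # Toy rendering head: collapse features to a 3-channel video at this level.
    return np.tanh(x[:, :3])

def subsample_frames(x, ratio=2):
    # Auxiliary subsampling layer used only at training time:
    # keep every `ratio`-th frame, starting from a random offset.
    offset = np.random.randint(ratio)
    return x[offset::ratio]

def forward(z, num_levels=3, train=True):
    # One pass through the stack of sub-generators. Each level's rendered
    # video would be scored by that level's own discriminator during training.
    x, rendered = z, []
    for level in range(num_levels):
        x = sub_generator(x)
        rendered.append(render(x))          # input to discriminator `level`
        if train and level < num_levels - 1:
            x = subsample_frames(x)         # sparse frames for the next level
    return rendered

z = np.random.randn(16, 8, 4, 4)            # toy latent "video": 16 frames, 4x4
sparse = forward(z, train=True)             # frame count halves at each level
dense = forward(z, train=False)             # no subsampling: dense 16-frame video
print([v.shape for v in sparse])
print([v.shape for v in dense])
```

In this sketch the subsampling layers are simply skipped at inference time (train=False), so every level processes the full-length sequence and the final video is generated densely, which is the "train sparsely, generate densely" behaviour the title refers to.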


Footnotes
1
The pre-trained model itself can be downloaded from http://vlg.cs.dartmouth.edu/c3d/.
 
2
Note that this remark concerns the video beyond the 16th frame; up to 16 frames, our model generates stable videos regardless of the dataset.
 
Metadata
Title
Train Sparsely, Generate Densely: Memory-Efficient Unsupervised Training of High-Resolution Temporal GAN
Authors
Masaki Saito
Shunta Saito
Masanori Koyama
Sosuke Kobayashi
Publication date
29.05.2020
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 10-11/2020
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-020-01333-y
