
Published: 29.05.2020

Train Sparsely, Generate Densely: Memory-Efficient Unsupervised Training of High-Resolution Temporal GAN

Authors: Masaki Saito, Shunta Saito, Masanori Koyama, Sosuke Kobayashi

Published in: International Journal of Computer Vision | Issue 10-11/2020


Abstract

Training a generative adversarial network (GAN) on a video dataset is challenging because of the sheer size of the dataset and the complexity of each observation. In general, the computational cost of training a GAN scales exponentially with the resolution. In this study, we present a novel memory-efficient method for unsupervised learning on a high-resolution video dataset whose computational cost scales only linearly with the resolution. We achieve this by designing the generator as a stack of small sub-generators and training the model in a specific way: each sub-generator is trained against its own discriminator, and at training time we introduce between each pair of consecutive sub-generators an auxiliary subsampling layer that reduces the frame rate by a certain ratio. This procedure allows each sub-generator to learn the distribution of the video at a different level of resolution. With this method, only a few GPUs are needed to train a highly complex generator that far outperforms its predecessor in terms of inception score.
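The stacked training scheme described in the abstract can be sketched compactly in code. The following is a minimal, illustrative NumPy sketch of the training-time forward pass; the sub-generator, rendering head, and auxiliary frame-subsampling functions, along with all shapes, upsampling factors, and the subsampling ratio, are toy stand-ins chosen for illustration and do not reproduce the paper's actual architecture.

```python
import numpy as np

def sub_generator(x, scale=2):
    # Toy sub-generator: doubles spatial resolution by nearest-neighbour repeat.
    # x has shape (frames, channels, height, width).
    return x.repeat(scale, axis=2).repeat(scale, axis=3)

def render(x):
    # Toy rendering head: collapse features to a 3-channel video at this level.
    return np.tanh(x[:, :3])

def subsample_frames(x, ratio=2):
    # Auxiliary subsampling layer used only at training time:
    # keep every `ratio`-th frame, starting from a random offset.
    offset = np.random.randint(ratio)
    return x[offset::ratio]

def forward(z, num_levels=3, train=True):
    # One pass through the stack of sub-generators. Each level's rendered
    # video would be scored by that level's own discriminator during training.
    x, rendered = z, []
    for level in range(num_levels):
        x = sub_generator(x)
        rendered.append(render(x))          # input to discriminator `level`
        if train and level < num_levels - 1:
            x = subsample_frames(x)         # sparse frames for the next level
    return rendered

z = np.random.randn(16, 8, 4, 4)            # toy latent "video": 16 frames, 4x4
sparse = forward(z, train=True)             # frame count halves at each level
dense = forward(z, train=False)             # no subsampling: dense 16-frame video
print([v.shape for v in sparse])
print([v.shape for v in dense])
```

In this sketch the subsampling layers are simply skipped at inference time (train=False), so every level processes the full-length sequence and the final video is generated densely, which is the "train sparsely, generate densely" behaviour the title refers to.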


Footnotes
1
The pre-trained model itself can be downloaded from http://vlg.cs.dartmouth.edu/c3d/.
 
2
Note that this remark concerns the video beyond the 16th frame; up to 16 frames, our model generates stable videos regardless of the dataset.
 
Metadata
Title
Train Sparsely, Generate Densely: Memory-Efficient Unsupervised Training of High-Resolution Temporal GAN
Authors
Masaki Saito
Shunta Saito
Masanori Koyama
Sosuke Kobayashi
Publication date
29.05.2020
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 10-11/2020
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-020-01333-y
