research-article

Deep video portraits

Published: 30 July 2018

Abstract

We present a novel approach that enables photo-realistic re-animation of portrait videos using only an input video. In contrast to existing approaches that are restricted to manipulations of facial expressions only, we are the first to transfer the full 3D head position, head rotation, facial expression, eye gaze, and eye blinking from a source actor to a portrait video of a target actor. The core of our approach is a generative neural network with a novel space-time architecture. The network takes as input synthetic renderings of a parametric face model, based on which it predicts photo-realistic video frames for a given target actor. The realism in this rendering-to-video transfer is achieved by careful adversarial training, and as a result, we can create modified target videos that mimic the behavior of the synthetically-created input. In order to enable source-to-target video re-animation, we render a synthetic target video with the reconstructed head animation parameters from a source video, and feed it into the trained network, thus taking full control of the target. With the ability to freely recombine source and target parameters, we are able to demonstrate a large variety of video rewrite applications without explicitly modeling hair, body or background. For instance, we can reenact the full head using interactive user-controlled editing, and realize high-fidelity visual dubbing. To demonstrate the high quality of our output, we conduct an extensive series of experiments and evaluations, where for instance a user study shows that our video edits are hard to detect.
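The re-animation pipeline described above (reconstruct animation parameters from the source, recombine them with the target's identity, render a synthetic conditioning video, and feed a temporal window of it to the trained network) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the parameter field names, the channel-wise frame stacking, and the 11-frame window size are all illustrative choices for the sketch.

```python
import numpy as np

def recombine_parameters(source, target):
    """Hypothetical parameter recombination: take rigid head pose, facial
    expression, and eye gaze from the source actor, while keeping identity
    and illumination from the target. Field names are illustrative."""
    params = dict(target)  # start from the target's full parameter set
    for key in ("rotation", "translation", "expression", "gaze"):
        params[key] = source[key]  # overwrite the transferable parameters
    return params

def spacetime_conditioning(renderings, t, window=11):
    """Stack a sliding window of synthetic conditioning renderings
    channel-wise to form a space-time network input for frame t.
    The window size of 11 is an assumption for this sketch."""
    num_frames, h, w, c = renderings.shape
    # Indices of the current frame and its (window - 1) predecessors,
    # clamped at the sequence boundaries.
    idx = np.clip(np.arange(t - window + 1, t + 1), 0, num_frames - 1)
    return np.concatenate([renderings[i] for i in idx], axis=-1)  # (h, w, window * c)

# Toy usage: 20 synthetic frames of 4x4 RGB renderings.
frames = np.zeros((20, 4, 4, 3), dtype=np.float32)
net_input = spacetime_conditioning(frames, t=15)
print(net_input.shape)  # (4, 4, 33)
```

The channel-wise stack gives the generator access to short-term temporal context in a single forward pass, which is one simple way to realize a "space-time" conditioning input.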


Supplemental Material

- 163-129.mp4 (mp4, 228.2 MB)
- a163-kim.mp4 (mp4, 288 MB)



Published in

ACM Transactions on Graphics, Volume 37, Issue 4 (August 2018), 1670 pages
ISSN: 0730-0301, EISSN: 1557-7368
DOI: 10.1145/3197517

            Copyright © 2018 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

