Abstract
With the rising interest in personalized VR and gaming experiences comes the need to create high-quality 3D avatars that are both low-cost and varied. As a result, building dynamic avatars from a single unconstrained input image is becoming a popular application. While previous techniques that attempt this require multiple input images or rely on transferring dynamic facial appearance from a source actor, we do so using only one 2D input image, without any form of transfer from a source image. We achieve this using a new conditional Generative Adversarial Network design that allows fine-scale manipulation of any facial input image into a new expression while preserving its identity. Our photoreal avatar GAN (paGAN) can also synthesize the unseen mouth interior and control the eye-gaze direction of the output, as well as produce the final image from a novel viewpoint. The method is even capable of generating fully-controllable, temporally stable video sequences, despite not using temporal information during training. After training, we can use our network to produce dynamic image-based avatars that are controllable on mobile devices in real time. To do this, we compute a fixed set of output images that correspond to key blendshapes, from which we extract textures in UV space. Using a subject's expression blendshape coefficients at run-time, we can linearly blend these key textures together to achieve the desired appearance. Furthermore, we can use the mouth interior and eye textures produced by our network to synthesize on-the-fly avatar animations for those regions. Our work produces state-of-the-art quality image and video synthesis, and is the first to our knowledge that is able to generate a dynamically textured avatar with a mouth interior, all from a single image.
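The runtime stage described above amounts to a weighted linear combination of precomputed UV textures. A minimal NumPy sketch of that blending step follows; the function name, array shapes, and weight normalization are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def blend_key_textures(key_textures: np.ndarray, weights) -> np.ndarray:
    """Linearly blend K precomputed key-expression textures (K, H, W, 3)
    in UV space using K per-frame blendshape weights, returning the
    blended (H, W, 3) texture for the current frame."""
    w = np.asarray(weights, dtype=np.float32)
    w = w / max(float(w.sum()), 1e-8)  # assumed: normalize so colors stay in range
    # Contract the weight vector against the texture stack's first axis.
    return np.tensordot(w, key_textures.astype(np.float32), axes=1)

# Example: blend two tiny 2x2 RGB key textures with equal weights.
keys = np.stack([np.zeros((2, 2, 3)), np.ones((2, 2, 3))])
frame = blend_key_textures(keys, [1.0, 1.0])  # mid-gray everywhere
```

In practice the weights would come from a face tracker's per-frame blendshape coefficients, so the blend reduces to one small matrix-vector product per frame, which is what makes the mobile real-time constraint feasible.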