Abstract
In facial animation, the accurate shape and motion of a virtual human's lips are of paramount importance, since subtle nuances in mouth expression strongly influence the interpretation of speech and the conveyed emotion. Unfortunately, passive photometric reconstruction of expressive lip motions, such as a kiss or rolling lips, is fundamentally hard even for multi-view methods in controlled studios. To alleviate this problem, we present a novel approach for fully automatic reconstruction of detailed and expressive lip shapes, along with the dense geometry of the entire face, from just monocular RGB video. To this end, we learn the difference between the inaccurate lip shapes found by a state-of-the-art monocular facial performance capture approach and the true 3D lip shapes reconstructed by a high-quality multi-view system, using applied lip tattoos that are easy to track. A robust gradient-domain regressor is trained to infer accurate lip shapes from coarse monocular reconstructions, with the additional help of automatically extracted inner and outer 2D lip contours. We show, quantitatively and qualitatively, that our monocular approach reconstructs higher-quality lip shapes than previous monocular approaches, even for complex shapes like a kiss or lip rolling. Furthermore, we compare the performance of person-specific and multi-person generic regression strategies and show that our approach generalizes to new individuals and general scenes, enabling high-fidelity reconstruction even from commodity video footage.
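The core learning step described above, regressing from coarse monocular lip reconstructions to the corrective offsets that bring them onto the multi-view ground truth, can be illustrated with a minimal sketch. This is not the paper's actual gradient-domain formulation; it substitutes a plain closed-form ridge regression on flattened feature vectors (coarse lip vertex positions plus 2D contour points) to show the train/apply structure, and all function and variable names are hypothetical.

```python
import numpy as np

def train_corrective_regressor(X, Y, lam=1e-2):
    """Fit a ridge regressor mapping coarse lip features to corrective offsets.

    X : (n_frames, d_in)  -- per-frame features: flattened coarse monocular lip
                             vertices and detected 2D inner/outer contour points
    Y : (n_frames, d_out) -- ground-truth minus coarse lip vertex positions,
                             i.e. the 3D corrections the regressor should predict
    """
    # Append a bias column so the regressor can also learn a constant offset.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    d = Xb.shape[1]
    # Closed-form ridge solution: W = (Xb^T Xb + lam * I)^{-1} Xb^T Y
    W = np.linalg.solve(Xb.T @ Xb + lam * np.eye(d), Xb.T @ Y)
    return W

def apply_corrective_regressor(W, x):
    """Predict the corrective offsets for one frame's feature vector x."""
    xb = np.append(x, 1.0)
    return xb @ W
```

The corrected lip shape for a frame is then the coarse monocular reconstruction plus the predicted offsets. Training two variants of `W`, one on a single person's data and one on a pooled multi-person set, mirrors the person-specific versus generic regression strategies compared in the paper.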
Supplemental Material
Supplemental file available for download.
Index Terms
- Corrective 3D reconstruction of lips from monocular video