Neural volumes: learning dynamic renderable volumes from images

Authors:
Stephen Lombardi (Facebook Reality Labs)
Tomas Simon (Facebook Reality Labs)
Jason Saragih (Facebook Reality Labs)
Gabriel Schwartz (Facebook Reality Labs)
Andreas Lehrmann (Facebook Reality Labs)
Yaser Sheikh (Facebook Reality Labs)

ACM Transactions on Graphics, Volume 38, Issue 4, Article No. 65, pp. 1–14
https://doi.org/10.1145/3306346.3323020

Published: 12 July 2019

Abstract

Modeling and rendering of dynamic scenes is challenging, as natural scenes often contain complex phenomena such as thin structures, evolving topology, translucency, scattering, occlusion, and biological motion. Mesh-based reconstruction and tracking often fail in these cases, and other approaches (e.g., light field video) typically rely on constrained viewing conditions, which limit interactivity. We circumvent these difficulties by presenting a learning-based approach to representing dynamic objects inspired by the integral projection model used in tomographic imaging. The approach is supervised directly from 2D images in a multi-view capture setting and does not require explicit reconstruction or tracking of the object. Our method has two primary components: an encoder-decoder network that transforms input images into a 3D volume representation, and a differentiable ray-marching operation that enables end-to-end training. By virtue of its 3D representation, our construction extrapolates better to novel viewpoints compared to screen-space rendering techniques. The encoder-decoder architecture learns a latent representation of a dynamic scene that enables us to produce novel content sequences not seen during training. To overcome memory limitations of voxel-based representations, we learn a dynamic irregular grid structure implemented with a warp field during ray-marching. This structure greatly improves the apparent resolution and reduces grid-like artifacts and jagged motion. Finally, we demonstrate how to incorporate surface-based representations into our volumetric-learning framework for applications where the highest resolution is required, using facial performance capture as a case in point.
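To make the ray-marching component more concrete, here is a minimal NumPy sketch of marching one ray through a voxel grid of color and opacity. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not spell out the accumulation rule, so the standard front-to-back "over" compositing is assumed here, and the names `sample_nearest`, `march_ray`, and the optional `warp` callable (a stand-in for the learned warp field) are hypothetical. Nearest-voxel lookup is used for brevity; a trainable version would more plausibly use an interpolated lookup inside an autodiff framework so that gradients reach the volume and the warp.

```python
import numpy as np


def sample_nearest(volume, points):
    """Nearest-voxel lookup of `volume` at continuous points in [0, 1)^3."""
    res = np.array(volume.shape[:3])
    idx = np.clip(np.floor(points * res).astype(int), 0, res - 1)
    return volume[idx[:, 0], idx[:, 1], idx[:, 2]]


def march_ray(rgb_volume, alpha_volume, origin, direction,
              n_steps=64, step=1.0 / 64, warp=None):
    """Composite color along one ray through a unit-cube volume.

    Assumption: front-to-back "over" compositing; the paper's exact
    accumulation rule may differ.
    """
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)

    # Sample positions at the midpoint of each step along the ray.
    t = (np.arange(n_steps) + 0.5) * step
    points = origin[None, :] + t[:, None] * direction[None, :]

    # Hypothetical warp hook: remap sample positions before the lookup.
    if warp is not None:
        points = warp(points)

    # Samples that fall outside the volume contribute nothing.
    inside = np.all((points >= 0.0) & (points < 1.0), axis=1)
    rgb = sample_nearest(rgb_volume, points)
    alpha = np.clip(sample_nearest(alpha_volume, points), 0.0, 1.0) * inside

    # Transmittance reaching each sample: product of (1 - alpha) before it.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha

    color = (weights[:, None] * rgb).sum(axis=0)
    opacity = weights.sum()
    return color, opacity
```

Calling `march_ray` once per pixel, with the ray origin and direction derived from the camera parameters, yields an image that can be compared against a captured view. Every step after the lookup is ordinary arithmetic, which is what makes this kind of renderer usable inside end-to-end training.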

Supplemental Material

a65-lombardi.zip (396.5 KB)

Index Terms

1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Neural networks
2. Computing methodologies
  1. Computer graphics
    1. Rendering
    2. Shape modeling
      1. Volumetric models

Published in
ACM Transactions on Graphics, Volume 38, Issue 4
August 2019
1480 pages
ISSN: 0730-0301
EISSN: 1557-7368
DOI: 10.1145/3306346
Editor: Olga Sorkine-Hornung, ETH Zurich
Copyright © 2019 Owner/Author
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher: Association for Computing Machinery, New York, NY, United States

Author Tags
differentiable ray marching
ray potentials
volume warping
volumetric rendering
Qualifiers
- research-article