Abstract
Modeling and rendering of dynamic scenes is challenging, as natural scenes often contain complex phenomena such as thin structures, evolving topology, translucency, scattering, occlusion, and biological motion. Mesh-based reconstruction and tracking often fail in these cases, and other approaches (e.g., light field video) typically rely on constrained viewing conditions, which limit interactivity. We circumvent these difficulties by presenting a learning-based approach to representing dynamic objects inspired by the integral projection model used in tomographic imaging. The approach is supervised directly from 2D images in a multi-view capture setting and does not require explicit reconstruction or tracking of the object. Our method has two primary components: an encoder-decoder network that transforms input images into a 3D volume representation, and a differentiable ray-marching operation that enables end-to-end training. By virtue of its 3D representation, our construction extrapolates better to novel viewpoints compared to screen-space rendering techniques. The encoder-decoder architecture learns a latent representation of a dynamic scene that enables us to produce novel content sequences not seen during training. To overcome memory limitations of voxel-based representations, we learn a dynamic irregular grid structure implemented with a warp field during ray-marching. This structure greatly improves the apparent resolution and reduces grid-like artifacts and jagged motion. Finally, we demonstrate how to incorporate surface-based representations into our volumetric-learning framework for applications where the highest resolution is required, using facial performance capture as a case in point.
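As a rough illustration of the ray-marching idea described above, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a dense voxel grid holding an RGB value and an opacity value per voxel over the cube [-1, 1]^3, nearest-neighbor lookup, and a simple additive opacity rule that saturates at 1. The optional `warp` callable stands in for a learned warp field. All function names, parameters, and the accumulation rule are hypothetical choices made for this example.

```python
import numpy as np


def sample_nearest(volume, points):
    """Nearest-neighbor lookup of a (D, H, W, C) grid at points in [-1, 1]^3."""
    res = np.array(volume.shape[:3])
    inside = np.all(np.abs(points) <= 1.0, axis=-1)
    idx = np.round((points + 1.0) * 0.5 * (res - 1)).astype(int)
    idx = np.clip(idx, 0, res - 1)
    values = volume[idx[:, 0], idx[:, 1], idx[:, 2]]
    # Samples that fall outside the grid contribute nothing.
    return values * inside[:, None]


def march_ray(rgb, alpha, origin, direction,
              n_steps=64, t_near=0.0, t_far=4.0, warp=None):
    """Accumulate color front to back along one ray (illustrative rule only)."""
    direction = direction / np.linalg.norm(direction)
    ts = np.linspace(t_near, t_far, n_steps)
    dt = ts[1] - ts[0]
    points = origin + ts[:, None] * direction      # (n_steps, 3)
    if warp is not None:
        # Hypothetical warp field: remaps sample positions before the lookup.
        points = warp(points)
    colors = sample_nearest(rgb, points)           # (n_steps, 3)
    density = sample_nearest(alpha, points)[:, 0]  # (n_steps,)
    # Assumed rule: opacity adds up along the ray and saturates at 1.
    accumulated = np.minimum(np.cumsum(density * dt), 1.0)
    weights = np.diff(accumulated, prepend=0.0)    # per-sample contribution
    color = (weights[:, None] * colors).sum(axis=0)
    return color, accumulated[-1]


# Usage: a uniformly red, fairly dense volume viewed along the z axis.
rgb = np.zeros((8, 8, 8, 3))
rgb[..., 0] = 1.0
alpha = np.full((8, 8, 8, 1), 10.0)
color, opacity = march_ray(rgb, alpha,
                           origin=np.array([0.0, 0.0, -2.0]),
                           direction=np.array([0.0, 0.0, 1.0]))
```

Every step here is a lookup, a cumulative sum, a clamp, or a weighted sum, so the same computation written in an automatic-differentiation framework would let gradients flow from pixel colors back to the voxel values. That is the property that makes end-to-end training from 2D images possible. A trilinear lookup would be the usual choice in practice, since nearest-neighbor sampling gives no useful gradient with respect to the sample positions or a warp.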
Index Terms
- Neural volumes: learning dynamic renderable volumes from images