Short introduction
We propose Neural Renderer, a 3D mesh renderer that can be integrated into neural networks. We applied this renderer to (a) 3D mesh reconstruction from a single image and (b) 2D-to-3D image style transfer and 3D DeepDream. These applications are realized by redefining the “backward pass” of a 3D mesh renderer and incorporating it into neural networks.
Abstract
For modeling the 3D world behind 2D images, which 3D representation is most appropriate? A polygon mesh is a promising candidate for its compactness and geometric properties. However, it is not straightforward to model a polygon mesh from 2D images using neural networks because the conversion from a mesh to an image, or rendering, involves a discrete operation called rasterization, which prevents back-propagation. Therefore, in this work, we propose an approximate gradient for rasterization that enables the integration of rendering into neural networks. Using this renderer, we perform single-image 3D mesh reconstruction with silhouette image supervision and our system outperforms the existing voxel-based approach. Additionally, we perform gradient-based 3D mesh editing operations, such as 2D-to-3D style transfer and 3D DeepDream, with 2D supervision for the first time. These applications demonstrate the potential of the integration of a mesh renderer into neural networks and the effectiveness of our proposed renderer.
A 3D mesh can be accurately reconstructed from a single image using our method.
Comparison with voxel-based method [1]
Mesh reconstruction does not suffer from the low resolution and cubic artifacts seen in voxel reconstruction.
Our approach outperforms the voxel-based approach [1] in 10 out of 13 categories on the voxel IoU metric.
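The voxel IoU metric used in this comparison is the intersection-over-union between the voxelized reconstruction and the ground-truth occupancy grid. A minimal sketch of the metric (NumPy; the function name is ours, not from the paper's evaluation code):

```python
import numpy as np

def voxel_iou(pred, gt):
    """Intersection-over-union between two boolean occupancy grids."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

# Two 2x2x2 grids that overlap in exactly one voxel.
a = np.zeros((2, 2, 2), dtype=bool); a[0, 0, 0] = a[0, 0, 1] = True
b = np.zeros((2, 2, 2), dtype=bool); b[0, 0, 0] = True
print(voxel_iou(a, b))  # 1 voxel in common out of 2 occupied -> 0.5
```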
2D-to-3D style transfer
The styles of the paintings are accurately transferred to the textures and shapes by our method. Note the outline of the bunny and the lid of the teapot.
The style images are Thomson No. 5 (Yellow Sunset) (D. Coupland, 2011), The Tower of Babel (P. Bruegel the Elder, 1563), The Scream (E. Munch, 1910), and Portrait of Pablo Picasso (J. Gris, 1912).
Understanding the 3D world from 2D images is one of the fundamental problems in computer vision. Rendering (3D-to-2D conversion) lies on the borderline between the 3D world and 2D images, and a polygon mesh is an efficient, rich, and intuitive 3D representation. Therefore, the “backward pass” of a 3D mesh renderer is worth pursuing.
Rendering cannot be integrated into neural networks without modification because rasterization prevents back-propagation. In this work, we propose an approximate gradient for rendering, which enables end-to-end training of neural networks that include rendering. Please see the paper for the details of our renderer.
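To build intuition for why rasterization blocks gradients and how an approximate gradient helps, here is a deliberately simplified 1-D toy (ours, for illustration only, not the paper's exact formulation): a pixel flips from dark to bright the instant a face edge sweeps past it, so the true derivative is zero almost everywhere; replacing it with the color change divided by the distance the edge must travel restores a useful training signal.

```python
# 1-D toy rasterization: a pixel centred at p flips from colour 0.0 to
# 1.0 the instant a face edge at position x sweeps past it.
def pixel_value(x, p):
    return 1.0 if x >= p else 0.0

# The exact derivative of pixel_value w.r.t. x is zero almost everywhere,
# which stalls back-propagation. The approximate gradient replaces it with
# the colour change divided by the distance the edge must travel to cause it.
def approx_pixel_grad(x, p):
    if x >= p:
        return 0.0      # pixel already covered; moving further changes nothing
    delta_color = 1.0   # colour jump once the edge reaches the pixel
    delta_x = p - x     # distance the edge must move to reach it
    return delta_color / delta_x

# Gradient ascent on the pixel's brightness moves the edge until it
# covers the pixel, even though the exact gradient is zero everywhere.
x, p = 0.0, 5.0
for _ in range(100):
    x += 0.5 * approx_pixel_grad(x, p)
print(pixel_value(x, p))  # prints 1.0
```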
The applications demonstrated above were performed using this renderer. The figure below shows the pipelines.
The 3D mesh generator was trained with silhouette images: during training, the generator minimizes the difference between the silhouettes of the reconstructed 3D shape and the true silhouettes.
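One common way to measure the difference between a rendered silhouette and the true silhouette is a soft negative-IoU loss. The sketch below (NumPy; the function is our illustration, not the paper's reference code) shows this formulation on arrays with values in [0, 1]:

```python
import numpy as np

def silhouette_loss(pred, true):
    """Soft IoU-style loss between two silhouettes in [0, 1].
    0 when the silhouettes match exactly; larger when they differ."""
    inter = (pred * true).sum()
    union = (pred + true - pred * true).sum()
    return 1.0 - inter / union

# Two 2x2 silhouettes that agree on two of the three occupied pixels each.
pred = np.array([[0.0, 1.0], [1.0, 1.0]])
true = np.array([[1.0, 1.0], [0.0, 1.0]])
print(silhouette_loss(pred, true))  # intersection 2, union 4 -> loss 0.5
```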
2D-to-3D style transfer was performed by optimizing the shape and texture of a mesh to minimize style loss defined on the images. 3D DeepDream was also performed in a similar way.
Both applications were realized by flowing information from 2D image space into 3D space through our renderer.
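The generic pattern behind both applications is gradient-based optimization of 3D parameters under a loss defined in 2D image space. The toy below (ours, for illustration) uses a scalar stand-in for the mesh parameters and a trivially differentiable stand-in for the renderer, but the control flow — render, compute a 2D loss, back-propagate to the 3D parameter — is the same:

```python
import numpy as np

# Toy stand-ins: the "3D parameter" is a scalar scale, the "renderer"
# multiplies a fixed base image by it, and the 2-D loss matches the
# rendered image to a target image.
base = np.array([[0.2, 0.4], [0.6, 0.8]])
target = 1.5 * base          # the optimum is scale = 1.5

def render(scale):
    return scale * base      # trivially differentiable toy renderer

def image_loss(img):
    return ((img - target) ** 2).mean()

scale = 1.0
for _ in range(200):
    # analytic gradient of the 2-D image loss w.r.t. the 3-D parameter
    grad = (2 * (render(scale) - target) * base).mean()
    scale -= 0.5 * grad      # gradient descent in 3-D parameter space
print(round(scale, 3))       # converges toward 1.5
```

In the real pipeline the scalar is replaced by mesh vertices and texture colors, and the image loss by a style or DeepDream loss on deep features, but gradients flow in exactly this direction through the renderer.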
@InProceedings{kato2018renderer,
title={Neural 3D Mesh Renderer},
author={Kato, Hiroharu and Ushiku, Yoshitaka and Harada, Tatsuya},
booktitle={The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2018}
}
References
[1] X. Yan et al. “Perspective Transformer Nets: Learning Single-view 3D Object Reconstruction without 3D Supervision.” Advances in Neural Information Processing Systems (NIPS). 2016.
Papers that use neural renderer
Deformation representation based convolutional mesh autoencoder for 3D hand generation [Zheng et al. Neurocomputing 2020]
SUNNet: A novel framework for simultaneous human parsing and pose estimation [Xu et al. Neurocomputing 2020]
Weakly-supervised Reconstruction of 3D Objects with Large Shape Variation from Single In-the-Wild Images [Sun et al. ACCV 2020]
Learning Object Manipulation Skills via Approximate State Estimation from Real Videos [Petrik et al. CoRL 2020]
Human Parsing Based Texture Transfer from Single Image to 3D Human via Cross-View Consistency [Zhao et al. NeurIPS 2020]