Self-supervised Learning of 3D Objects
from Natural Images

Hiroharu Kato1    Tatsuya Harada1,2
1The University of Tokyo    2RIKEN
arXiv 2019
[Figure panels: Training data = 2D images; Input image]


We present a method to learn single-view reconstruction of the 3D shape, pose, and texture of objects from categorized natural images. These are our results on the CIFAR-10 (first row) and PASCAL (second row) datasets.


We present a method to learn single-view reconstruction of the 3D shape, pose, and texture of objects from categorized natural images in a self-supervised manner. Since this is a severely ill-posed problem, carefully designing the training method and introducing constraints are essential. To avoid the difficulty of training all elements at the same time, we propose first training category-specific base shapes with a fixed pose distribution and simple textures, and subsequently training poses and textures using the obtained shapes. Another difficulty is that shapes and backgrounds sometimes become excessively complicated because they mistakenly try to reproduce appearance that should be explained by textures on object surfaces. To suppress this, we propose using strong regularization and constraints on object surfaces and background images. With these two techniques, we demonstrate that natural image collections such as CIFAR-10 and PASCAL objects can be used for training, which suggests that 3D object reconstruction may be feasible for diverse object categories beyond synthetic datasets.


Full paper is available at

Technical overview

We train 3D shape, texture, pose, and background estimation using categorized natural images. Because ground-truth annotations for these elements are not given, the images themselves must serve as indirect supervision. We therefore adopt a render-and-compare approach, in which the supervision signal is the reconstruction error between an input image and the image rendered from the estimated object elements. This framework is illustrated as follows.

One straightforward idea is to implement this framework with neural networks and train them end-to-end. However, this does not work in practice, because image reconstruction admits several trivial, poor solutions. The following figure shows two critical failures.

One failure is to expand the object so that it covers the whole image and to copy the input image into the texture of the object. Another is to shrink the object to less than one pixel and to copy the input image into the background. Image reconstruction is almost perfect in both cases.
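
The two failure cases can be reproduced in a toy setting. The sketch below is an illustration only, not the model used in this work: it assumes a hypothetical `composite` function that stands in for a renderer by blending a foreground texture over a background with a binary silhouette mask, and a hypothetical `reconstruction_error` that computes the mean squared error. All names and values are chosen for the example.

```python
# Toy illustration (not the paper's renderer): images are 2D lists of gray values.

def composite(mask, texture, background):
    """Blend texture over background; mask is 1 where the object covers a pixel."""
    return [
        [m * t + (1 - m) * b for m, t, b in zip(mrow, trow, brow)]
        for mrow, trow, brow in zip(mask, texture, background)
    ]


def reconstruction_error(image, rendered):
    """Mean squared error between the input image and the rendered image."""
    n = sum(len(row) for row in image)
    total = sum(
        (p - q) ** 2
        for irow, rrow in zip(image, rendered)
        for p, q in zip(irow, rrow)
    )
    return total / n


image = [[0.1, 0.9, 0.1],
         [0.9, 0.5, 0.9],
         [0.1, 0.9, 0.1]]
zeros = [[0.0] * 3 for _ in range(3)]
full_mask = [[1] * 3 for _ in range(3)]   # object covers the whole image
empty_mask = [[0] * 3 for _ in range(3)]  # object is smaller than one pixel

# Failure 1: the input image is copied into the object's texture.
err_expand = reconstruction_error(image, composite(full_mask, image, zeros))
# Failure 2: the input image is copied into the background.
err_shrink = reconstruction_error(image, composite(empty_mask, zeros, image))

print(err_expand, err_shrink)  # prints "0.0 0.0"
```

Both degenerate configurations reach zero reconstruction error without encoding any 3D structure, so the reconstruction error alone cannot distinguish them from a plausible solution.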

However, we know that such reconstructions are implausible as realistic 3D scenes. Introducing our knowledge of 3D objects into the model is the technical key point of this work. Specifically, (1) to focus on learning shapes, we propose two-stage training. In the first stage, using random poses and few-color textures, we learn a category-specific base shape. In the second stage, the full model is trained using the obtained shapes. (2) Sometimes, shapes try to represent edges that belong to textures, and backgrounds try to represent foreground objects. We suppress these behaviors by introducing surface smoothness constraints and background simplicity constraints.
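
To make the role of the two constraints concrete, here is a minimal sketch of penalties of this kind. The exact forms used in this work are not stated on this page, so both functions are assumptions made for illustration: `laplacian_smoothness` measures how far each vertex of a closed 2D contour (a stand-in for a 3D mesh surface) lies from the midpoint of its neighbours, and `total_variation` measures how busy a background image is.

```python
# Hypothetical regularizers; these forms are illustrative assumptions,
# not the constraints defined in the paper.

def laplacian_smoothness(vertices):
    """Sum of squared distances from each vertex of a closed 2D contour
    to the midpoint of its two neighbours (small for smooth contours)."""
    n = len(vertices)
    total = 0.0
    for i in range(n):
        px, py = vertices[i - 1]
        x, y = vertices[i]
        qx, qy = vertices[(i + 1) % n]
        total += (x - (px + qx) / 2) ** 2 + (y - (py + qy) / 2) ** 2
    return total


def total_variation(image):
    """Sum of absolute differences between adjacent pixels
    (zero for a constant background)."""
    h, w = len(image), len(image[0])
    tv = 0.0
    for i in range(h):
        for j in range(w):
            if j + 1 < w:
                tv += abs(image[i][j + 1] - image[i][j])
            if i + 1 < h:
                tv += abs(image[i + 1][j] - image[i][j])
    return tv


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
spiky = [(0, 0), (1, 0), (3, 3), (0, 1)]   # one vertex pulled far outward
flat_bg = [[0.5, 0.5], [0.5, 0.5]]
busy_bg = [[0.0, 1.0], [1.0, 0.0]]

print(laplacian_smoothness(square), laplacian_smoothness(spiky))
print(total_variation(flat_bg), total_variation(busy_bg))
```

Under these assumptions, adding such terms to the reconstruction error with suitable weights makes a spiky shape or a detailed background more costly than a smooth shape or a plain background, which is the intended effect of the constraints described above.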

Our proposed training steps and introduced constraints are illustrated as follows.


Coming soon.


  title={Self-supervised Learning of 3D Objects from Natural Images},
  author={Hiroharu Kato and Tatsuya Harada},