Teaser figure: input images and predictions, without and with view prior learning.
We propose a method to reconstruct 3D shapes that look reasonable from any viewpoint, without 3D supervision. This method improves 3D reconstruction performance on both a synthetic dataset (ShapeNet) and a natural image dataset (PASCAL 3D+).
There is some ambiguity in the 3D shape of an object when the number of observed views is small. Because of this ambiguity, although a 3D object reconstructor can be trained using a single view or a few views per object, reconstructed shapes only fit the observed views and appear incorrect from the unobserved viewpoints. To reconstruct shapes that look reasonable from any viewpoint, we propose to train a discriminator that learns prior knowledge regarding possible views. The discriminator is trained to distinguish the reconstructed views of the observed viewpoints from the views of the unobserved viewpoints. The reconstructor is trained to correct unobserved views by fooling the discriminator. Our method outperforms current state-of-the-art methods on both synthetic and natural image datasets; this validates the effectiveness of our method.
The full paper is available at https://arxiv.org/abs/1811.10719.
The ShapeNet dataset consists of images synthetically rendered from 3D CAD models.
In this experiment, we used only a single view per object to train the 3D reconstructor. The following table shows reconstruction accuracy measured by intersection over union (IoU). The baseline and the proposed method differ only in whether view prior learning (VPL) is applied. Our method improves performance significantly.
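For reference, the IoU metric between two binary occupancy grids can be computed as follows (a minimal NumPy sketch; `voxel_iou` and the toy grids are illustrative, not the paper's evaluation code):

```python
import numpy as np

def voxel_iou(pred, gt):
    """Intersection over Union between two binary voxel grids."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union

# Toy example: two overlapping 2x2x2 occupancy grids.
a = np.zeros((2, 2, 2), dtype=bool); a[0] = True      # 4 occupied voxels
b = np.zeros((2, 2, 2), dtype=bool); b[:, 0] = True   # 4 occupied voxels
print(voxel_iou(a, b))  # intersection 2, union 6 -> about 0.333
```

An IoU of 1.0 means the reconstruction exactly matches the ground-truth shape; the values in the table below are averages over test objects.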
| Method | airplane | bench | cabinet | car | chair | display | lamp | speaker | rifle | sofa | table | phone | vessel | mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline (w/o texture) | .479 | .266 | .466 | .550 | .367 | .265 | .454 | .524 | .382 | .367 | .342 | .337 | .439 | .403 |
| Proposed (w/o texture) | .513 | .376 | .591 | .701 | .444 | .425 | .422 | .596 | .479 | .500 | .436 | .595 | .485 | .505 |
| Baseline (w/ texture) | .483 | .284 | .544 | .535 | .356 | .372 | .443 | .534 | .386 | .370 | .361 | .529 | .448 | .434 |
| Proposed (w/ texture) | .531 | .385 | .591 | .701 | .454 | .423 | .441 | .570 | .521 | .508 | .444 | .601 | .498 | .513 |
The three largest categories in the ShapeNet dataset: airplane, car, and chair.
Our method is particularly effective for the display and phone categories, whose simple silhouettes leave high ambiguity in shape.
Our method is also effective for sofa and bench, because their elongated shapes are difficult to learn from a single view without considering several viewpoints.
Our method is also effective when multiple views per object are available for training. The following table shows the mean reconstruction accuracy on the ShapeNet dataset. Ours consistently outperforms the baseline.
|Number of views per object||2||3||5||10||20|
When twenty views per object are used for training, our proposed method achieves state-of-the-art performance on the ShapeNet dataset.
| Supervision | Method | Reconstruction accuracy (IoU) |
|---|---|---|
| Multi-view training | PTN | .574 |
| Multi-view training | Our best model | .655 |
| 3D supervision | 3D-R2N2 | .560 |
Figure: input images and reconstructions by the proposed method, without and with texture.
The PASCAL 3D+ dataset is composed of natural images with noisy annotations. Our method achieves state-of-the-art performance on this difficult dataset.
| Model type | Method | aeroplane | car | chair | mean |
|---|---|---|---|---|---|
| Category-agnostic models | DRC | .415 | .666 | .247 | .443 |
| Category-specific models | CSDM | .398 | .600 | .291 | .429 |
The following figure shows reconstruction results of a conventional method.
Figure: input image and reconstructed view from the original viewpoint.
As can be seen from the figure, although reconstructed views from the original viewpoints (class A) look correct, views from other viewpoints (class B) look incorrect. We introduce a discriminator that distinguishes views of class A from views of class B. Through training, this discriminator learns prior knowledge about realistic views. The reconstructor is trained to correct the unobserved views by fooling this discriminator, which results in reconstructed shapes that look reasonable from any viewpoint.
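The adversarial objective described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the discriminator outputs and the `bce` helper are hypothetical stand-ins, with scalar probabilities in place of a real network's predictions on rendered views.

```python
import numpy as np

def bce(p, label):
    """Binary cross-entropy for a sigmoid output p against a 0/1 label."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

# Hypothetical discriminator outputs: probability that a view is class A
# (a reconstruction rendered from the observed viewpoint).
d_observed = 0.9    # view rendered from the observed viewpoint (class A)
d_unobserved = 0.2  # view rendered from an unobserved viewpoint (class B)

# Discriminator objective: separate class A (label 1) from class B (label 0).
loss_discriminator = bce(d_observed, 1) + bce(d_unobserved, 0)

# Reconstructor's adversarial objective: fool the discriminator so that
# unobserved views are also judged to be class A (label 1).
loss_reconstructor_adv = bce(d_unobserved, 1)
```

In a full training loop, the two losses would be minimized alternately with respect to the discriminator's and the reconstructor's parameters, alongside the usual reconstruction loss on the observed views.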