OccNeRF: Self-Supervised Multi-Camera Occupancy Prediction with Neural Radiance Fields

1Tsinghua University    2Beijing National Research Center for Information Science and Technology   
3Gaussian Robotics    4Xiaomi Car   

Demo

Depth Estimation

3D Occupancy Prediction

Abstract

As a fundamental task of vision-based perception, 3D occupancy prediction reconstructs the 3D structure of the surrounding environment, providing detailed information for autonomous driving planning and navigation. However, most existing methods rely heavily on LiDAR point clouds to generate occupancy ground truth, which is not available in a vision-only system. In this paper, we propose OccNeRF, a method for self-supervised multi-camera occupancy prediction. Unlike methods trained on bounded 3D occupancy labels, we must handle unbounded scenes with only raw images as supervision. To address this, we parameterize the reconstructed occupancy fields and reorganize the sampling strategy. Neural rendering is then used to convert the occupancy fields into multi-camera depth maps, which are supervised by multi-frame photometric consistency. Moreover, for semantic occupancy prediction, we design several strategies to polish the prompts and filter the outputs of a pretrained open-vocabulary 2D segmentation model. Extensive experiments on both self-supervised depth estimation and semantic occupancy prediction on the nuScenes dataset demonstrate the effectiveness of our method.
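To give a concrete flavor of the parameterization idea for unbounded scenes, the sketch below shows one common contraction scheme in the mip-NeRF-360 style, which squashes far-away points into a bounded ball. This is only an illustrative stand-in: the exact parameterization used by OccNeRF may differ.

```python
import torch

def contract(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Map unbounded 3D points into a ball of radius 2.

    Points with ||x|| <= 1 are kept as-is; farther points are squashed
    so that all of space fits inside radius 2 (mip-NeRF-360-style).
    Illustrative only -- not the paper's exact parameterization.
    """
    norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    squashed = (2.0 - 1.0 / norm) * (x / norm)
    return torch.where(norm <= 1.0, x, squashed)
```

With this mapping, a point at distance 100 from the origin lands just inside radius 2, so a fixed-size voxel grid over the contracted space can represent arbitrarily distant scene content.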

overview_image

Method

The picture below summarizes our method. We first use a 2D backbone to extract multi-camera features, which are lifted into 3D space via interpolation to obtain volume features. The parameterized occupancy fields are then reconstructed to describe unbounded scenes. To obtain the rendered depth and semantic maps, we perform volume rendering with our reorganized sampling strategy. The multi-frame depths are supervised by a photometric loss. For semantic prediction, we adopt a pretrained Grounded-SAM with prompt cleaning. Green arrows indicate supervision signals.
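The volume-rendering step above converts per-sample densities along a camera ray into an expected depth. The following is a minimal sketch of the standard NeRF-style quadrature for this conversion; the paper's reorganized sampling strategy is not reproduced here.

```python
import torch

def render_depth(density: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Render expected depth from per-sample densities along rays.

    density: (num_rays, num_samples) non-negative densities.
    t:       (num_samples,) increasing sample distances along each ray.
    """
    # Interval lengths; the last interval is treated as (near-)infinite.
    delta = torch.diff(t, append=t[-1:] + 1e10)
    # Per-sample opacity from density * interval length.
    alpha = 1.0 - torch.exp(-density * delta)
    # Transmittance: probability the ray survives up to each sample.
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.roll(trans, shifts=1, dims=-1)
    trans[..., 0] = 1.0
    weights = alpha * trans
    # Expected depth is the weight-averaged sample distance.
    return (weights * t).sum(dim=-1)
```

For example, a ray whose density is zero everywhere except a single near-opaque sample at distance 5 renders a depth of roughly 5. The rendered depth map can then be compared against temporally adjacent frames via a photometric loss, which is the supervision signal described above.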

overview_image

Results

We conducted self-supervised multi-camera depth estimation and 3D occupancy prediction experiments on the nuScenes dataset. Our method requires no 3D supervision for either task.

Depth Estimation

The following table shows self-supervised multi-camera depth estimation results on the nuScenes dataset. We do not use a pretrained segmentation model in this experiment. The results are averaged over the 6 cameras, and 'FSM*' denotes the reproduced FSM result reported in VFDepth. Our method outperforms other state-of-the-art methods by a large margin, demonstrating the effectiveness of OccNeRF.
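The table reports standard depth-estimation metrics. As a reference, two common ones, Abs Rel and RMSE, can be computed as below with median scaling, a typical protocol for scale-ambiguous self-supervised methods (whether this page's numbers use median scaling is not stated here).

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, max_depth: float = 80.0):
    """Compute Abs Rel and RMSE on valid ground-truth pixels.

    Median scaling aligns the scale-ambiguous self-supervised
    prediction to the ground truth before computing metrics.
    """
    mask = (gt > 0) & (gt < max_depth)
    pred, gt = pred[mask], gt[mask]
    pred = pred * np.median(gt) / np.median(pred)  # median scaling
    abs_rel = float(np.mean(np.abs(pred - gt) / gt))
    rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))
    return abs_rel, rmse
```

Note that a prediction that is correct up to a global scale factor scores perfectly under this protocol, since median scaling removes the scale ambiguity before the errors are measured.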
depth-estimation-results
The image below shows qualitative results on the nuScenes dataset. Our method predicts visually appealing depth maps with sharp texture details as well as fine-grained occupancy.
depth-estimation-qual-results

3D Occupancy Prediction

We also conduct semantic occupancy prediction experiments on the Occ3D-nuScenes dataset. Since the pretrained open-vocabulary model cannot recognize ambiguous prompts such as 'other' and 'other flat', we remove these two classes during evaluation. For SimpleOcc, another self-supervised method, we use the same 2D semantic labels from the pretrained model for a fair comparison. As shown in the following table, our method outperforms SimpleOcc and even achieves performance comparable to some fully supervised methods on certain classes.
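Since evaluation excludes the ambiguous 'other' and 'other flat' classes, the mean IoU is averaged over the remaining classes only. A minimal sketch of per-class IoU with excluded class IDs (the class indices here are hypothetical, not Occ3D-nuScenes' actual label map):

```python
import numpy as np

def miou(pred: np.ndarray, gt: np.ndarray, num_classes: int, exclude=()):
    """Mean IoU over voxel labels, skipping excluded class IDs."""
    ious = []
    for c in range(num_classes):
        if c in exclude:
            continue  # drop ambiguous classes from the average
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:  # ignore classes absent from both pred and gt
            ious.append(inter / union)
    return float(np.mean(ious))
```

Excluding a class from the average (rather than scoring it as zero) matters when comparing against fully supervised baselines, since those baselines are typically evaluated over the full label set.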
occupancy-prediction-results
The image below shows qualitative results of semantic occupancy prediction from our model.
occupancy-prediction-qual-results

BibTeX


@article{chubin2023occnerf, 
  title   = {OccNeRF: Self-Supervised Multi-Camera Occupancy Prediction with Neural Radiance Fields}, 
  author  = {Chubin Zhang and Juncheng Yan and Yi Wei and Jiaxin Li and Li Liu and Yansong Tang and Yueqi Duan and Jiwen Lu},
  journal = {arXiv preprint arXiv:2312.09243},
  year    = {2023}
}