RecGen
3D Multi-Object Scene Reconstruction from Sparse Observations

RecGen is a generative framework that reconstructs complete 3D multi-object scenes — including shape, texture, and pose — from one or more RGB-D images, even under heavy occlusion. By combining compositional synthetic data with strong 3D shape priors, it generalizes across diverse objects and real-world settings. RecGen outperforms the previous state of the art by over 30% in shape quality and 34% in pose estimation, while using nearly 80% fewer training meshes.

Methodology: From Images to 3D Scenes
Interactive Reconstruction Viewer

From a single image, RecGen reconstructs a full 3D scene. Select a scene below to explore interactively.

Datasets
Training Data

RecGen is trained on 198K high-quality 3D assets from six public datasets, totaling 3.2M synthetic RGB-D images of compositional multi-object and part-based scenes.

Object Datasets

Compositional tabletop scenes with 3-10 distractor objects.

Part Datasets

Single-object scenes with fine-grained parts and self-occlusions.


Evaluation Datasets

RecGen is evaluated on four object-centric datasets and one part-centric dataset for shape and pose estimation.

LM-O

8 diverse objects in a single scene at various occlusion levels. Object-centric; captured with a Kinect sensor.

HB

33 diverse objects in 13 scenes of varying complexity. Object-centric; captured with a Kinect 2 sensor.

HOPE

28 toy grocery objects in 50 scenes with varying lighting and occlusion. Object-centric; captured with a RealSense sensor.

ReOcS

18 grocery objects with splits based on occlusion extent. Object-centric; captured with a stereo camera.

ArtVIP

Articulated objects with ground-truth part meshes. Part-centric; synthetic renderings.
Results
Qualitative Results
Qualitative comparison across methods: input image, SceneComplete, Any6D, SAM3D, RecGen, and ground truth.

Interactive Comparison

Drag the slider to compare any two methods. Select a dataset, scene, and the methods to compare.


Quantitative Results

Comparison of pose estimation (ADD-SB) and shape reconstruction (CDnorm) metrics across methods. RecGen outperforms baselines on both object-centric and part-centric datasets.

Charts report ADD-SB (lower is better), ADD-SB@0.05 (higher is better), and CDnorm (lower is better), each aggregated across datasets and broken down per dataset.
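For reference, a normalized Chamfer distance of the kind reported above can be sketched as follows; the exact CDnorm definition used in the paper (normalization factor, squared vs. unsquared distances) is an assumption here, not taken from the source.

```python
import numpy as np

def chamfer_normalized(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Chamfer distance between (N, 3) and (M, 3) point sets,
    normalized by the ground-truth bounding-box diagonal.
    Brute-force nearest neighbors; fine for modest point counts."""
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (N, M) pairwise
    cd = dists.min(axis=1).mean() + dists.min(axis=0).mean()  # pred->gt + gt->pred
    diag = np.linalg.norm(gt.max(axis=0) - gt.min(axis=0))    # GT extent for scale invariance
    return float(cd / diag)
```

Normalizing by the ground-truth extent makes the metric comparable across objects of different sizes; identical point sets score exactly zero.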
Occlusions, Symmetry, and Multi-View

A closer look at where RecGen's gains come from: occlusion robustness, symmetric-object handling, and multi-view conditioning.

Occlusion Analysis

Performance by occlusion severity. Average across object-centric datasets (HB, LM-O, ReOcS).

ADD-SB by Occlusion Level (lower is better)
CDnorm by Occlusion Level (lower is better)

Symmetry-Aware Reconstruction

Symmetric objects (e.g. bowls, bottles, cups) are inherently ambiguous from a single viewpoint — many orientations produce identical images. Rather than collapsing to a single (often wrong) pose, RecGen's probabilistic generation naturally handles these ambiguities, producing reconstructions that are faithful to the observation regardless of the underlying symmetry.


Appearance generation for objects with symmetric shapes. RecGen produces pose-consistent textures, while SAM3D generates pose-agnostic appearances that often mismatch the observation.

To quantify this, we use a VLM-based orientation check: the percentage of symmetric-object reconstructions whose orientation matches the ground truth.
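A minimal sketch of how such an alignment rate can be aggregated; the `judge` callable is a hypothetical stand-in for the actual VLM query, whose prompt and interface are not specified here.

```python
from typing import Callable, Sequence

def vlm_alignment_rate(
    renders: Sequence[str],
    gt_views: Sequence[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Percentage of reconstructions whose rendered orientation the
    (hypothetical) VLM-based `judge` deems consistent with the
    ground-truth view."""
    assert len(renders) == len(gt_views) and renders
    matches = sum(judge(r, g) for r, g in zip(renders, gt_views))
    return 100.0 * matches / len(renders)
```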

VLM Alignment Rate (%) — RecGen vs SAM3D

Multi-View Reconstruction

RecGen can leverage multiple RGB-D observations to produce more complete and accurate reconstructions. By fusing information across viewpoints, the model resolves ambiguities and fills in occluded regions. Of the two pose predictions, the better one is selected based on its alignment with the corresponding view's point map in metric camera space.
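The selection step can be sketched as follows, assuming candidate poses are given as 4x4 object-to-camera transforms and the view's point map as a metric (H, W, 3) array; the function name and the nearest-neighbor fitness measure are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def select_best_pose(
    model_pts: np.ndarray,              # (N, 3) canonical object points
    candidate_poses: list[np.ndarray],  # 4x4 object-to-camera transforms
    point_map: np.ndarray,              # (H, W, 3) metric point map of the view
    valid_mask: np.ndarray,             # (H, W) bool, object pixels with valid depth
) -> int:
    """Return the index of the candidate pose whose transformed model
    best aligns with the observed point map (mean nearest-neighbor distance)."""
    obs = point_map[valid_mask].reshape(-1, 3)  # observed 3D points on the object
    errors = []
    for T in candidate_poses:
        pts = model_pts @ T[:3, :3].T + T[:3, 3]  # transform model into camera frame
        # one-sided Chamfer: each observed point to its nearest model point
        d = np.linalg.norm(obs[:, None, :] - pts[None, :, :], axis=-1).min(axis=1)
        errors.append(d.mean())
    return int(np.argmin(errors))
```

A candidate that places the object where the depth observation actually is yields near-zero error, so the misaligned hypothesis is rejected without any learned component.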

ADD-SB (lower is better)
CDnorm (lower is better)
Resources

Paper

Read the full technical details and evaluation in our paper.

arXiv

Code

Training code, evaluation scripts, and pre-trained models.

GitHub

Dataset

Download training data and evaluation benchmarks.

Coming Soon
Reference
@misc{zadaianchuk2026recgen,
      title={RecGen: Reconstructive Generation of 3D Scenes from RGB-D Observations},
      author={Andrii Zadaianchuk and Leonardo Barcellona and Lennard Schuenemann and Christian Gumbsch and Zehao Wang and Muhammad Zubair Irshad and Fabien Despinoy and Rahaf Aljundi and Stratis Gavves and Sergey Zakharov},
      year={2026},
      eprint={2604.27106},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.27106},
}