RecGen
3D Multi-Object Scene Reconstruction from Sparse Observations

RecGen is a generative framework that reconstructs complete 3D multi-object scenes — including shape, texture, and pose — from one or more RGB-D images, even under heavy occlusion. By combining compositional synthetic data with strong 3D shape priors, it generalizes across diverse objects and real-world settings. RecGen outperforms the previous state of the art by over 30% in shape quality and 34% in pose estimation, while using nearly 80% fewer training meshes.

Methodology: From Images to 3D Scenes
Interactive Reconstruction Viewer

From a single image, RecGen reconstructs a full 3D scene. Select a scene below to explore interactively.

Datasets
Training Data

RecGen is trained on 198K high-quality 3D assets from six public datasets, totaling 3.2M synthetic RGB-D images of compositional multi-object and part-based scenes.

Object Datasets

Compositional tabletop scenes with 3-10 distractor objects.

Part Datasets

Single-object scenes with fine-grained parts and self-occlusions.


Evaluation Datasets

RecGen is evaluated on four object-centric datasets and one part-centric dataset for shape and pose estimation.

LM-O

8 diverse objects in a single scene at various occlusion levels. Object-centric; captured with a Kinect sensor.

HB

33 diverse objects in 13 scenes of varying complexity. Object-centric; captured with a Kinect 2 sensor.

HOPE

28 toy grocery objects in 50 scenes with varying lighting and occlusion. Object-centric; captured with a RealSense sensor.

ReOcS

18 grocery objects with splits based on occlusion extent. Object-centric; captured with a stereo camera.

ArtVIP

Articulated objects with ground-truth part meshes. Part-centric; synthetic renderings.
Results
Qualitative Results
Qualitative comparison across methods: input image, SceneComplete, Any6D, SAM3D, RecGen, and ground truth.

Interactive Comparison

Drag the slider to compare any two methods. Select a dataset, scene, and the methods to compare.


Quantitative Results

Comparison of pose estimation (ADD-SB) and shape reconstruction (CDnorm) metrics across methods. RecGen outperforms baselines on both object-centric and part-centric datasets.

Charts report ADD-SB (lower is better), ADD-SB@0.05 (higher is better), and CDnorm (lower is better), each aggregated across datasets and broken down per dataset.
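For reference, a normalized Chamfer distance of the kind reported above can be sketched as follows; the exact CDnorm definition used in the paper (normalization factor, squared vs. unsquared distances) is an assumption here, not taken from the source.

```python
import numpy as np

def chamfer_normalized(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Chamfer distance between (N, 3) and (M, 3) point sets,
    normalized by the ground-truth bounding-box diagonal.
    Brute-force nearest neighbors; fine for modest point counts."""
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (N, M) pairwise
    cd = dists.min(axis=1).mean() + dists.min(axis=0).mean()  # pred->gt + gt->pred
    diag = np.linalg.norm(gt.max(axis=0) - gt.min(axis=0))    # GT extent for scale invariance
    return float(cd / diag)
```

Normalizing by the ground-truth extent makes the metric comparable across objects of different sizes; identical point sets score exactly zero.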
Occlusions, Symmetry, and Multi-View

A closer look at where RecGen's gains come from: occlusion robustness, symmetric-object handling, and multi-view conditioning.

Occlusion Analysis

Performance by occlusion severity. Average across object-centric datasets (HB, LM-O, ReOcS).

ADD-SB by Occlusion Level (lower is better)
CDnorm by Occlusion Level (lower is better)

Symmetry-Aware Reconstruction

Symmetric objects (e.g. bowls, bottles, cups) are inherently ambiguous from a single viewpoint — many orientations produce identical images. Rather than collapsing to a single (often wrong) pose, RecGen's probabilistic generation naturally handles these ambiguities, producing reconstructions that are faithful to the observation regardless of the underlying symmetry.


Appearance generation for objects with symmetric shapes. RecGen produces pose-consistent textures, while SAM3D generates pose-agnostic appearances that often mismatch the observation.

To quantify this, we use a VLM-based orientation check: the percentage of symmetric-object reconstructions whose orientation matches the ground truth.
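A minimal sketch of how such an alignment rate can be aggregated; the `judge` callable is a hypothetical stand-in for the actual VLM query, whose prompt and interface are not specified here.

```python
from typing import Callable, Sequence

def vlm_alignment_rate(
    renders: Sequence[str],
    gt_views: Sequence[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Percentage of reconstructions whose rendered orientation the
    (hypothetical) VLM-based `judge` deems consistent with the
    ground-truth view."""
    assert len(renders) == len(gt_views) and renders
    matches = sum(judge(r, g) for r, g in zip(renders, gt_views))
    return 100.0 * matches / len(renders)
```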

VLM Alignment Rate (%) — RecGen vs SAM3D

Multi-View Reconstruction

RecGen can leverage multiple RGB-D observations to produce more complete and accurate reconstructions. By fusing information across viewpoints, the model resolves ambiguities and fills in occluded regions. Of the two pose predictions, the better one is selected based on its alignment with the corresponding view's point map in metric camera space.
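The selection step can be sketched as follows, assuming candidate poses are given as 4x4 object-to-camera transforms and the view's point map as a metric (H, W, 3) array; the function name and the nearest-neighbor fitness measure are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def select_best_pose(
    model_pts: np.ndarray,              # (N, 3) canonical object points
    candidate_poses: list[np.ndarray],  # 4x4 object-to-camera transforms
    point_map: np.ndarray,              # (H, W, 3) metric point map of the view
    valid_mask: np.ndarray,             # (H, W) bool, object pixels with valid depth
) -> int:
    """Return the index of the candidate pose whose transformed model
    best aligns with the observed point map (mean nearest-neighbor distance)."""
    obs = point_map[valid_mask].reshape(-1, 3)  # observed 3D points on the object
    errors = []
    for T in candidate_poses:
        pts = model_pts @ T[:3, :3].T + T[:3, 3]  # transform model into camera frame
        # one-sided Chamfer: each observed point to its nearest model point
        d = np.linalg.norm(obs[:, None, :] - pts[None, :, :], axis=-1).min(axis=1)
        errors.append(d.mean())
    return int(np.argmin(errors))
```

A candidate that places the object where the depth observation actually is yields near-zero error, so the misaligned hypothesis is rejected without any learned component.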

ADD-SB (lower is better)
CDnorm (lower is better)
Resources

Paper

Read the full technical details and evaluation in our paper.

arXiv

Code

Training code, evaluation scripts, and pre-trained models.

GitHub

Dataset

Download training data and evaluation benchmarks.

Coming Soon
Reference
@misc{zadaianchuk2026recgen,
      title={RecGen: Reconstructive Generation of 3D Scenes from RGB-D Observations},
      author={Andrii Zadaianchuk and Leonardo Barcellona and Lennard Schuenemann and Christian Gumbsch and Zehao Wang and Muhammad Zubair Irshad and Fabien Despinoy and Rahaf Aljundi and Stratis Gavves and Sergey Zakharov},
      year={2026},
      eprint={2604.27106},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.27106},
}