Katharina Schmid¹, Nicolas von Lützow¹, Jozef Hladký², Angela Dai¹, Matthias Nießner¹
¹ Technical University of Munich, ² Computing Systems Lab, Huawei Technologies, Switzerland
We introduce a new approach to high-fidelity 3D scene reconstruction from multi-view RGB images that tightly couples reconstruction with a strong generative 3D prior. We cast scene reconstruction as conditional 3D generation over a set of spatially-localized, overlapping chunks that together tile the scene, scaling generation to large scene extents. Crucially, we inherit the fidelity and completeness of state-of-the-art generative shape models -- we use Trellis.2 as an example -- which we generalize to the scene level. To this end, we propose a projection-based conditioning mechanism that lifts posed multi-view image features into a coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene, yielding high-fidelity, multi-view consistent generated geometry. This enables lifting the strong object-level prior of Trellis.2 to multi-view, scene-scale generation, producing faithful, editable PBR mesh reconstructions of indoor environments. As a result, we obtain high-fidelity results that outperform cutting-edge reconstruction methods by 16%.
✅ Paper release (22.05.2026)
✅ Code release (29.06.2026)
✅ Checkpoint release (29.06.2026)
- Clone the repo:
git clone -b main https://github.com/kasothaphie/GenRecon.git --recursive
cd GenRecon
- Set up environment:
The simplest path is the bundled setup script, which creates the conda env and
installs PyTorch, Flash-Attention, and all CUDA extensions. It expects a working
CUDA toolkit on your system, so set CUDA_HOME (and your GPU architecture)
beforehand. For background on the script and troubleshooting, refer to
microsoft/TRELLIS.2.
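If you are unsure which value to use for the GPU architecture, the following is a minimal sketch for deriving it from the compute capability of the visible GPUs. It assumes a Python environment where PyTorch is already importable (which is not the case before the first run of the setup script); the helper names `format_arch_list` and `detect_capabilities` are illustrative and not part of this repository.

```python
def format_arch_list(capabilities):
    """Turn (major, minor) compute capabilities into a TORCH_CUDA_ARCH_LIST value."""
    # Deduplicate and sort so multi-GPU machines yield a stable, compact list.
    unique = sorted(set(capabilities))
    return ";".join(f"{major}.{minor}" for major, minor in unique)


def detect_capabilities():
    """Query visible GPUs through PyTorch; returns [] if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [torch.cuda.get_device_capability(i)
            for i in range(torch.cuda.device_count())]


if __name__ == "__main__":
    caps = detect_capabilities()
    if caps:
        print(f'export TORCH_CUDA_ARCH_LIST="{format_arch_list(caps)}"')
    else:
        print("No CUDA device visible; set TORCH_CUDA_ARCH_LIST by hand.")
```

Without PyTorch, the compute capability can be looked up in NVIDIA's GPU documentation and passed to `format_arch_list` directly.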
export CUDA_HOME=/usr/local/cuda # path to your CUDA 12.x toolkit
export TORCH_CUDA_ARCH_LIST="9.0" # adjust for your GPU
. ./setup.sh --new-env --basic --flash-attn --nvdiffrast --nvdiffrec --cumesh --o-voxel --flexgemm
Alternative: create the environment manually (e.g. if you don't have a system CUDA toolkit)
Create the env yourself and install CUDA (and ninja) into it via conda.
conda create -n genrecon python=3.10 nvidia::cuda-toolkit=12.6 ninja
conda activate genrecon
conda env config vars set CUDA_HOME=$CONDA_PREFIX
conda deactivate && conda activate genrecon # to set the env variable from above
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu126
conda install -c conda-forge libjpeg-turbo xorg-libx11
export TORCH_CUDA_ARCH_LIST="9.0" # adjust for your GPU
. ./setup.sh --basic --nvdiffrast --nvdiffrec --cumesh --o-voxel --flexgemm
Then install Flash-Attention manually. Pick the wheel matching your Python/torch/CUDA/ABI from the v2.7.3 releases; the example below requires an Ampere or newer GPU (sm_80+).
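To make the wheel choice less error-prone, here is a small sketch that assembles a wheel URL from its components. The naming pattern is inferred solely from the example URL in this README; whether a wheel exists for a given combination still has to be checked on the Flash-Attention releases page. The function name `flash_attn_wheel_url` is hypothetical.

```python
def flash_attn_wheel_url(python_tag, torch_version, cuda_major, cxx11_abi,
                         version="2.7.3"):
    """Assemble a Flash-Attention wheel URL (pattern inferred from the README example).

    python_tag:    CPython tag, e.g. "cp310" for Python 3.10
    torch_version: major.minor of the installed torch, e.g. "2.6"
    cuda_major:    CUDA major version torch was built against, e.g. 12
    cxx11_abi:     whether torch was compiled with the C++11 ABI
    """
    abi = "TRUE" if cxx11_abi else "FALSE"
    name = (
        f"flash_attn-{version}+cu{cuda_major}torch{torch_version}"
        f"cxx11abi{abi}-{python_tag}-{python_tag}-linux_x86_64.whl"
    )
    base = "https://github.com/Dao-AILab/flash-attention/releases/download"
    return f"{base}/v{version}/{name}"


# Reproduces the example for Python 3.10, torch 2.6, CUDA 12.x, C++11 ABI enabled.
print(flash_attn_wheel_url("cp310", "2.6", 12, True))
```

Inside the activated environment, `torch.__version__`, `torch.version.cuda`, and `torch.compiled_with_cxx11_abi()` report the values to plug in.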
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.6cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
We are grateful to the authors of the following datasets, whose data made this work possible. Please refer to the respective sources for licensing and download instructions.
| Dataset | Source data |
|---|---|
| SAGE-10k | original data |
| 3D-FRONT | original data (no longer available) |
| ScanNet++ | original data |
3D-FRONT note: the original source data is no longer available, but more recent re-releases (e.g. this one) should work similarly.
We provide finetuned checkpoints for the three generative models (sparse structure, shape, and texture). Please refer to https://kaldir.vc.cit.tum.de/genrecon/README.md for download instructions. Alternatively, just run:
wget https://kaldir.vc.cit.tum.de/genrecon/sparse_structure.pt
wget https://kaldir.vc.cit.tum.de/genrecon/shape_slat.pt
wget https://kaldir.vc.cit.tum.de/genrecon/texture_slat.pt
Before training, we need to render the 3D indoor scenes, create chunks, convert them into the O-Voxel representation, and compute the latent representation.
Please refer to data_toolkit_scenes/README.md for detailed instructions.
python train.py \
--config configs/gen/ss_flow_img/genrecon.json \
--output_dir results/ss_gen \
--data_dir "{\"SAGE\": {\"base\": \"${DATA_ROOT}/SAGE\", \"ss_latent\": \"${DATA_ROOT}/SAGE/ss_latents/ss_enc_conv3d_16l8_fp16_64\"}}" \
--val_data_dir "{\"SAGE\": {\"base\": \"${DATA_ROOT}/SAGE_val\", \"ss_latent\": \"${DATA_ROOT}/SAGE_val/ss_latents/ss_enc_conv3d_16l8_fp16_64\"}}"

python train.py \
--config configs/gen/slat_flow_img2shape/genrecon_512.json \
--output_dir results/shape_gen \
--data_dir "{\"SAGE\": {\"base\": \"${DATA_ROOT}/SAGE\", \"shape_latent\": \"${DATA_ROOT}/SAGE/shape_latents/shape_enc_next_dc_f16c32_fp16_512\"}}" \
--val_data_dir "{\"SAGE\": {\"base\": \"${DATA_ROOT}/SAGE_val\", \"shape_latent\": \"${DATA_ROOT}/SAGE_val/shape_latents/shape_enc_next_dc_f16c32_fp16_512\"}}"

python train.py \
--config configs/gen/slat_flow_imgshape2tex/genrecon_512.json \
--output_dir results/tex_gen \
--data_dir "{\"SAGE\": {\"base\": \"${DATA_ROOT}/SAGE\", \"shape_latent\": \"${DATA_ROOT}/SAGE/shape_latents/shape_enc_next_dc_f16c32_fp16_512\", \"pbr_latent\": \"${DATA_ROOT}/SAGE/pbr_latents/tex_enc_next_dc_f16c32_fp16_512\"}}" \
--val_data_dir "{\"SAGE\": {\"base\": \"${DATA_ROOT}/SAGE_val\", \"shape_latent\": \"${DATA_ROOT}/SAGE_val/shape_latents/shape_enc_next_dc_f16c32_fp16_512\", \"pbr_latent\": \"${DATA_ROOT}/SAGE_val/pbr_latents/tex_enc_next_dc_f16c32_fp16_512\"}}"
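The `--data_dir` and `--val_data_dir` values above are JSON objects with hand-escaped quotes, which are easy to get wrong. As a sketch, they can be generated instead of typed. This assumes `train.py` parses these arguments as JSON, as the commands above suggest; the helper `dataset_arg` and the `/data` root are illustrative placeholders.

```python
import json
import shlex


def dataset_arg(name, base, latents):
    """Build the JSON value for --data_dir / --val_data_dir.

    latents maps a latent key (e.g. "ss_latent") to its subdirectory below base.
    """
    entry = {"base": base}
    for key, subdir in latents.items():
        entry[key] = f"{base}/{subdir}"
    return json.dumps({name: entry})


data_root = "/data"  # placeholder for ${DATA_ROOT}
train_arg = dataset_arg(
    "SAGE",
    f"{data_root}/SAGE",
    {"ss_latent": "ss_latents/ss_enc_conv3d_16l8_fp16_64"},
)

# shlex.quote wraps the JSON in single quotes so the shell passes it through unchanged.
print("--data_dir", shlex.quote(train_arg))
```

Additional latent keys such as `shape_latent` or `pbr_latent` go into the same `latents` mapping.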
Make sure you have downloaded the checkpoints or trained the models yourself.
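A quick pre-flight check can catch a missing or truncated download before a long reconstruction run starts. The sketch below assumes the three files were fetched into a local `checkpoints/` directory (a hypothetical location) under the file names used in the download commands; `missing_checkpoints` is an illustrative helper, not part of this repository.

```python
from pathlib import Path


def missing_checkpoints(paths):
    """Return the names of checkpoints that are absent or empty on disk."""
    missing = []
    for name, path in paths.items():
        p = Path(path)
        # An empty file usually indicates an interrupted download.
        if not p.is_file() or p.stat().st_size == 0:
            missing.append(name)
    return missing


ckpt_dir = Path("checkpoints")  # hypothetical download location
checkpoints = {
    "ss_ckpt": ckpt_dir / "sparse_structure.pt",
    "shape_ckpt": ckpt_dir / "shape_slat.pt",
    "tex_ckpt": ckpt_dir / "texture_slat.pt",
}

problems = missing_checkpoints(checkpoints)
if problems:
    print("Missing or empty checkpoints:", ", ".join(problems))
else:
    print("All checkpoints found.")
```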
python reconstruct_scene.py \
--mode Scannet_colmap \
--path "${PATH_TO_SCANNETPP_SCENE}" \
--output_path "${OUT_DIR}" \
--ss_ckpt "${SS_CKPT}" \
--shape_ckpt "${SHAPE_CKPT}" \
--tex_ckpt "${TEX_CKPT}" \
--num_imgs_per_scene 32
To reconstruct from ScanNet++ iPhone captures instead, pass --mode Scannet_iphone.
First, you need to compute the camera parameters. We recommend COLMAP.
python reconstruct_scene.py \
--mode Iphone \
--path "${WORK_ROOT}" \
--output_path "${OUT_DIR}" \
--ss_ckpt "${SS_CKPT}" \
--shape_ckpt "${SHAPE_CKPT}" \
--tex_ckpt "${TEX_CKPT}" \
--num_imgs_per_scene 999 \
--chunk_size_factor 1.08 \
--stat_std_ratio 3.0 \
--radius_nb_points 7 \
--radius_m 0.2 \
--pipeline_config configs/pipelines/texture.json \
--proj_batch_voxels 2048
Bake the reconstructed scene into a single textured scene.glb. This reads the
to_glb_inputs.pt and chunk_inputs.pt written by reconstruct_scene.py into
--output_path, and writes scene.glb to the same directory.
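Reconstruction and baking can also be chained from Python. The following is a sketch that only assembles the command lines documented in this README and runs them in sequence; the function names are illustrative, and it assumes it is run from the repository root inside the activated environment.

```python
import subprocess
import sys


def reconstruct_cmd(mode, scene_path, out_dir, ss_ckpt, shape_ckpt, tex_ckpt,
                    num_imgs=32):
    """Command line for reconstruct_scene.py, using only flags shown in this README."""
    return [
        sys.executable, "reconstruct_scene.py",
        "--mode", mode,
        "--path", scene_path,
        "--output_path", out_dir,
        "--ss_ckpt", ss_ckpt,
        "--shape_ckpt", shape_ckpt,
        "--tex_ckpt", tex_ckpt,
        "--num_imgs_per_scene", str(num_imgs),
    ]


def to_glb_cmd(out_dir):
    """Command line for chunked_to_glb.py, reading the files written by the first step."""
    return [
        sys.executable, "chunked_to_glb.py",
        "--inputs", f"{out_dir}/to_glb_inputs.pt",
        "--chunk_inputs", f"{out_dir}/chunk_inputs.pt",
        "--output_dir", out_dir,
    ]


def run_pipeline(mode, scene_path, out_dir, ss_ckpt, shape_ckpt, tex_ckpt):
    """Run both steps; check=True aborts before baking if reconstruction fails."""
    subprocess.run(
        reconstruct_cmd(mode, scene_path, out_dir, ss_ckpt, shape_ckpt, tex_ckpt),
        check=True,
    )
    subprocess.run(to_glb_cmd(out_dir), check=True)
```

Calling `run_pipeline("Scannet_colmap", scene, out_dir, ss, shape, tex)` mirrors the ScanNet++ example above, followed by the GLB export.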
python chunked_to_glb.py \
--inputs "${OUT_DIR}/to_glb_inputs.pt" \
--chunk_inputs "${OUT_DIR}/chunk_inputs.pt" \
--output_dir "${OUT_DIR}"
This work would not have been possible without the following open-source projects, and we thank their authors and contributors.
This model and code are released under the MIT License.
Please note that certain dependencies operate under separate license terms:
- nvdiffrast: Utilized for rendering generated 3D assets. This package is governed by its own License.
- nvdiffrec: Implements the split-sum renderer for PBR materials. This package is governed by its own License.
If you find GenRecon useful, please consider citing:
@article{schmid2026genreconbridginggenerativepriors,
author={Schmid, Katharina and von Lützow, Nicolas and Hladký, Jozef and Dai, Angela and Nießner, Matthias},
title={GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction},
year={2026},
eprint={2605.23888},
archivePrefix={arXiv}
}