Summary of Our Method
Core Innovation
We unify a video generator and a multi-view 3D reconstruction model into a single latent diffusion framework. To achieve this, we stitch a pretrained 3D foundation model into the video VAE latent space, reusing strong 3D priors for efficient adaptation. We then apply direct reward finetuning to align the generated latents with 3D quality and geometric consistency—without any 3D labels.
- Plug-and-play model stitching: We search for a compatible layer in the 3D model and connect it to the video latent via a lightweight linear mapping. Thanks to good initialization from stitching, only light finetuning—without any additional data—is needed to recover the original 3D model performance.
- Direct 3D alignment: We strengthen the connection between the video generator and the stitched 3D model through reward-based tuning, ensuring that decoded 3D representations are both high-quality and geometrically consistent.
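The layer search and linear stitching described above can be sketched as follows. This is a toy illustration under stated assumptions: the layer names, dimensions, and the least-squares compatibility score are stand-ins, not the actual implementation, which operates on real video-VAE latents and a pretrained 3D network.

```python
import numpy as np

# Toy stand-ins (assumptions): a "video latent" of dim 16 and candidate
# activations from three layers of a frozen 3D reconstruction model.
rng = np.random.default_rng(0)
video_latent = rng.standard_normal((64, 16))  # 64 samples, latent dim 16
layer_acts = {
    "enc.block2": rng.standard_normal((64, 32)),
    "enc.block5": 0.8 * video_latent @ rng.standard_normal((16, 24)),  # linearly related
    "dec.block1": rng.standard_normal((64, 48)),
}

def fit_linear_map(x, y):
    """Least-squares linear map W such that x @ W approximates y."""
    w, *_ = np.linalg.lstsq(x, y, rcond=None)
    return w

def stitch_score(x, y):
    """Fraction of y's energy explained by a linear map from x (1 = perfect)."""
    resid = y - x @ fit_linear_map(x, y)
    return 1.0 - (resid ** 2).sum() / (y ** 2).sum()

# Search for the most "stitchable" layer: the one best predicted linearly
# from the video latent; its linear map becomes the stitching initialization.
scores = {name: stitch_score(video_latent, a) for name, a in layer_acts.items()}
best = max(scores, key=scores.get)
stitch_W = fit_linear_map(video_latent, layer_acts[best])
print(best)  # -> "enc.block5", the layer that is a linear function of the latent
```

With a good initialization like `stitch_W`, only light finetuning of the mapping (and optionally the 3D model) is needed, which is the efficiency argument made above.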
Why This Matters
- Efficient integration of evolving 3D models: Thanks to model stitching, integrating new 3D foundation models no longer requires large-scale finetuning or additional data collection. In contrast, retraining decoders from scratch makes it difficult to keep pace with the rapid progress of 3D model development.
- Robust decoding under latent noise: One could simply feed VAE-decoded images into a 3D foundation model, but we show that integrating the 3D model directly into the latent space yields significantly more stable decoding when latents are noisy or perturbed. Since such latents are the norm during generation, the stitched decoder remains far more effective and resilient in practice.
- Reward tuning for robust generation: Since the 3D reconstruction network is stitched into the latent space, reward tuning becomes more efficient and stable. Through this process, the generator learns to produce decodable and 3D-consistent latents, resulting in a highly reliable latent diffusion model.
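The reward-tuning idea can be illustrated with a minimal gradient-ascent sketch. Everything here is an assumption for illustration: the quadratic reward, the bias-vector "generator", and the dimensions are toy stand-ins for the real generator, stitched 3D head, and 3D-consistency reward.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.standard_normal(8)  # latents the (frozen) stitched 3D head decodes well

def reward(z):
    """Stand-in reward: high when the latent is easy to decode consistently."""
    return -np.sum((z - target) ** 2)

# The "generator" is just a bias vector here; direct reward finetuning
# nudges it so the latents it emits score well under the 3D head's reward.
gen_bias = np.zeros(8)
lr = 0.1
for _ in range(200):
    z = gen_bias                      # generated latent (deterministic toy)
    grad = -2.0 * (z - target)        # d(reward)/dz, backpropagated to the generator
    gen_bias = gen_bias + lr * grad   # gradient ascent on the reward

final_reward = reward(gen_bias)
```

Because the 3D head lives in the latent space, the reward gradient reaches the generator without decoding to pixels first, which is why the tuning is efficient and stable.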
Interactive Viewer for Generated 3DGS (Compressed)
Below are 3DGS results generated with our VIST3A method (Wan + AnySplat). Click on the image to interact with it.
Comparison with Baselines on 3DGS Generation
Interactive Viewer for Generated Pointmap
Below are point maps generated by our method (Wan + VGGT). Click to interact.
BibTeX
@inproceedings{go2026texttod,
  title={Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator},
  author={Hyojun Go and Dominik Narnhofer and Goutam Bhat and Prune Truong and Federico Tombari and Konrad Schindler},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=kI27Niy4xY}
}