VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator

ETH Zurich, Google

Rendered videos of 3DGS scenes generated by our VIST3A method (Wan + AnySplat).

TL;DR: VIST3A unifies a video generator and a multi-view 3D reconstruction model into a single latent diffusion model that generates 3D representations directly from text.

Summary of Our Method

Figure: Overview of the VIST3A method.

Core Innovation

We unify a video generator and a multi-view 3D reconstruction model into a single latent diffusion framework. To achieve this, we stitch a pretrained 3D foundation model into the video VAE latent space, reusing strong 3D priors for efficient adaptation. We then apply direct reward finetuning to align the generated latents with 3D quality and geometric consistency—without any 3D labels.

  • Plug-and-play model stitching: We search the pretrained 3D model for a compatible layer and connect it to the video latent space via a lightweight linear mapping (see the sketch after this list). Because stitching provides a good initialization, only light finetuning, without any additional data, is needed to recover the original 3D model's performance.
  • Direct 3D alignment: We strengthen the connection between the video generator and the stitched 3D model through reward-based tuning, ensuring that decoded 3D representations are both high-quality and geometrically consistent.
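
To make the layer-search step concrete, here is a minimal PyTorch sketch (not the released VIST3A code) of one way to score candidate layers: for each layer of the 3D model, fit a linear map from frozen video-VAE latents to that layer's activations and keep the layer with the lowest stitching error. `video_vae.encode` and `recon_model.activations_at` are assumed placeholder interfaces, not functions from the paper.

```python
# Minimal sketch of the layer-search step for model stitching (illustrative only).
# Assumes hypothetical helpers: video_vae.encode(frames) -> latents [B, ...] and
# recon_model.activations_at(layer, frames) -> intermediate features [B, ...].
import torch

@torch.no_grad()
def stitching_error(latents, feats):
    """Fit a linear map latents -> feats by least squares and report the fit error."""
    X = latents.flatten(1)                                   # [B, Dx]
    Y = feats.flatten(1)                                     # [B, Dy]
    ones = torch.ones(X.size(0), 1, device=X.device, dtype=X.dtype)
    X1 = torch.cat([X, ones], dim=1)                         # add a bias column
    W = torch.linalg.lstsq(X1, Y).solution                   # closed-form linear stitch
    err = torch.nn.functional.mse_loss(X1 @ W, Y).item()
    return err, W

def search_stitch_layer(video_vae, recon_model, frames, candidate_layers):
    """Pick the 3D-model layer whose features are best predicted from video latents."""
    latents = video_vae.encode(frames)                       # frozen video VAE latents
    best = None
    for layer in candidate_layers:
        feats = recon_model.activations_at(layer, frames)
        err, W = stitching_error(latents, feats)
        if best is None or err < best[0]:
            best = (err, layer, W)                           # keep the most compatible layer
    return best                                              # (error, layer, linear stitch weights)
```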

Why This Matters

  • Efficient integration of evolving 3D models: Thanks to model stitching, integrating new 3D foundation models no longer requires large-scale finetuning or additional data collection. In contrast, retraining decoders from scratch makes it difficult to keep pace with the rapid progress of 3D model development.
  • Robust decoding under latent noise: One could instead feed VAE-decoded images into a 3D foundation model, but we show that integrating the 3D model directly into the latent space yields significantly more stable and robust decoding under latent noise. This is exactly the regime encountered during generation, where noisy or perturbed latents are common, and where the stitched decoder remains far more effective and resilient.
  • Reward tuning for robust generation: Since the 3D reconstruction network is stitched into the latent space, reward tuning becomes more efficient and stable. Through this process, the generator learns to produce decodable and 3D-consistent latents, resulting in a highly reliable latent diffusion model.
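
As a rough illustration of this reward step, the sketch below shows one way direct reward finetuning can be wired up once the 3D head lives in the latent space: the generator's latents are decoded by the frozen stitched head, and a differentiable reward on the resulting 3D representation is backpropagated into the generator. `generator`, `stitched_3d_head`, and `reward_fn` are hypothetical stand-ins, not the paper's actual modules.

```python
# Minimal sketch of direct reward finetuning on the stitched latent space (illustrative only).
# The generator maps (text, noise) to video latents, the stitched head decodes latents into a
# 3D representation, and reward_fn returns a differentiable scalar scoring render quality and
# geometric consistency. All three interfaces are assumed, not taken from the paper's code.
import torch

def reward_finetune_step(generator, stitched_3d_head, reward_fn, prompts, optimizer):
    noise = torch.randn(len(prompts), *generator.latent_shape, device=generator.device)
    latents = generator(prompts, noise)      # keep the graph so gradients reach the generator
    scene = stitched_3d_head(latents)        # frozen stitched decoder, still differentiable
    loss = -reward_fn(scene).mean()          # maximize reward = minimize its negative
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return -loss.item()                      # mean reward for logging
```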

Interactive Viewer for Generated 3DGS (Compressed)

Below are 3DGS results generated with our VIST3A method (Wan + AnySplat). Click on the image to interact with it.


Comparison with Baselines on 3DGS Generation

Interactive Viewer for Generated Pointmap

Below are point maps generated by our method (Wan + VGGT). Click to interact.


BibTeX