VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator

ETH Zurich, Google

Rendered videos of 3DGS scenes generated by our VIST3A method (Wan + AnySplat).

TL;DR: VIST3A unifies a video generator and a multi-view 3D reconstruction model into a single latent diffusion model that generates 3D representations directly from text.

Summary of Our Method

Figure: Overview of the VIST3A method.

Core Innovation

We unify a video generator and a multi-view 3D reconstruction model into a single latent diffusion framework. To achieve this, we stitch a pretrained 3D foundation model into the video VAE latent space, reusing strong 3D priors for efficient adaptation. We then apply direct reward finetuning to align the generated latents with 3D quality and geometric consistency—without any 3D labels.

  • Plug-and-play model stitching: We search the pretrained 3D model for a compatible layer and connect it to the video latent space via a lightweight linear mapping (see the sketch after this list). Because stitching provides a good initialization, only light finetuning, without any additional data, is needed to recover the original 3D model's performance.
  • Direct 3D alignment: We strengthen the connection between the video generator and the stitched 3D model through reward-based tuning, ensuring that decoded 3D representations are both high-quality and geometrically consistent.
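
To make the layer-search step concrete, here is a minimal PyTorch sketch (not the released VIST3A code) of one way to score candidate layers: for each layer of the 3D model, fit a linear map from frozen video-VAE latents to that layer's activations and keep the layer with the lowest stitching error. `video_vae.encode` and `recon_model.activations_at` are assumed placeholder interfaces, not functions from the paper.

```python
# Minimal sketch of the layer-search step for model stitching (illustrative only).
# Assumes hypothetical helpers: video_vae.encode(frames) -> latents [B, ...] and
# recon_model.activations_at(layer, frames) -> intermediate features [B, ...].
import torch

@torch.no_grad()
def stitching_error(latents, feats):
    """Fit a linear map latents -> feats by least squares and report the fit error."""
    X = latents.flatten(1)                                   # [B, Dx]
    Y = feats.flatten(1)                                     # [B, Dy]
    ones = torch.ones(X.size(0), 1, device=X.device, dtype=X.dtype)
    X1 = torch.cat([X, ones], dim=1)                         # add a bias column
    W = torch.linalg.lstsq(X1, Y).solution                   # closed-form linear stitch
    err = torch.nn.functional.mse_loss(X1 @ W, Y).item()
    return err, W

def search_stitch_layer(video_vae, recon_model, frames, candidate_layers):
    """Pick the 3D-model layer whose features are best predicted from video latents."""
    latents = video_vae.encode(frames)                       # frozen video VAE latents
    best = None
    for layer in candidate_layers:
        feats = recon_model.activations_at(layer, frames)
        err, W = stitching_error(latents, feats)
        if best is None or err < best[0]:
            best = (err, layer, W)                           # keep the most compatible layer
    return best                                              # (error, layer, linear stitch weights)
```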

Why This Matters

  • Efficient integration of evolving 3D models: Thanks to model stitching, integrating new 3D foundation models no longer requires large-scale finetuning or additional data collection. In contrast, retraining decoders from scratch makes it difficult to keep pace with the rapid progress of 3D model development.
  • Robust decoding under latent noise: One could instead feed VAE-decoded images into a 3D foundation model, but we show that integrating the 3D model directly into the latent space yields significantly more stable and robust decoding under latent noise. This is exactly the regime encountered during generation, where noisy or perturbed latents are common, and where the stitched decoder remains far more effective and resilient.
  • Reward tuning for robust generation: Since the 3D reconstruction network is stitched into the latent space, reward tuning becomes more efficient and stable. Through this process, the generator learns to produce decodable and 3D-consistent latents, resulting in a highly reliable latent diffusion model.
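
As a rough illustration of this reward step, the sketch below shows one way direct reward finetuning can be wired up once the 3D head lives in the latent space: the generator's latents are decoded by the frozen stitched head, and a differentiable reward on the resulting 3D representation is backpropagated into the generator. `generator`, `stitched_3d_head`, and `reward_fn` are hypothetical stand-ins, not the paper's actual modules.

```python
# Minimal sketch of direct reward finetuning on the stitched latent space (illustrative only).
# The generator maps (text, noise) to video latents, the stitched head decodes latents into a
# 3D representation, and reward_fn returns a differentiable scalar scoring render quality and
# geometric consistency. All three interfaces are assumed, not taken from the paper's code.
import torch

def reward_finetune_step(generator, stitched_3d_head, reward_fn, prompts, optimizer):
    noise = torch.randn(len(prompts), *generator.latent_shape, device=generator.device)
    latents = generator(prompts, noise)      # keep the graph so gradients reach the generator
    scene = stitched_3d_head(latents)        # frozen stitched decoder, still differentiable
    loss = -reward_fn(scene).mean()          # maximize reward = minimize its negative
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return -loss.item()                      # mean reward for logging
```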

Interactive Viewer for Generated 3DGS (Compressed)

Below are 3DGS results generated with our VIST3A method (Wan + AnySplat). Click on the image to interact with it.


Comparison with Baselines on 3DGS Generation

Interactive Viewer for Generated Pointmap

Below are point maps generated by our method (Wan + VGGT). Click to interact.


BibTeX