SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis

Abstract

Text-based generation and editing of 3D scenes hold significant potential for streamlining content creation through intuitive user interactions. While recent advances leverage 3D Gaussian Splatting (3DGS) for high-fidelity and real-time rendering, existing methods are often specialized and task-focused, lacking a unified framework for both generation and editing. In this paper, we introduce SplatFlow, a comprehensive framework that addresses this gap by enabling direct 3DGS generation and editing. SplatFlow comprises two main components: a multi-view rectified flow (RF) model and a Gaussian Splatting Decoder (GSDecoder). The multi-view RF model operates in latent space, generating multi-view images, depths, and camera poses simultaneously, conditioned on text prompts—thus addressing challenges like diverse scene scales and complex camera trajectories in real-world settings. Then, the GSDecoder efficiently translates these latent outputs into 3DGS representations through a feed-forward 3DGS method. Leveraging training-free inversion and inpainting techniques, SplatFlow enables seamless 3DGS editing and supports a broad range of 3D tasks—including object editing, novel view synthesis, and camera pose estimation—within a unified framework without requiring additional complex pipelines. We validate SplatFlow's capabilities on the MVImgNet and DL3DV-7K datasets, demonstrating its versatility and effectiveness in various 3D generation, editing, and inpainting-based tasks.

Hyojun Go¹*

Byeongjun Park²*

Jiho Jang¹

Jin-Young Kim¹

Soonwoo Kwon¹

Changick Kim²†

¹Twelvelabs

²KAIST

CVPR 2025

(* : Equal Contribution, † : Corresponding Author)

TL;DR: SplatFlow is a unified framework that combines a latent-space multi-view generator and a Gaussian Splatting Decoder to enable efficient 3D generation, editing, and inpainting directly from text prompts.

Overview of SplatFlow

SplatFlow framework: The RF model generates multi-view latents (images, depths, and Plücker ray coordinates) from text prompts. Camera poses are obtained by optimization over the generated rays, and the GSDecoder converts the latents into pixel-aligned 3D Gaussian splats.
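To make the Plücker ray representation mentioned above concrete, the following is a minimal sketch of how a per-pixel ray map can be computed from a pinhole camera. It is not taken from the SplatFlow code. The function names, the (direction, moment) channel ordering, the unit-length normalization, and the pixel-center offset are all assumptions made for illustration; the page does not specify these conventions.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pixel_to_plucker(u, v, fx, fy, cx, cy, R_c2w, origin):
    """Return (direction, moment) of the ray through pixel (u, v).

    Hypothetical helper: R_c2w is a 3x3 camera-to-world rotation given as
    rows, and origin is the camera center in world coordinates.
    """
    # Back-project the pixel to a direction in the camera frame.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate into the world frame and normalize to unit length.
    d = tuple(dot(row, d_cam) for row in R_c2w)
    norm = math.sqrt(dot(d, d))
    d = tuple(x / norm for x in d)
    # Plücker moment: camera center crossed with the ray direction.
    m = cross(origin, d)
    return d, m


def plucker_map(height, width, fx, fy, cx, cy, R_c2w, origin):
    """Build a height x width map of 6-tuples (d, m), one per pixel center."""
    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            # Assumed convention: sample rays at pixel centers (+0.5).
            d, m = pixel_to_plucker(x + 0.5, y + 0.5,
                                    fx, fy, cx, cy, R_c2w, origin)
            row.append(d + m)
        rows.append(row)
    return rows


if __name__ == "__main__":
    identity = ((1, 0, 0), (0, 1, 0), (0, 0, 1))
    d, m = pixel_to_plucker(2.0, 1.0, 100.0, 100.0, 2.0, 1.0,
                            identity, (1, 2, 3))
    print(d, m)
```

Because the moment is defined as the camera center crossed with the direction, every ray from the same camera satisfies d · m = 0, and the six channels together encode both where the camera is and where each pixel looks. That is what allows a pose to be fitted to a generated ray map afterwards.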

3D Generation

Simultaneous Generation of Camera Pose and 3DGS

3DGS Editing

Camera Pose Estimation

Novel View Synthesis

Citation
