Stitched Value Model for Diffusion Alignment

Hyojun Go¹, Hyungjin Chung, Prune Truong², Goutam Bhat²,

Li Mi¹, Zhaochong An³, Zixiang Zhao¹, Dominik Narnhofer¹,

Serge Belongie³, Federico Tombari², Konrad Schindler¹

¹ETH Zürich · ²Google · ³University of Copenhagen


TL;DR

Make any clean-image reward model work on noisy diffusion latents.

Same reward model — now scoring noisy diffusion latents directly, at the quality you expect on clean images. The first practical recipe.

② What we do

StitchVM takes a pretrained pixel reward model and, at small compute cost, lets the same model score the noisy intermediate latents — not just the final clean image.

③ The impact

One fix: replace the costly step of denoising a noisy latent just to compute a pixel reward with a direct StitchVM score. Four wins follow ↓

DPS

3.2× faster

higher quality · ↓50% GPU memory

FK steering

33% lower cost

via a more efficient scaling axis

DRaFT

+30% GenEval

& 22–26% fewer GPU-hours

DiffusionNFT

55%+ GPU-h saved

2.3× faster training

① THE QUESTION

Q: Is this noisy latent on its way to a good reward?

Most diffusion alignment methods come down to answering this question along the denoising trajectory.

Diffusion sampling will turn the noisy latent $\mathbf z_t$ into a final generation $\mathbf x_0$. Every alignment method has to answer this from a glance at $\mathbf z_t$ alone: how valuable will the final $\mathbf x_0$ be?

Formally, this is the value function $V_t(\mathbf z_t)$, the expected reward of the clean image that $\mathbf z_t$ will eventually denoise to:

$$V_t(\mathbf z_t) \;=\; \mathbb E\!\left[\,r(\mathbf z_0)\,\big|\,\mathbf z_t\right].$$
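To make the conditional expectation concrete, here is a minimal sketch on a one-dimensional toy problem where the posterior over the clean sample is known in closed form. Everything in it is an assumption made for illustration and not taken from the paper: the prior $z_0 \sim \mathcal N(0, 1)$, the interpolation $z_t = (1-t)\,z_0 + t\,\epsilon$, and the quadratic `reward`. It compares a Monte Carlo average of the reward over posterior samples with the exact value.

```python
import random


def posterior_params(z_t, t):
    """Mean and variance of z_0 given z_t in the toy model
    z_0 ~ N(0, 1), z_t = (1 - t) * z_0 + t * eps, eps ~ N(0, 1)."""
    a, b = 1.0 - t, t
    denom = a * a + b * b
    return a * z_t / denom, b * b / denom


def reward(z_0):
    # Hypothetical reward: prefers clean samples near z_0 = 1.
    return -(z_0 - 1.0) ** 2


def value_closed_form(z_t, t):
    # For a Gaussian posterior, E[-(z_0 - 1)^2 | z_t] = -((mean - 1)^2 + var).
    mean, var = posterior_params(z_t, t)
    return -((mean - 1.0) ** 2 + var)


def value_monte_carlo(z_t, t, n_samples=100_000, seed=0):
    # Average the reward over samples of z_0 drawn from p(z_0 | z_t).
    rng = random.Random(seed)
    mean, var = posterior_params(z_t, t)
    std = var ** 0.5
    total = sum(reward(rng.gauss(mean, std)) for _ in range(n_samples))
    return total / n_samples
```

Calling `value_monte_carlo(0.6, 0.5)` and `value_closed_form(0.6, 0.5)` should give nearly the same number. For real images no such closed form exists, and drawing posterior samples means running the sampler, which is why a cheap learned $V_t$ is attractive.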

Across both inference-time and training-time alignment, evaluating $V_t$ is the common need — three examples:

Inference — gradient guidance e.g. DPS

$$u^{r}_t(\mathbf z_t) \;=\; u_t(\mathbf z_t) \;+\; c_t\,\nabla_{\mathbf z_t} V_t(\mathbf z_t).$$

The current velocity $u_t$ at each step is nudged by the value gradient, giving a corrected velocity $u^r_t$ used for sampling.
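The update above can be sketched in a few lines. This is a toy under stated assumptions, not the paper's implementation: `base_velocity` stands in for the pretrained model, `value` is a hypothetical quadratic value model, the gradient uses finite differences where a real system would use autodiff, and the sampler is a plain forward Euler loop with a constant guidance scale `c_t`.

```python
def value(z):
    # Hypothetical smooth value model: highest when every coordinate is 2.0.
    return -sum((zi - 2.0) ** 2 for zi in z)


def grad(fn, z, eps=1e-5):
    # Central finite differences; a real system would use autodiff instead.
    g = []
    for i in range(len(z)):
        hi = list(z)
        hi[i] += eps
        lo = list(z)
        lo[i] -= eps
        g.append((fn(hi) - fn(lo)) / (2 * eps))
    return g


def base_velocity(z, t):
    # Stand-in for the pretrained model's velocity u_t(z_t).
    return [-zi for zi in z]


def guided_velocity(z, t, c_t):
    # u^r_t = u_t + c_t * grad V_t, matching the formula above.
    u = base_velocity(z, t)
    g = grad(value, z)
    return [ui + c_t * gi for ui, gi in zip(u, g)]


def sample(z, n_steps=50, c_t=0.0):
    # Forward Euler integration of the (optionally guided) velocity field.
    dt = 1.0 / n_steps
    for k in range(n_steps):
        v = guided_velocity(z, k * dt, c_t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z
```

With `c_t=0.0` the loop follows the base velocity only; with a positive `c_t` the endpoint scores higher under `value`. The cost of each guided step is one gradient of the value model, so how cheaply $V_t$ can be evaluated and differentiated sets the cost of guidance.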

Inference — particle sampling e.g. FK steering

$$G(\mathbf z_{t_k}, \bar{\mathbf z}_{t_{k-1}}) \;=\; f\!\left(V_t(\bar{\mathbf z}_{t_{k-1}}),\; V_t(\mathbf z_{t_k})\right).$$

At each step, each particle draws a proposal $\bar{\mathbf z}_{t_{k-1}}$ and is reweighted by a potential $G$ — some function $f$ of $V_t$ at the current and proposed states (specific methods choose different $f$). Particles are then resampled in proportion to $G$.
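A minimal sketch of one such reweight-and-resample step follows. The exponential-of-difference form of `potential` is only one possible choice of $f$, picked here as an assumption; `propose` and `value` are hypothetical callables supplied by the caller, and resampling is plain multinomial sampling.

```python
import math
import random


def potential(v_current, v_proposed, lam=5.0):
    # One possible f: favour proposals that increase the value estimate.
    return math.exp(lam * (v_proposed - v_current))


def fk_step(particles, propose, value, rng, lam=5.0):
    """Propose one move per particle, weight it by G, then resample."""
    proposals = [propose(z, rng) for z in particles]
    weights = [
        potential(value(z), value(zb), lam)
        for z, zb in zip(particles, proposals)
    ]
    # Multinomial resampling in proportion to the potentials.
    return rng.choices(proposals, weights=weights, k=len(particles))
```

A call might look like `fk_step([0.0, 1.0, 2.0], lambda z, r: z + r.gauss(0, 0.1), lambda z: -abs(z - 2.0), random.Random(0))`. Note that `value` is called twice per particle per step, so the per-call cost of the value estimate multiplies through the whole run.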

Training — reward-based finetuning e.g. DRaFT, DiffusionNFT

$$\arg\max_{\theta}\; \mathbb E_{\mathbf z_0 \sim p_\theta}\!\left[r(\mathbf z_0)\right] \;-\; D_{\mathrm{KL}}(p_\theta \,\|\, p).$$

Each training iteration evaluates $V_t(\mathbf z_\tau)$ at an intermediate noisy $\mathbf z_\tau$ in place of the terminal reward $r(\mathbf z_0)$ — so the rollout can stop early instead of denoising all the way to a clean image.
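The saving from stopping early can be illustrated with a toy deterministic sampler. All names here are hypothetical, and cost is assumed to be dominated by sampler calls. Because the toy sampler is deterministic, the value of an intermediate state has an exact closed form (`oracle_value`); in practice a learned value model would take its place.

```python
def rollout(z, denoise_step, n_steps, stop_after=None):
    """Run `stop_after` of `n_steps` sampler steps (all of them by default).

    Returns the state reached and the number of sampler calls spent.
    """
    steps = n_steps if stop_after is None else stop_after
    calls = 0
    for _ in range(steps):
        z = denoise_step(z, n_steps)
        calls += 1
    return z, calls


def denoise_step(z, n_steps):
    # Hypothetical deterministic sampler step (Euler step of dz/dt = -z).
    return z * (1.0 - 1.0 / n_steps)


def terminal_reward(z_0):
    # Hypothetical reward on the fully denoised sample.
    return -(z_0 - 0.3) ** 2


def oracle_value(z, steps_done, n_steps):
    # For a deterministic sampler, the value of an intermediate state is the
    # reward of the endpoint it will reach after the remaining steps.
    shrink = (1.0 - 1.0 / n_steps) ** (n_steps - steps_done)
    return terminal_reward(z * shrink)


T, tau = 40, 10
z_full, full_calls = rollout(1.0, denoise_step, T)
z_early, early_calls = rollout(1.0, denoise_step, T, stop_after=tau)
print("full rollout :", terminal_reward(z_full), "calls:", full_calls)
print("early stop   :", oracle_value(z_early, tau, T), "calls:", early_calls)
```

Both lines report the same training signal, but the early-stopped rollout spends `tau` sampler calls instead of `T`. With a learned value model the two signals agree only approximately, so the saving is bounded by how accurate that model is on noisy states.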

② THE PROBLEM

Matching pixel reward quality on noisy latents has been impractical

So methods approximate $V_t$ instead — each approximation with cost and estimation problems.

Learning $V_t$ directly would be the cleanest fix — no bias, no variance, no extra evaluations. But matching a foundation-scale pixel reward model (CLIP, HPSv2, Aesthetic Predictor) on noisy latents has meant training at their scale, redone for every new backbone or reward. So practitioners approximate $V_t$ instead. Two workarounds dominate — each with its own price:

Practitioners settle for the red rows above because matching pixel-reward quality on noisy latents has meant retraining at foundation scale. StitchVM avoids that: it transfers, instead of retraining.

③ THE UNLOCK

🧑‍🤝‍🧑 Pixel-reward quality.
On noisy latents.
Via model stitching.

At the cost of a short finetune — not retraining anything from scratch.

Why this works: at the right depth, the internal features of a pretrained diffusion model are almost linearly compatible with those of a pretrained pixel reward model. So we keep the head of the diffusion model frozen (already native to noisy latents), keep the tail of the reward model (it already produces the score), and join them with a small stitching layer — initialised by a closed-form linear fit, then briefly finetuned together with the reward tail on unlabeled images. The stitched model inherits the reward model's predictive skill, but now operates directly on noisy latents.

The result is a noisy-latent value model at pixel-reward quality, built in hours on a single GPU — no training from scratch, just a transfer.
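The closed-form initialisation of the stitching layer can be sketched as an ordinary least-squares fit between two sets of features. This is an illustration under assumptions, not the paper's code. The features are synthetic lists of floats: in the setting above, `src` would presumably hold diffusion-head activations on noised latents and `dst` the reward model's activations at the stitch point for the matching images. The map has no bias term, and a tiny ridge term is added for numerical safety. The code uses only the standard library so it runs anywhere; a linear algebra library would do the same in one call.

```python
import random


def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [x / p for x in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [row[n:] for row in M]


def fit_stitch(src, dst, ridge=1e-8):
    """Least-squares W with src @ W ~= dst, via (S^T S + ridge I) W = S^T D."""
    d_in, d_out = len(src[0]), len(dst[0])
    gram = [
        [sum(s[i] * s[j] for s in src) + (ridge if i == j else 0.0)
         for j in range(d_in)]
        for i in range(d_in)
    ]
    cross = [
        [sum(s[i] * d[j] for s, d in zip(src, dst)) for j in range(d_out)]
        for i in range(d_in)
    ]
    return solve(gram, cross)


def apply_stitch(x, W):
    # Map one source feature vector into the target feature space.
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]
```

When `dst` really is a linear function of `src`, `fit_stitch` recovers that map almost exactly. How far real features depart from this ideal is what the short finetune afterwards has to absorb.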

④ THE REACH · WHY IT MATTERS

One $V_t$. Every alignment recipe.

Build $V_t$ once. Drop it into every alignment recipe — inference-time guidance, particle sampling, training-time finetuning — and watch each one get cheaper at the same time.

INFERENCE-TIME ALIGNMENT

Cheaper. Sharper. Less biased.

DPS · gradient guidance

FK steering · particle sampling

TRAINING-TIME ALIGNMENT

Supervised at every noise level.

DRaFT · AlignProp · DiffusionNFT: direct reward finetuning and RL post-training

⑤ THE WINS · RESULTS

Drop-in for every recipe.
Quality up, cost down.

The same StitchVM — one noisy-latent $V_t$, built once by stitching — plugged into four very different alignment methods. Across DPS, FK steering, DRaFT, and DiffusionNFT it delivers better quality with materially lower compute.

DPS · gradient guidance

3.2× faster

higher quality on nearly every reward×metric pair · ↓50% peak GPU memory

FK steering · particle sampling

33% lower cost

via a more efficient scaling axis — $(N{=}8, M{=}6)$ matches standard FKS at $N{=}14$

DRaFT · direct reward finetuning

+30% GenEval

higher score across all metrics & 22–26% fewer GPU-hours

DiffusionNFT · RL post-training

55%+ GPU-h saved

2.3× faster training, higher scores on every metric

RESULT 1 · NOISY-LATENT VALUE MODEL

StitchVM retains the reward model's capability on noisy latents.

Across three diffusion backbones (SD 3.5 Medium/Large, FLUX) and four reward models (OpenAI CLIP, DFN-CLIP, HPSv2, Aesthetic Predictor), StitchVM closely tracks the original clean reward model at low noise and degrades gracefully as noise rises — substantially beating both VAE-stitching and NoisyCLIP retraining at LAION-400M scale. The expensive "train a noisy-latent reward from scratch" baseline is no longer needed.

[Figure: three panels plotted against noise level σ. (1) Zero-shot image-text retrieval (Avg. Recall@1) on MSCOCO and Flickr30K. (2) Preference accuracy on the HPDv2 and ImageReward benchmarks. (3) Aesthetic SRCC on the AVA test split. Legend: clean reward, NoisyCLIP, VAE-stitching baselines, and StitchVM variants.]

Results of StitchVM on latents with different noise levels. $\oplus$ denotes stitching of a reward model with a pretrained diffusion module (VAE encoder or DiT). StitchVM (blue) tracks the clean reward (dashed line) far better than VAE-stitching or NoisyCLIP, across retrieval, preference, and aesthetic benchmarks.

RESULT 2 · A NEW SCALING AXIS FOR FK STEERING

Scaling along $M$ beats scaling along $N$.

Cheap $V_t$ scoring unlocks a second scaling axis on FK steering: instead of adding more particles ($N$), each particle spawns $M$ proposals scored by StitchVM at near-zero marginal cost (partial-DiT inference + stitching head, shared with the next denoising step). On FLUX with HPSv2 reward, FK steering + StitchVM at $(N{=}8, M{=}6)$ matches standard FKS at $N{=}14$ with 33% less compute, and is strictly above the $N$-only curve everywhere.

[Figure: HPSv2 score vs. GPU-hours on FLUX, comparing $N$-scaling (FKS) against $(N, M)$-scaling (FKS + StitchVM).]

HPSv2 score vs. GPU-hours on FLUX. Blue: standard FK steering scaling $N$. Red/dark-red: FK steering + StitchVM scaling $M$ at fixed $N$. The StitchVM curves dominate the standard one across the whole compute range.
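One way to picture the second scaling axis is the sketch below. The text here does not spell out how the $M$ proposals per particle are combined, so the pooling rule (score all $N \times M$ candidates, then resample $N$ survivors in proportion to an exponential potential) is an assumption made for illustration, as are the `propose` and `value` callables.

```python
import math
import random


def fk_step_multi(particles, propose, value, rng, n_proposals, lam=5.0):
    """One resampling step where each of N particles spawns M proposals.

    Returns the N survivors and the number of candidates that were scored.
    """
    pool, weights = [], []
    for z in particles:
        v_cur = value(z)
        for _ in range(n_proposals):
            zb = propose(z, rng)
            pool.append(zb)
            # Hypothetical potential: exponential of the value improvement.
            weights.append(math.exp(lam * (value(zb) - v_cur)))
    # Keep N survivors out of the N * M scored candidates.
    survivors = rng.choices(pool, weights=weights, k=len(particles))
    return survivors, len(pool)
```

The number of value-model calls grows with $N \times M$ while the number of particles carried through the sampler stays at $N$, so this axis only pays off when scoring a candidate is much cheaper than carrying an extra particle.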

RESULT 3 · TRAINING-TIME ALIGNMENT

Same target at $\tau/T$ of the rollout cost, with better quality on top.

On SD 3.5 Medium at $512{\times}512$ with DFN-CLIP and HPSv2 as training rewards, replacing the terminal reward $r(\mathbf z_0)$ with the StitchVM value $V_t(\mathbf z_\tau)$ (stopping the rollout early at an intermediate $\tau$) lowers training cost in GPU-hours and raises every reported metric:

DRaFT: 22–26% fewer GPU-hours, plus higher quality — StitchVM provides direct $V_t$ supervision at intermediate noisy latents, including high-noise regions where standard DRaFT had to truncate backprop.
DiffusionNFT: 55%+ fewer GPU-hours while reaching higher GenEval / ImageReward / PickScore / HPSv2 / DFN-CLIP scores.

[Figure: GenEval vs. GPU-hours for DRaFT-1 and DRaFT-3 (direct reward finetuning), with and without StitchVM.]

[Figure: GenEval vs. GPU-hours for DiffusionNFT (RL post-training) vs. DiffusionNFT + StitchVM.]

GenEval vs. GPU-hours on SD 3.5 Medium with DFN-CLIP + HPSv2 reward. Red: with StitchVM. Blue: baseline. Adding StitchVM matches the baseline at roughly half the GPU-hours and keeps climbing past it.

Each StitchVM itself is a one-time, lightweight build: $\approx$ 10 GPU-hours at $512{\times}512$ (24–32 GPU-h at $1024{\times}1024$) on GH200 — trivial relative to the savings it enables downstream.

Citation

@article{go2026stitchvm,
    title   = {Stitched Value Model for Diffusion Alignment},
    author  = {Go, Hyojun and Chung, Hyungjin and Truong, Prune and Bhat, Goutam
               and Mi, Li and An, Zhaochong and Zhao, Zixiang and Narnhofer, Dominik
               and Belongie, Serge and Tombari, Federico and Schindler, Konrad},
    journal = {arXiv preprint arXiv:2605.19804},
    year    = {2026}
}