Add AnyFlow algorithm (any-step video diffusion via flow maps) by Enderfga · Pull Request #25 · NVlabs/FastGen

Enderfga · 2026-05-15T17:13:20Z

Summary

Adds AnyFlow as a new method under fastgen/methods/distribution_matching/anyflow.py. AnyFlow trains a single model u_θ(x_t, t, r) that predicts the average velocity from t back to r, so the same checkpoint supports arbitrary inference NFE.
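
A toy sketch of why predicting the average velocity lets one checkpoint serve any NFE. This is not FastGen code: the scalar ODE dx/dt = x and the helper names `avg_velocity`, `sample`, and `uniform_steps` are assumptions, chosen only because the exact flow map is known in closed form for this ODE.

```python
import math


def avg_velocity(x, t, r):
    """Exact average velocity from t back to r for the toy ODE dx/dt = x."""
    if t == r:
        return x  # the limit r -> t recovers the instantaneous velocity
    return x * (1.0 - math.exp(r - t)) / (t - r)


def sample(x, t_list, velocity):
    """Flow-map Euler sampling: x_r = x_t - (t - r) * u(x_t, t, r)."""
    for t, r in zip(t_list[:-1], t_list[1:]):
        x = x - (t - r) * velocity(x, t, r)
    return x


def uniform_steps(n):
    """Uniform schedule from t = 1 down to t = 0 with n steps (toy choice)."""
    return [1.0 - i / n for i in range(n + 1)]


x_T = 2.0
one = sample(x_T, uniform_steps(1), avg_velocity)    # NFE = 1
four = sample(x_T, uniform_steps(4), avg_velocity)   # NFE = 4
# Plain Euler on the instantaneous velocity, for contrast.
euler_four = sample(x_T, uniform_steps(4), lambda x, t, r: x)
```

With the exact average velocity, `one` and `four` land on the same endpoint, whereas `euler_four` depends on the step count. A trained u_θ only approximates this property.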

Training has two stages, selected via config.loss_config.training_stage:

pretrain: flow-map prediction with a central-difference target
```
target = (eps - x_0) - (t - r) · dF/dt
```
where dF/dt is estimated from the student's own forward at (t ± δ, r). Per-batch sampling assigns r = t to a diffusion_ratio fraction (recovering plain flow matching) and r = 0 to a consistency_ratio fraction (forcing consistency to clean data) — matching the AnyFlow paper.
onpolicy: distribution-matching distillation on top of pretrained flow-map weights. The student sample is generated via a multi-step Euler-flow rollout from pure noise (matching AnyFlow's WanAnyFlowPipeline.training_rollout), with gradients enabled at one randomly chosen step and the remaining steps run under torch.no_grad(). grad_step is broadcast from rank 0 so that all ranks in a distributed run share the same gradient window. The DMD generator update consumes the rollout output through DMD2's VSD + GAN machinery with r = 0 conditioning.
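
The pretrain stage described above can be sketched in plain Python, with scalars standing in for latent tensors. Several details are assumptions rather than facts from this PR: the names `sample_r` and `central_difference_target`, the uniform draw for samples that fall in neither ratio bucket, the linear noising path x_t = (1 - t)·x_0 + t·eps, the default `delta`, and the clamping used as a one-sided fallback near `min_t` / `max_t`.

```python
import random


def sample_r(ts, diffusion_ratio, consistency_ratio, rng):
    """Assign a secondary timestep r to every t in the batch."""
    n = len(ts)
    n_diff = int(round(diffusion_ratio * n))
    n_cons = int(round(consistency_ratio * n))
    rs = []
    for i, t in enumerate(ts):
        if i < n_diff:
            rs.append(t)        # r = t: plain flow matching
        elif i < n_diff + n_cons:
            rs.append(0.0)      # r = 0: consistency to clean data
        else:
            # ASSUMPTION: the remainder is drawn uniformly from [0, t].
            rs.append(rng.uniform(0.0, t))
    return rs


def central_difference_target(student, x0, eps, t, r,
                              delta=0.01, min_t=0.0, max_t=1.0):
    """target = (eps - x0) - (t - r) * dF/dt, with dF/dt by finite differences."""
    # Clamping makes the stencil one-sided when t sits next to a boundary.
    t_hi = min(t + delta, max_t)
    t_lo = max(t - delta, min_t)

    def x_at(s):
        # ASSUMPTION: linear interpolation between clean data and noise.
        return (1.0 - s) * x0 + s * eps

    f_hi = student(x_at(t_hi), t_hi, r)
    f_lo = student(x_at(t_lo), t_lo, r)
    dF_dt = (f_hi - f_lo) / (t_hi - t_lo)
    return (eps - x0) - (t - r) * dF_dt
```

When r = t the correction term vanishes and the target reduces to the flow-matching velocity eps - x_0, which is the behaviour the `diffusion_ratio` bucket relies on.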

Why the Wan backbone needs minimal changes

AnyFlow requires a network that accepts a secondary timestep r. The Wan transformer already supports this via its r_embedder (enable with config.model.net.r_timestep = True), and MeanFlowModel already exercises the same code path. The only Wan-side addition is an r_embedder_fusion: str config flag (default "additive" — preserves MeanFlow / TCM / sCM forwards bit-identical) with a "gated" mode that reproduces AnyFlow's WanTwoTimeTextImageEmbedding.forward_timestep:

rt_emb         = (1 − g) · temb_t + g · temb_r
timestep_proj  = time_proj(silu(rt_emb))

using r_embedder.time_proj (already a deep-copy of condition_embedder.time_proj from Wan.init_embedder) for the shared final projection.
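
A minimal pure-Python sketch of the two fusion modes, to make the difference concrete. The function names `fuse_gated`, `fuse_additive`, `linear`, and `silu`, and the use of plain lists and a bias-free projection, are illustrative assumptions and not the code in fastgen/networks/Wan/network.py.

```python
import math


def silu(v):
    """Element-wise SiLU: x * sigmoid(x)."""
    return [x / (1.0 + math.exp(-x)) for x in v]


def linear(weight, v):
    """Bias-free projection; `weight` is a list of rows (toy stand-in for time_proj)."""
    return [sum(w * x for w, x in zip(row, v)) for row in weight]


def fuse_gated(temb_t, temb_r, proj_w, g=0.25):
    """Gated mode: convex mix BEFORE SiLU, then one shared projection."""
    rt_emb = [(1.0 - g) * a + g * b for a, b in zip(temb_t, temb_r)]
    return rt_emb, linear(proj_w, silu(rt_emb))


def fuse_additive(temb_t, temb_r, proj_t_w, proj_r_w):
    """Additive mode: each branch is activated and projected, then summed."""
    temb = [a + b for a, b in zip(temb_t, temb_r)]
    proj_t = linear(proj_t_w, silu(temb_t))
    proj_r = linear(proj_r_w, silu(temb_r))
    return temb, [p + q for p, q in zip(proj_t, proj_r)]


identity = [[1.0, 0.0], [0.0, 1.0]]
temb_t, temb_r = [1.0, -2.0], [0.5, 3.0]
_, gated_proj = fuse_gated(temb_t, temb_r, identity, g=0.25)
_, additive_proj = fuse_additive(temb_t, temb_r, identity, identity)
```

Even with identical projection weights, `gated_proj` and `additive_proj` differ because SiLU is applied at a different point, so a checkpoint trained with one mode cannot simply be run with the other.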

A remap_anyflow_keys() helper in anyflow.py rewrites the published-checkpoint keys (condition_embedder.delta_embedder.linear_{1,2}.* → r_embedder.time_embedder.linear_{1,2}.* plus a copy of condition_embedder.time_proj.* into r_embedder.time_proj.*) so the FastGen wrapper loads NVIDIA's AnyFlow-Wan2.1-T2V-{1.3B,14B}-Diffusers releases as-is. The helper is a no-op on non-AnyFlow state dicts, so it's safe to call unconditionally.
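
The key rewrite described above can be sketched as follows. This is an illustration under assumptions, not the helper shipped in anyflow.py: the name `remap_keys_sketch` is hypothetical, and real state dicts hold tensors, for which the duplicated `time_proj` entries would normally be cloned rather than shared.

```python
def remap_keys_sketch(state_dict):
    """Rewrite AnyFlow-style embedder keys into the r_embedder layout."""
    src = "condition_embedder.delta_embedder."
    dst = "r_embedder.time_embedder."
    proj_src = "condition_embedder.time_proj."
    proj_dst = "r_embedder.time_proj."

    # No AnyFlow-format keys: return an unchanged copy (the no-op case).
    if not any(src in key for key in state_dict):
        return dict(state_dict)

    remapped = {}
    for key, value in state_dict.items():
        if src in key:
            # delta_embedder.linear_{1,2}.* -> r_embedder.time_embedder.linear_{1,2}.*
            remapped[key.replace(src, dst)] = value
        else:
            remapped[key] = value
            if proj_src in key:
                # Duplicate the shared projection so both copies start identical.
                remapped[key.replace(proj_src, proj_dst)] = value
    return remapped


example = {
    "condition_embedder.delta_embedder.linear_1.weight": 1,
    "condition_embedder.time_proj.weight": 2,
    "blocks.0.attn.weight": 3,
}
converted = remap_keys_sketch(example)
```

Because the function returns early when no `delta_embedder` key is present, calling it on a MeanFlow or DMD2 checkpoint leaves the keys untouched.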

Files

New

fastgen/methods/distribution_matching/anyflow.py — AnyFlowModel(DMD2Model) (both stages, multi-step rollout, weight-remap helper)
fastgen/methods/distribution_matching/anyflow_scheduler.py — lightweight FlowMapDiscreteScheduler for any-step inference (no diffusers.ConfigMixin dependency)
fastgen/configs/methods/config_anyflow.py — method config (inherits DMD2's; adds LossConfig)
fastgen/configs/experiments/WanT2V/config_anyflow.py — Wan2.1-T2V-1.3B reference experiment
tests/test_anyflowmodel.py — 13 unit tests covering both stages, the rollout, and the scheduler

Modified (additive)

fastgen/networks/Wan/network.py — adds r_embedder_fusion and r_embedder_gate_value flags (default keeps MeanFlow et al. bit-identical)
fastgen/methods/__init__.py — +1 import line
fastgen/methods/distribution_matching/README.md — +1 algorithm entry

Test plan

make format && make lint clean (with pinned ruff==0.6.9)
pytest tests/test_anyflowmodel.py — 13/13 passing
No regression — pytest tests/test_dmd2model.py tests/test_meanflowmodel.py still 6/6 passing
Boundary-case test for the central-difference one-sided fallback near min_t / max_t
Rollout test asserts gen_data keeps an autograd graph and backward() reaches student weights
On-policy test exercises both student-update and fake-score/discriminator-update branches
DCO sign-off (git commit -s)

Empirical verification on the published 1.3B and 14B checkpoints (forward equivalence, training-step loss equivalence, and any-step sample videos) is in the follow-up comment.

Out of scope

LoRA-only training mode. AnyFlow's reference repo supports peft LoRA adapters on the student / real_score / discriminator. Doing this properly means wiring LoRA into FastGen across the full model zoo — a focused follow-up PR rather than something to wedge into the core algorithm port.

AnyFlow is an any-step video diffusion method that trains a single model u_theta(x_t, t, r) to predict the average velocity from t back to r, so the same checkpoint supports arbitrary inference NFE.

Training has two stages, switched via config.loss_config.training_stage:

* pretrain: flow-map prediction with a central-difference target
  target = (eps - x0) - (t - r) * dF/dt
  with dF/dt estimated by central differences at (t ± delta). Per-batch sampling assigns r=t to a `diffusion_ratio` fraction (pure flow matching) and r=0 to a `consistency_ratio` fraction (consistency to clean data).
* onpolicy: distribution-matching distillation with r=0 conditioning on top of the pretrained flow-map weights. Inherits DMD2's alternating fake_score / teacher / discriminator updates.

The backbone requirement (a secondary timestep r) is already satisfied by the Wan transformer with r_timestep=True, which MeanFlow also exercises; no Wan-side changes are needed.

New files:
* fastgen/methods/distribution_matching/anyflow.py
* fastgen/methods/distribution_matching/anyflow_scheduler.py
* fastgen/configs/methods/config_anyflow.py
* fastgen/configs/experiments/WanT2V/config_anyflow.py
* tests/test_anyflowmodel.py

Modified:
* fastgen/methods/__init__.py (+1 import)
* fastgen/methods/distribution_matching/README.md (+1 algorithm entry)

The multi-step rollout-with-gradient training (matching self_forcing.py's rollout_with_gradient) is intentionally left for a follow-up PR; the on-policy stage here uses single-step student generation.

Signed-off-by: Enderfga <qq2639135175@gmail.com>

juliusberner · 2026-05-15T17:38:37Z

Thanks a lot for the PR! Did you test the implementation and, if yes, do you have example videos or could you share the wandb run?

AnyFlow's released HF checkpoints store the r-pathway as ``condition_embedder.delta_embedder.*`` inside the shared ``WanTwoTimeTextImageEmbedding`` module and use ONE shared ``time_proj`` for both t and (t, r). Their forward then mixes the two embeddings with a convex combination ``(1 - g) * temb_t + g * temb_r`` before the shared final projection:

rt_emb = (1 - g) * temb_t + g * temb_r
timestep_proj = time_proj(silu(rt_emb))

FastGen's existing r-embedder design (used by MeanFlow) instead has a separate top-level ``r_embedder`` with its own ``time_proj`` and adds ``temb_t + temb_r`` / ``timestep_proj_t + timestep_proj_r`` after the non-linearity. The two layouts are not functionally equivalent because ``silu`` is non-linear.

Two changes:

* ``Wan.__init__``: add ``r_embedder_fusion: str = "additive"`` (default preserves MeanFlow's behaviour) and ``r_embedder_gate_value: float = 0.25``. When ``r_embedder_fusion="gated"``, ``classify_forward_prepare`` computes the convex-mix variant and uses ``r_embedder.time_proj`` (which ``init_embedder`` already deep-copies from ``condition_embedder.time_proj``) for the shared final projection.
* ``fastgen/methods/distribution_matching/anyflow.py``: add ``remap_anyflow_keys`` helper that rewrites AnyFlow's ``condition_embedder.delta_embedder.linear_{1,2}.*`` to FastGen's ``r_embedder.time_embedder.linear_{1,2}.*`` and duplicates ``condition_embedder.time_proj.*`` into ``r_embedder.time_proj.*`` so the two projections start identical. The function is a no-op when no AnyFlow-format keys are present.

Verification (on GMI 2 x H200, gpu-h200-68):

* Forward equivalence on the same inputs (FastGen-loaded vs AnyFlow's own loader): rel mean diff = 2.8% in bf16 (forward noise floor).
* Training-step loss equivalence (AnyFlow ``train_bidirection`` math reproduced inline on both code paths, same seed): AnyFlow loss 0.381619 vs FastGen loss 0.397162, rel diff = 4.07%.
* 4-step Euler-flow inference end-to-end (text encoder + FastGen Wan + VAE decode) produces a finite 81-frame 480x832 video matching the AnyFlow paper's any-step inference pattern.

Signed-off-by: Enderfga <qq2639135175@gmail.com>

Enderfga · 2026-05-16T04:40:39Z

Thanks for the review! Verification is complete on both 1.3B and 14B — inference and training-step accuracy agree to bf16 noise on the published AnyFlow checkpoints.

Inference correctness

Loaded nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers and nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers through FastGen's Wan wrapper using a small remap_anyflow_keys helper (rewrites condition_embedder.delta_embedder.* → r_embedder.time_embedder.* and copies condition_embedder.time_proj.* into r_embedder.time_proj.*).

On identical inputs the FastGen-loaded model agrees with AnyFlow's own loader to within bf16 forward noise (rel mean diff 2.8%, max abs diff 8.7e-2).
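
For reference, one plausible way to compute the two parity metrics quoted above. The comment does not define "rel mean diff", so the normalisation by the mean absolute value of the reference output is an assumption, as are the function names.

```python
def rel_mean_diff(candidate, reference):
    """mean(|candidate - reference|) / mean(|reference|)."""
    num = sum(abs(c - r) for c, r in zip(candidate, reference)) / len(reference)
    den = sum(abs(r) for r in reference) / len(reference)
    return num / den


def max_abs_diff(candidate, reference):
    """Largest element-wise absolute deviation."""
    return max(abs(c - r) for c, r in zip(candidate, reference))


# Toy flattened outputs standing in for the two models' predictions.
fastgen_out = [1.0, 2.0, 3.0]
anyflow_out = [1.0, 2.5, 3.0]
rel = rel_mean_diff(fastgen_out, anyflow_out)
worst = max_abs_diff(fastgen_out, anyflow_out)
```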

Training-step equivalence

Inline replica of AnyFlow's train_bidirection central-difference math on real weights, same seed, same mini-batch through both code paths:

| variant | AnyFlow loss | FastGen loss | rel diff |
|---|---|---|---|
| 1.3B | 0.381619 | 0.397162 | 4.07% |
| 14B | 0.141866 | 0.146120 | 3.00% |
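
The rel diff column can be re-derived from the two loss columns, assuming the AnyFlow loss is used as the reference (that normalisation is inferred from the reported numbers, not stated in the comment):

```python
def rel_diff_percent(reference, other):
    """Relative difference of `other` against `reference`, in percent."""
    return abs(other - reference) / reference * 100.0


reported = {
    "1.3B": (0.381619, 0.397162),  # (AnyFlow loss, FastGen loss)
    "14B": (0.141866, 0.146120),
}
summary = {
    name: f"{rel_diff_percent(anyflow, fastgen):.2f}%"
    for name, (anyflow, fastgen) in reported.items()
}
print(summary)  # {'1.3B': '4.07%', '14B': '3.00%'}
```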

A stub-network comparison of the central-difference target tensor (which isolates the math from the network weights) gives a max abs diff of 1.9e-5 in fp32, so the target computation agrees up to small fp32 numerical error.

Sample videos

Same prompt + seed=0 + 81 frames @ 480×832 + shift=5 + weight_type=beta08 + guidance_scale=1.0 (matching demo.py's default), AnyFlow's own pipeline vs FastGen-loaded:

1.3B NFE=4 — visually indistinguishable from AnyFlow/demo.py output

1p3b_fastgen_nfe4.mp4

14B NFE=4

14b_fastgen_nfe4.mp4

14B NFE=50

14b_fastgen_nfe50.mp4

What this PR changes

fastgen/networks/Wan/network.py (+39 −5): adds r_embedder_fusion: str = "additive" (default unchanged, preserves MeanFlow / TCM / sCM forward bit-identical) / "gated" and r_embedder_gate_value: float = 0.25. When gated, classify_forward_prepare computes rt_emb = (1 − g)·temb_t + g·temb_r before SiLU and uses the shared r_embedder.time_proj (deep-copy of condition_embedder.time_proj per init_embedder) for the final projection — matching WanTwoTimeTextImageEmbedding.forward_timestep in the AnyFlow reference.
fastgen/methods/distribution_matching/anyflow.py (+39): adds remap_anyflow_keys() helper. No-op for non-AnyFlow checkpoints, so safe to call unconditionally.

Both files are additive. Existing methods (MeanFlow, DMD2, CMs, …) keep their previous forward bit-identical.

Re-pushed as commit 03ed6cd on top of the original ef13247 ("Add AnyFlow algorithm").

Replace the single-step student forward in AnyFlow's on-policy stage with a multi-step Euler-flow rollout that enables gradients at one randomly-chosen step. This matches AnyFlow's ``WanAnyFlowPipeline.training_rollout`` (the published on-policy training mode in the reference repo) and gives the DMD generator update a usable gradient through a full denoising window instead of a single forward.

Changes:

* ``AnyFlowModel._rollout_with_gradient(batch_size, dtype, condition)``: start from pure noise at ``ns.max_t``, iterate ``student_sample_steps`` Euler-flow updates with ``r = t_next`` (mean-velocity, matching the reference default), and toggle ``torch.set_grad_enabled`` at the randomly-selected step. ``grad_step`` is broadcast from rank 0 in distributed runs so all ranks share the same gradient window. The step schedule honours ``sample_t_cfg.t_list`` when set, otherwise falls back to ``noise_scheduler.get_t_list``.
* ``_onpolicy_student_update_step`` and ``_onpolicy_fake_score_discriminator_update_step``: source ``gen_data`` from the rollout instead of a single ``self.net(input_student, ...)`` forward. ``input_student`` / ``t_student`` from ``_generate_noise_and_time`` become unused for on-policy and are discarded explicitly.
* ``_get_outputs``: when on-policy, always take the multi-step generator callable path (no longer special-cases ``student_sample_steps == 1`` for the validation hook, since the rollout output is always usable).
* ``tests/test_anyflowmodel.py``: bump ``student_sample_steps`` to 2 in the on-policy fixtures and add ``test_onpolicy_rollout_propagates_gradient`` which asserts the rollout output keeps a usable autograd graph and that ``backward()`` reaches the student weights.

All 13 unit tests pass (`make pytest tests/test_anyflowmodel.py`).

Signed-off-by: Enderfga <qq2639135175@gmail.com>

Enderfga · 2026-05-16T05:22:01Z

Follow-up commit ab1174d replaces the on-policy student's single-step forward with a multi-step Euler-flow rollout, matching AnyFlow's WanAnyFlowPipeline.training_rollout (the published on-policy training mode).

New AnyFlowModel._rollout_with_gradient(batch_size, dtype, condition):

Starts from pure noise at ns.max_t.
Iterates student_sample_steps Euler-flow updates with r = t_next (mean-velocity sampling, matching AnyFlow's use_mean_velocity=True default).
Toggles torch.set_grad_enabled so exactly one randomly-chosen step keeps an autograd record; the rest run under no_grad.
Broadcasts grad_step from rank 0 in distributed runs so all ranks share the same gradient window (mirrors AnyFlow's broadcast(sample_step, src=0)).
Honours sample_t_cfg.t_list when set (so configs can pin the AnyFlow paper's hand-tuned schedule, e.g. [0.999, 0.937, 0.833, 0.624, 0.0] for 4-step Wan); otherwise falls back to noise_scheduler.get_t_list.

The rollout output replaces the single self.net(input_student, ...) forward in both _onpolicy_student_update_step and _onpolicy_fake_score_discriminator_update_step, so the DMD generator update now receives the rollout's gradient through a full denoising window instead of a single forward.
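
A minimal PyTorch sketch of a rollout in which only one step records autograd. It is not the `_rollout_with_gradient` implementation from this PR: `ToyFlowMap` and `rollout_with_gradient` are hypothetical names, and the way the output stays attached to the graph (performing the Euler update outside the no-grad context) is an assumption about one workable design.

```python
import random

import torch


class ToyFlowMap(torch.nn.Module):
    """Hypothetical stand-in for the student u_theta(x_t, t, r)."""

    def __init__(self):
        super().__init__()
        self.scale = torch.nn.Parameter(torch.tensor(0.5))

    def forward(self, x, t, r):
        return self.scale * x


def rollout_with_gradient(net, noise, t_list, grad_step=None):
    """Euler-flow rollout; only the forward at `grad_step` is differentiated."""
    num_steps = len(t_list) - 1
    if grad_step is None:
        # In a distributed run this index would be broadcast from rank 0.
        grad_step = random.randrange(num_steps)
    x = noise
    for i in range(num_steps):
        t, t_next = t_list[i], t_list[i + 1]
        # Network forward: autograd is recorded for exactly one step.
        with torch.set_grad_enabled(i == grad_step):
            u = net(x, t, t_next)  # average velocity from t to r = t_next
        # ASSUMPTION: the update itself runs with grad enabled, so the result
        # stays attached to the graph created at grad_step; velocities from
        # the no-grad steps act as constants.
        x = x - (t - t_next) * u
    return x


t_list = [0.999, 0.937, 0.833, 0.624, 0.0]
student = ToyFlowMap()
gen_data = rollout_with_gradient(student, torch.randn(2, 3), t_list, grad_step=1)
gen_data.sum().backward()  # the gradient reaches student.scale via one forward
```

Keeping a single differentiated forward bounds activation memory to one step regardless of `student_sample_steps`, which is the usual motivation for this pattern.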

Unit tests bumped to 13. The new test_onpolicy_rollout_propagates_gradient asserts gen_data.requires_grad and that backward() reaches the student weights through the chosen step. The forward-equivalence and training-step numbers reported above are unchanged (the rollout only changes the on-policy student-generation procedure; the central-difference target math and DMD2 distillation machinery are the same).

Enderfga · 2026-05-19T05:16:55Z

Hi @juliusberner — gentle ping. 🙏 The verification you asked for is in the follow-up comment (forward parity + training-step parity + sample videos on the published 1.3B and 14B checkpoints), and commit ab1174d adds the multi-step Euler-flow rollout to match AnyFlow's training_rollout (now 13/13 unit tests passing, no regression on DMD2/MeanFlow).

Happy to address any further feedback whenever you have a slot — thanks again for the early review!

Six-stage end-to-end verification, run via single-rank torchrun-less srun on a single H200:

(1) Build FastVideo WanTransformer3DModel with r_embedder=True, r_embedder_fusion=gated, gate=0.25.
(2) Load nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers safetensors and translate keys via WanVideoArchConfig.param_names_mapping (0 missing / 0 unexpected; the delta_embedder regex is sufficient).
(3) Build AnyFlow's reference loader (FAR_Wan_Transformer3DModel).
(4) Forward parity on identical inputs, within bf16 noise.
(5) 4-step Euler-flow sampling smoke test via FlowMapEulerDiscreteScheduler.
(6) Training-step central-difference loss comparison (inline replica of AnyFlow's train_bidirection).

Measured on Wan2.1-T2V-1.3B + nvidia/AnyFlow checkpoint:

forward rel mean diff : 2.55%
forward max abs diff : 7.81e-2
training loss diff : 1.33% (AnyFlow 0.381619 vs FastVideo 0.386694)

Both within bf16 kernel noise. Compare to the FastGen port at NVlabs/FastGen#25, which reported 2.8% forward + 4.07% training-loss on the same checkpoint. FastVideo's tighter result is consistent with FastVideo's attention/normalization implementation having slightly lower kernel noise on H200 than FastGen's.

juliusberner · 2026-05-22T00:22:10Z

Hi @Enderfga,

Thanks a lot for all the evaluations and videos, this is in great shape!

We'll take a closer look soon, but I wanted to ask two questions first:

1. Do you think we could reuse more functionality from our MeanFlow implementation to avoid duplicating code?
2. Did you also try to train for a few hundred iterations (with a small batch size) to check convergence?

cxlcl · 2026-05-22T00:27:44Z

@Enderfga Thanks a lot for the PR and its follow-up!
Is the config tuned for AnyFlow, or is it only a demo that needs further tuning?

Addresses two pieces of PR NVlabs#25 reviewer feedback:

(1) Code sharing with MeanFlow. The previous commit added AnyFlow's gated t/r mixing as an inline branch inside ``classify_forward_prepare``, which made it visually hard to tell which lines were MeanFlow's additive path and which were AnyFlow-specific. This commit factors both fusion modes into a single ``_fuse_r_embedding`` method bound on the transformer (parallel pattern to ``classify_forward_prepare`` and friends). Both paths still share ``r_embedder.time_embedder`` / ``time_proj`` / ``act_fn`` modules; the helper just makes that sharing explicit and shrinks the call site to three lines. Forward semantics are bit-identical to the previous commit for both additive (MeanFlow) and gated (AnyFlow) modes across all three ``encoder_depth`` cases.

(2) Ship a paper-aligned on-policy stage config. Previously the only documented way to run Stage 3 was an inline tweak in the pretrain config docstring. New file ``fastgen/configs/experiments/WanT2V/config_anyflow_onpolicy.py`` inherits the pretrain config and flips the loss into "onpolicy" with the paper's Stage 3 hyperparameters (lr=2e-6, 1200 iter, GAN on at the DMD2-default 0.03, ``student_update_freq=5``). The docstring notes that the AnyFlow paper's rank-256 LoRA variant is not reproduced here because FastGen does not ship a PEFT/LoRA training path; this config is a full-rank fine-tune of a Stage 2 pretrain checkpoint.

The AnyFlow method README is updated to (a) document the new ``r_embedder_fusion="gated"`` requirement when loading the released AnyFlow HF checkpoints, (b) replace the stale "multi-step rollout deferred to a follow-up" note (already landed in ab1174d) with an explicit acknowledgement that end-to-end convergence-scale validation on the paper's training corpus is deferred to a follow-up, and (c) cross-reference both pretrain and on-policy configs.

Tests: all 13 AnyFlow + 3 MeanFlow unit tests pass.

Signed-off-by: Enderfga <qq2639135175@gmail.com>

Enderfga · 2026-05-22T02:05:17Z

Thanks @juliusberner and @cxlcl — pushed commit 1671bb2 that addresses (1) and adds the on-policy config; (2) is scoped explicitly below.

(1) MeanFlow code sharing. Extracted _fuse_r_embedding on the Wan transformer (fastgen/networks/Wan/network.py) so the additive (MeanFlow) and gated (AnyFlow) fusion modes sit side-by-side in one helper instead of being an inline branch inside classify_forward_prepare. Both paths share the same r_embedder.time_embedder / time_proj / act_fn modules — the refactor makes that explicit and shrinks the call site to 3 lines. Forward semantics are bit-identical to the previous commit across all three encoder_depth cases; all 13 AnyFlow + 3 MeanFlow unit tests pass.

(2) Convergence-scale validation. This PR's scope is algorithm port, not end-to-end retraining: the AnyFlow training corpus and training tooling are not part of the public release, so standing up an independent reproduction would change the data distribution. Correctness evidence is therefore algorithmic, not convergence-based:

forward parity within bf16 noise on the released 1.3B / 14B HF ckpts: 2.8% rel mean diff, 8.7e-2 max abs diff
single-step training parity vs AnyFlow's train_bidirection central-difference math at 4.07% (1.3B) / 3.00% (14B) rel loss diff on real weights, same seed and mini-batch through both code paths
stub-network central-difference target match to 1.9e-5 max abs in fp32: the math agrees up to small fp32 numerical error, and the bf16 numbers above reflect the model noise floor, not algorithm drift

The README now states this scope explicitly. Convergence-scale validation on the paper's training corpus is left as a follow-up. Please advise whether that's acceptable for merge or whether you'd prefer to block on end-to-end numbers.

@cxlcl — re: config tuning. fastgen/configs/experiments/WanT2V/config_anyflow.py is the paper's Stage 2 pretrain config 1-for-1 (shift=5, weight_type=beta08, ε=5e-3, lr=5e-5, 6k iter, batch_size_global=32, the 4-step Wan t_list, GAN off in pretrain). config_anyflow_onpolicy.py is added in this commit for Stage 3 (lr=2e-6, 1200 iter, GAN on at the DMD2-default 0.03). Caveat: the paper's Stage 3 uses a rank-256 LoRA adapter, but FastGen does not ship a PEFT/LoRA training path today, so the on-policy config does a full-rank fine-tune on top of a Stage 2 checkpoint — noted in the config docstring.
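
A small numerical check relating `shift=5` to the 4-step t_list quoted earlier in this thread ([0.999, 0.937, 0.833, 0.624, 0.0]). It uses the time-shift formula t' = s·t / (1 + (s − 1)·t) that is common in flow-matching schedulers. Whether FastGen or AnyFlow derives the list this way is an assumption; the sketch only shows that the quoted values sit within about 1e-3 of a shift-5 warp of four uniform steps. The name `shift_timestep` is hypothetical.

```python
def shift_timestep(t, shift):
    """Common flow-matching time shift: t' = shift * t / (1 + (shift - 1) * t)."""
    return shift * t / (1.0 + (shift - 1.0) * t)


uniform = [1.0, 0.75, 0.5, 0.25, 0.0]           # four equal steps from 1 to 0
shifted = [shift_timestep(t, 5.0) for t in uniform]
quoted = [0.999, 0.937, 0.833, 0.624, 0.0]      # t_list quoted in this thread

# Largest gap between the shift-5 warp and the quoted schedule.
max_gap = max(abs(a - b) for a, b in zip(shifted, quoted))
```

The warp concentrates steps near t = 1, which is why three of the four steps start above t = 0.8.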

Enderfga force-pushed the feature/anyflow-algorithm branch from 03ed6cd to 99c0415 on May 16, 2026 04:45

Enderfga mentioned this pull request on May 19, 2026: [feat] Add AnyFlow any-step video distillation (pretrain + on-policy) hao-ai-lab/FastVideo#1371 (open, 4 tasks)

Enderfga wants to merge 4 commits into NVlabs:main from Enderfga:feature/anyflow-algorithm

3 participants
