Add AnyFlow algorithm (any-step video diffusion via flow maps) #25

Open
Enderfga wants to merge 4 commits into NVlabs:main from Enderfga:feature/anyflow-algorithm

@Enderfga Enderfga commented May 15, 2026

Summary

Adds AnyFlow as a new method under fastgen/methods/distribution_matching/anyflow.py. AnyFlow trains a single model u_θ(x_t, t, r) that predicts the average velocity from t back to r, so the same checkpoint supports arbitrary inference NFE.

Training has two stages, selected via config.loss_config.training_stage:

  • pretrain — flow-map prediction with a central-difference target

    target = (eps - x_0) - (t - r) · dF/dt
    

    where dF/dt is estimated from the student's own forward at (t ± δ, r). Per-batch sampling assigns r = t to a diffusion_ratio fraction (recovering plain flow matching) and r = 0 to a consistency_ratio fraction (forcing consistency to clean data) — matching the AnyFlow paper.

  • onpolicy — distribution-matching distillation on top of pretrained flow-map weights. The student is generated via a multi-step Euler-flow rollout from pure noise (matching AnyFlow's WanAnyFlowPipeline.training_rollout), with gradients enabled at one randomly-chosen step and the rest run under torch.no_grad(). grad_step is broadcast from rank 0 so distributed runs share the same gradient window. The DMD generator update consumes the rollout output through DMD2's VSD + GAN machinery with r = 0 conditioning.
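
As a rough illustration of the pretrain target above, the following sketch evaluates it for scalars. It is not the code in anyflow.py: `interpolate`, `central_difference_target`, and `toy_student` are hypothetical names, the linear path x_t = (1 - t) * x_0 + t * eps and the clamping to [min_t, max_t] are assumptions, and `delta=5e-3` is only a placeholder value.

```python
def interpolate(x0, eps, t):
    # Assumed linear noising path: clean data at t=0, pure noise at t=1.
    return (1.0 - t) * x0 + t * eps


def central_difference_target(student, x0, eps, t, r,
                              delta=5e-3, min_t=0.0, max_t=1.0):
    # Clamp the two evaluation times to the valid range. Near a boundary
    # this degrades to a one-sided difference (assumed fallback behaviour).
    t_hi = min(t + delta, max_t)
    t_lo = max(t - delta, min_t)
    # Evaluate the student along the path at (t_hi, r) and (t_lo, r).
    f_hi = student(interpolate(x0, eps, t_hi), t_hi, r)
    f_lo = student(interpolate(x0, eps, t_lo), t_lo, r)
    dF_dt = (f_hi - f_lo) / (t_hi - t_lo)
    # target = (eps - x_0) - (t - r) * dF/dt
    return (eps - x0) - (t - r) * dF_dt


def toy_student(x_t, t, r):
    # Hypothetical stand-in for u_theta(x_t, t, r); any smooth function works.
    return x_t * t + r


x0, eps = 0.5, -1.0
target = central_difference_target(toy_student, x0, eps, t=0.6, r=0.2)
flow_matching_target = central_difference_target(toy_student, x0, eps, t=0.6, r=0.6)
```

With r = t the correction term vanishes and the target reduces to the plain flow-matching velocity eps - x_0, which is the case the diffusion_ratio fraction covers. In a real training loop the target would presumably be computed without gradient tracking; that detail is not shown here.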

Why the Wan backbone needs minimal changes

AnyFlow requires a network that accepts a secondary timestep r. The Wan transformer already supports this via its r_embedder (enable with config.model.net.r_timestep = True), and MeanFlowModel already exercises the same code path. The only Wan-side addition is an r_embedder_fusion: str config flag (default "additive" — preserves MeanFlow / TCM / sCM forwards bit-identical) with a "gated" mode that reproduces AnyFlow's WanTwoTimeTextImageEmbedding.forward_timestep:

rt_emb         = (1 − g) · temb_t + g · temb_r
timestep_proj  = time_proj(silu(rt_emb))

using r_embedder.time_proj (already a deep-copy of condition_embedder.time_proj from Wan.init_embedder) for the shared final projection.
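
To see why the two fusion modes are not interchangeable, here is a scalar sketch. `silu`, `time_proj`, `fuse_additive`, and `fuse_gated` are hypothetical stand-ins (a scalar affine map replaces the real projection layers), and the additive form assumes each embedding goes through SiLU and a projection before the results are summed; the actual Wan code may differ in detail.

```python
import math


def silu(x):
    # SiLU / swish activation: x * sigmoid(x).
    return x / (1.0 + math.exp(-x))


def time_proj(x, weight=2.0, bias=0.1):
    # Toy scalar affine map standing in for the shared final projection.
    return weight * x + bias


def fuse_additive(temb_t, temb_r):
    # Assumed additive mode: project each embedding, then sum the results.
    return time_proj(silu(temb_t)) + time_proj(silu(temb_r))


def fuse_gated(temb_t, temb_r, gate=0.25):
    # Gated mode: convex mix before the non-linearity, then one projection.
    rt_emb = (1.0 - gate) * temb_t + gate * temb_r
    return time_proj(silu(rt_emb))


additive = fuse_additive(1.0, -0.5)
gated = fuse_gated(1.0, -0.5)
```

Because SiLU is non-linear, mixing before the activation and summing after it give different values for the same inputs, so a checkpoint trained with one mode cannot simply be evaluated with the other.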

A remap_anyflow_keys() helper in anyflow.py rewrites the published-checkpoint keys (condition_embedder.delta_embedder.linear_{1,2}.* → r_embedder.time_embedder.linear_{1,2}.*, plus a copy of condition_embedder.time_proj.* into r_embedder.time_proj.*) so the FastGen wrapper loads NVIDIA's AnyFlow-Wan2.1-T2V-{1.3B,14B}-Diffusers releases as-is. The helper is a no-op on non-AnyFlow state dicts, so it's safe to call unconditionally.
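
The key rewrite described above can be sketched on a plain dict. This is an illustrative approximation, not the helper shipped in anyflow.py: the function name `remap_anyflow_keys_sketch` is hypothetical, the values are placeholder strings rather than tensors, and any outer module prefix on real checkpoint keys is ignored.

```python
DELTA_PREFIX = "condition_embedder.delta_embedder."
R_EMBED_PREFIX = "r_embedder.time_embedder."
PROJ_SRC_PREFIX = "condition_embedder.time_proj."
PROJ_DST_PREFIX = "r_embedder.time_proj."


def remap_anyflow_keys_sketch(state_dict):
    # No AnyFlow-format keys present: return an unchanged copy (no-op).
    if not any(key.startswith(DELTA_PREFIX) for key in state_dict):
        return dict(state_dict)

    remapped = {}
    for key, value in state_dict.items():
        if key.startswith(DELTA_PREFIX):
            # Move the r-pathway weights under the r_embedder name.
            remapped[R_EMBED_PREFIX + key[len(DELTA_PREFIX):]] = value
            continue
        remapped[key] = value
        if key.startswith(PROJ_SRC_PREFIX):
            # Duplicate the shared projection so both copies start identical.
            # With real tensors a clone would be needed to avoid aliasing.
            remapped[PROJ_DST_PREFIX + key[len(PROJ_SRC_PREFIX):]] = value
    return remapped


anyflow_ckpt = {
    "condition_embedder.delta_embedder.linear_1.weight": "w1",
    "condition_embedder.time_proj.weight": "proj",
    "blocks.0.attn.weight": "attn",
}
remapped_ckpt = remap_anyflow_keys_sketch(anyflow_ckpt)
plain_ckpt = {"blocks.0.attn.weight": "attn"}
```

Checking for the delta_embedder prefix first is what makes the call safe on checkpoints that never had AnyFlow keys.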

Files

New

  • fastgen/methods/distribution_matching/anyflow.py: AnyFlowModel(DMD2Model) (both stages, multi-step rollout, weight-remap helper)
  • fastgen/methods/distribution_matching/anyflow_scheduler.py — lightweight FlowMapDiscreteScheduler for any-step inference (no diffusers.ConfigMixin dependency)
  • fastgen/configs/methods/config_anyflow.py — method config (inherits DMD2's; adds LossConfig)
  • fastgen/configs/experiments/WanT2V/config_anyflow.py — Wan2.1-T2V-1.3B reference experiment
  • tests/test_anyflowmodel.py — 13 unit tests covering both stages, the rollout, and the scheduler

Modified (additive)

  • fastgen/networks/Wan/network.py — adds r_embedder_fusion and r_embedder_gate_value flags (default keeps MeanFlow et al. bit-identical)
  • fastgen/methods/__init__.py — +1 import line
  • fastgen/methods/distribution_matching/README.md — +1 algorithm entry

Test plan

  • make format && make lint clean (with pinned ruff==0.6.9)
  • pytest tests/test_anyflowmodel.py — 13/13 passing
  • No regression — pytest tests/test_dmd2model.py tests/test_meanflowmodel.py still 6/6 passing
  • Boundary-case test for the central-difference one-sided fallback near min_t / max_t
  • Rollout test asserts gen_data keeps an autograd graph and backward() reaches student weights
  • On-policy test exercises both student-update and fake-score/discriminator-update branches
  • DCO sign-off (git commit -s)

Empirical verification on the published 1.3B and 14B checkpoints (forward equivalence, training-step loss equivalence, and any-step sample videos) is in the follow-up comment.

Out of scope

  • LoRA-only training mode. AnyFlow's reference repo supports peft LoRA adapters on the student / real_score / discriminator. Doing this properly means wiring LoRA into FastGen across the full model zoo — a focused follow-up PR rather than something to wedge into the core algorithm port.

AnyFlow is an any-step video diffusion method that trains a single model
u_theta(x_t, t, r) to predict the average velocity from t back to r, so
the same checkpoint supports arbitrary inference NFE.

Training has two stages, switched via config.loss_config.training_stage:

  * pretrain  — flow-map prediction with a central-difference target
                target = (eps - x0) - (t - r) * dF/dt
                with dF/dt estimated by central differences at (t ± delta).
                Per-batch sampling assigns r=t to a `diffusion_ratio`
                fraction (pure flow matching) and r=0 to a
                `consistency_ratio` fraction (consistency to clean data).

  * onpolicy  — distribution-matching distillation with r=0 conditioning
                on top of the pretrained flow-map weights. Inherits DMD2's
                alternating fake_score / teacher / discriminator updates.

The backbone requirement (a secondary timestep r) is already satisfied by
the Wan transformer with r_timestep=True, which MeanFlow also exercises;
no Wan-side changes are needed.

New files:
  fastgen/methods/distribution_matching/anyflow.py
  fastgen/methods/distribution_matching/anyflow_scheduler.py
  fastgen/configs/methods/config_anyflow.py
  fastgen/configs/experiments/WanT2V/config_anyflow.py
  tests/test_anyflowmodel.py

Modified:
  fastgen/methods/__init__.py                       (+1 import)
  fastgen/methods/distribution_matching/README.md   (+1 algorithm entry)

The multi-step rollout-with-gradient training (matching
self_forcing.py's rollout_with_gradient) is intentionally left for a
follow-up PR — the on-policy stage here uses single-step student
generation.

Signed-off-by: Enderfga <qq2639135175@gmail.com>
@juliusberner
Collaborator

Thanks a lot for the PR! Did you test the implementation and, if yes, do you have example videos or could you share the wandb run?

AnyFlow's released HF checkpoints store the r-pathway as
``condition_embedder.delta_embedder.*`` inside the shared
``WanTwoTimeTextImageEmbedding`` module and use ONE shared ``time_proj``
for both t and (t, r). Their forward then mixes the two embeddings
with a convex combination ``(1 - g) * temb_t + g * temb_r`` before the
shared final projection:

    rt_emb         = (1 - g) * temb_t + g * temb_r
    timestep_proj  = time_proj(silu(rt_emb))

FastGen's existing r-embedder design (used by MeanFlow) instead has a
separate top-level ``r_embedder`` with its own ``time_proj`` and adds
``temb_t + temb_r`` / ``timestep_proj_t + timestep_proj_r`` after the
non-linearity. The two layouts are not functionally equivalent because
``silu`` is non-linear.

Two changes:

* ``Wan.__init__``: add ``r_embedder_fusion: str = "additive"`` (default
  preserves MeanFlow's behaviour) and ``r_embedder_gate_value: float =
  0.25``. When ``r_embedder_fusion="gated"``, ``classify_forward_prepare``
  computes the convex-mix variant and uses ``r_embedder.time_proj``
  (which ``init_embedder`` already deep-copies from
  ``condition_embedder.time_proj``) for the shared final projection.

* ``fastgen/methods/distribution_matching/anyflow.py``: add
  ``remap_anyflow_keys`` helper that rewrites AnyFlow's
  ``condition_embedder.delta_embedder.linear_{1,2}.*`` to FastGen's
  ``r_embedder.time_embedder.linear_{1,2}.*`` and duplicates
  ``condition_embedder.time_proj.*`` into ``r_embedder.time_proj.*``
  so the two projections start identical. The function is a no-op when
  no AnyFlow-format keys are present.

Verification (on GMI 2 x H200, gpu-h200-68):

* Forward equivalence on the same inputs (FastGen-loaded vs AnyFlow's
  own loader): rel mean diff = 2.8% in bf16 (forward noise floor).
* Training-step loss equivalence (AnyFlow ``train_bidirection`` math
  reproduced inline on both code paths, same seed): AnyFlow loss
  0.381619 vs FastGen loss 0.397162, rel diff = 4.07%.
* 4-step Euler-flow inference end-to-end (text encoder + FastGen Wan +
  VAE decode) produces a finite 81-frame 480x832 video matching the
  AnyFlow paper's any-step inference pattern.

Signed-off-by: Enderfga <qq2639135175@gmail.com>
@Enderfga
Author

Enderfga commented May 16, 2026

Thanks for the review! Verification is complete on both 1.3B and 14B — inference and training-step accuracy agree to bf16 noise on the published AnyFlow checkpoints.

Inference correctness

Loaded nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers and nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers through FastGen's Wan wrapper using a small remap_anyflow_keys helper (rewrites condition_embedder.delta_embedder.* → r_embedder.time_embedder.* and copies condition_embedder.time_proj.* into r_embedder.time_proj.*).

On identical inputs the FastGen-loaded model agrees with AnyFlow's own loader to within bf16 forward noise (rel mean diff 2.8%, max abs diff 8.7e-2).
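
For reference, one plausible way to compute the two parity numbers is sketched below. The exact definitions used for the measurements are not stated here, so treat `rel_mean_diff` (mean absolute difference divided by the mean absolute reference value) and `max_abs_diff` as assumptions; the inputs are flat lists of floats rather than tensors.

```python
def rel_mean_diff(candidate, reference):
    # Assumed definition: mean |candidate - reference| / mean |reference|.
    mean_abs_diff = sum(abs(c - r) for c, r in zip(candidate, reference)) / len(reference)
    mean_abs_ref = sum(abs(r) for r in reference) / len(reference)
    return mean_abs_diff / mean_abs_ref


def max_abs_diff(candidate, reference):
    # Largest element-wise absolute deviation.
    return max(abs(c - r) for c, r in zip(candidate, reference))


# Toy outputs standing in for the flattened network predictions.
fastgen_out = [1.0, 2.0, 3.0]
anyflow_out = [1.0, 2.0, 4.0]
rel = rel_mean_diff(fastgen_out, anyflow_out)
worst = max_abs_diff(fastgen_out, anyflow_out)
```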

Training-step equivalence

Inline replica of AnyFlow's train_bidirection central-difference math on real weights, same seed, same mini-batch through both code paths:

| variant | AnyFlow loss | FastGen loss | rel diff |
|---------|--------------|--------------|----------|
| 1.3B | 0.381619 | 0.397162 | 4.07% |
| 14B | 0.141866 | 0.146120 | 3.00% |
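
The relative differences follow from the reported losses if "rel diff" is taken to mean |FastGen - AnyFlow| / AnyFlow (an assumption about the definition; `rel_loss_diff_percent` is a hypothetical helper name):

```python
def rel_loss_diff_percent(anyflow_loss, fastgen_loss):
    # Relative difference with the AnyFlow loss as the reference, in percent.
    return abs(fastgen_loss - anyflow_loss) / anyflow_loss * 100.0


diff_1p3b = rel_loss_diff_percent(0.381619, 0.397162)
diff_14b = rel_loss_diff_percent(0.141866, 0.146120)
```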

A stub-network comparison of the central-difference target tensor (so the math is isolated from network weights) gives max abs diff 1.9e-5 in fp32, so the algorithm is reproduced to within fp32 rounding error.

Sample videos

Same prompt + seed=0 + 81 frames @ 480×832 + shift=5 + weight_type=beta08 + guidance_scale=1.0 (matching demo.py's default), AnyFlow's own pipeline vs FastGen-loaded:

  • 1.3B NFE=4 — visually indistinguishable from AnyFlow/demo.py output
1p3b_fastgen_nfe4.mp4
  • 14B NFE=4
14b_fastgen_nfe4.mp4
  • 14B NFE=50
14b_fastgen_nfe50.mp4

What this PR changes

  • fastgen/networks/Wan/network.py (+39 −5): adds r_embedder_fusion: str = "additive" (the default, which keeps the MeanFlow / TCM / sCM forward bit-identical) with an alternative "gated" mode, plus r_embedder_gate_value: float = 0.25. When gated, classify_forward_prepare computes rt_emb = (1 − g)·temb_t + g·temb_r before SiLU and uses the shared r_embedder.time_proj (deep-copy of condition_embedder.time_proj per init_embedder) for the final projection, matching WanTwoTimeTextImageEmbedding.forward_timestep in the AnyFlow reference.
  • fastgen/methods/distribution_matching/anyflow.py (+39): adds remap_anyflow_keys() helper. No-op for non-AnyFlow checkpoints, so safe to call unconditionally.

Both files are additive. Existing methods (MeanFlow, DMD2, CMs, …) keep their previous forward bit-identical.

Re-pushed as commit 03ed6cd on top of the original ef13247 ("Add AnyFlow algorithm").

@Enderfga Enderfga force-pushed the feature/anyflow-algorithm branch from 03ed6cd to 99c0415 May 16, 2026 04:45
Replace the single-step student forward in AnyFlow's on-policy stage with
a multi-step Euler-flow rollout that enables gradients at one
randomly-chosen step. This matches AnyFlow's
``WanAnyFlowPipeline.training_rollout`` (the published on-policy training
mode in the reference repo) and gives the DMD generator update a usable
gradient through a full denoising window instead of a single forward.

Changes:

* ``AnyFlowModel._rollout_with_gradient(batch_size, dtype, condition)``:
  start from pure noise at ``ns.max_t``, iterate ``student_sample_steps``
  Euler-flow updates with ``r = t_next`` (mean-velocity, matching the
  reference default), and toggle ``torch.set_grad_enabled`` at the
  randomly-selected step. ``grad_step`` is broadcast from rank 0 in
  distributed runs so all ranks share the same gradient window. The step
  schedule honours ``sample_t_cfg.t_list`` when set, otherwise falls back
  to ``noise_scheduler.get_t_list``.

* ``_onpolicy_student_update_step`` and
  ``_onpolicy_fake_score_discriminator_update_step``: source ``gen_data``
  from the rollout instead of a single ``self.net(input_student, ...)``
  forward. ``input_student`` / ``t_student`` from
  ``_generate_noise_and_time`` become unused for on-policy and are
  discarded explicitly.

* ``_get_outputs``: when on-policy, always take the multi-step generator
  callable path (no longer special-cases ``student_sample_steps == 1`` for
  the validation hook, since the rollout output is always usable).

* ``tests/test_anyflowmodel.py``: bump ``student_sample_steps`` to 2 in
  the on-policy fixtures and add ``test_onpolicy_rollout_propagates_gradient``
  which asserts the rollout output keeps a usable autograd graph and that
  ``backward()`` reaches the student weights.

All 13 unit tests pass (`make pytest tests/test_anyflowmodel.py`).

Signed-off-by: Enderfga <qq2639135175@gmail.com>
@Enderfga
Author

Follow-up commit ab1174d replaces the on-policy student's single-step forward with a multi-step Euler-flow rollout, matching AnyFlow's WanAnyFlowPipeline.training_rollout (the published on-policy training mode).

New AnyFlowModel._rollout_with_gradient(batch_size, dtype, condition):

  • Starts from pure noise at ns.max_t.
  • Iterates student_sample_steps Euler-flow updates with r = t_next (mean-velocity sampling, matching AnyFlow's use_mean_velocity=True default).
  • Toggles torch.set_grad_enabled so exactly one randomly-chosen step keeps an autograd record; the rest run under no_grad.
  • Broadcasts grad_step from rank 0 in distributed runs so all ranks share the same gradient window (mirrors AnyFlow's broadcast(sample_step, src=0)).
  • Honours sample_t_cfg.t_list when set (so configs can pin the AnyFlow paper's hand-tuned schedule, e.g. [0.999, 0.937, 0.833, 0.624, 0.0] for 4-step Wan); otherwise falls back to noise_scheduler.get_t_list.
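
The control flow of the rollout can be sketched without a deep-learning framework. The code below is an illustration under stated assumptions, not `_rollout_with_gradient` itself: `rollout_sketch` and `toy_avg_velocity` are hypothetical, the Euler-flow update is assumed to be x_next = x - (t - t_next) * u(x, t, r=t_next), and gradient handling is reduced to a per-step flag.

```python
import random


def rollout_sketch(x_start, t_list, avg_velocity, rng):
    num_steps = len(t_list) - 1
    # One randomly chosen step keeps gradients. In a distributed run this
    # index would be broadcast from rank 0 so every rank uses the same step.
    grad_step = rng.randrange(num_steps)
    x = x_start
    grad_flags = []
    for i in range(num_steps):
        t, t_next = t_list[i], t_list[i + 1]
        # In PyTorch this step would run inside
        # torch.set_grad_enabled(i == grad_step); here we only record the flag.
        grad_flags.append(i == grad_step)
        u = avg_velocity(x, t, t_next)  # mean velocity from t back to r = t_next
        x = x - (t - t_next) * u        # assumed Euler-flow update
    return x, grad_flags


CLEAN_SAMPLE = 2.0


def toy_avg_velocity(x_t, t, r):
    # Exact average velocity when all data sits at CLEAN_SAMPLE and the path
    # is x_t = (1 - t) * x_0 + t * eps, so eps - x_0 = (x_t - x_0) / t.
    return (x_t - CLEAN_SAMPLE) / t


rng = random.Random(0)
t_list = [0.999, 0.937, 0.833, 0.624, 0.0]  # 4-step schedule quoted above
final_x, grad_flags = rollout_sketch(rng.gauss(0.0, 1.0), t_list, toy_avg_velocity, rng)
```

With an exact average velocity the last step (r = 0) lands on the clean sample regardless of the starting noise, and exactly one entry of `grad_flags` is true.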

The rollout output replaces the single self.net(input_student, ...) forward in both _onpolicy_student_update_step and _onpolicy_fake_score_discriminator_update_step, so the DMD generator update now receives the rollout's gradient through a full denoising window instead of a single forward.

Unit tests bumped to 13. The new test_onpolicy_rollout_propagates_gradient asserts gen_data.requires_grad and that backward() reaches the student weights through the chosen step. The forward-equivalence and training-step numbers reported above are unchanged (the rollout only changes the on-policy student-generation procedure; the central-difference target math and DMD2 distillation machinery are the same).

@Enderfga
Author

Hi @juliusberner — gentle ping. 🙏 The verification you asked for is in the follow-up comment (forward parity + training-step parity + sample videos on the published 1.3B and 14B checkpoints), and commit ab1174d adds the multi-step Euler-flow rollout to match AnyFlow's training_rollout (now 13/13 unit tests passing, no regression on DMD2/MeanFlow).

Happy to address any further feedback whenever you have a slot — thanks again for the early review!

Enderfga added a commit to Enderfga/FastVideo that referenced this pull request May 19, 2026
Six-stage end-to-end verification, run via single-rank torchrun-less
srun on a single H200:

(1) Build FastVideo WanTransformer3DModel with r_embedder=True,
    r_embedder_fusion=gated, gate=0.25.
(2) Load nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers safetensors and
    translate keys via WanVideoArchConfig.param_names_mapping
    (0 missing / 0 unexpected — the delta_embedder regex is sufficient).
(3) Build AnyFlow's reference loader (FAR_Wan_Transformer3DModel).
(4) Forward parity on identical inputs — bf16 noise.
(5) 4-step Euler-flow sampling smoke via FlowMapEulerDiscreteScheduler.
(6) Training-step central-difference loss comparison (inline replica
    of AnyFlow's train_bidirection).

Measured on Wan2.1-T2V-1.3B + nvidia/AnyFlow checkpoint:
  forward rel mean diff : 2.55%
  forward max abs diff  : 7.81e-2
  training loss diff    : 1.33% (AnyFlow 0.381619 vs FastVideo 0.386694)

Both within bf16 kernel noise. Compare to the FastGen port at
NVlabs/FastGen#25 which reported 2.8% forward + 4.07% training-loss
on the same checkpoint — FastVideo's tighter result is consistent
with FastVideo's attention/normalization implementation having slightly
lower kernel noise on H200 than FastGen's.
@juliusberner
Collaborator

Hi @Enderfga,

Thanks a lot for all the evaluations and videos, this is in great shape!

We'll take a closer look soon, but I wanted to ask two questions first:

  1. Do you think we could re-use more functionality from our MeanFlow implementation to not duplicate code?
  2. Did you also try to train for a few hundred iterations (with a small batch size) to check convergence?

@cxlcl
Collaborator

cxlcl commented May 22, 2026

@Enderfga Thanks a lot for the PR and its follow-up!
Is the config tuned for AnyFlow, or is it only a demo config that needs further tuning?

Addresses two pieces of PR NVlabs#25 reviewer feedback:

(1) Code sharing with MeanFlow. The previous commit added AnyFlow's
gated t/r mixing as an inline branch inside
``classify_forward_prepare``, which made it visually hard to tell which
lines were MeanFlow's additive path and which were AnyFlow-specific.
This commit factors both fusion modes into a single
``_fuse_r_embedding`` method bound on the transformer (parallel pattern
to ``classify_forward_prepare`` and friends). Both paths still share
``r_embedder.time_embedder`` / ``time_proj`` / ``act_fn`` modules — the
helper just makes that sharing explicit and shrinks the call site to
three lines. Forward semantics are bit-identical to the previous
commit for both additive (MeanFlow) and gated (AnyFlow) modes across
all three ``encoder_depth`` cases.

(2) Ship a paper-aligned on-policy stage config. Previously the only
documented way to run Stage 3 was an inline tweak in the pretrain
config docstring. New file
``fastgen/configs/experiments/WanT2V/config_anyflow_onpolicy.py``
inherits the pretrain config and flips the loss into "onpolicy" with
the paper's Stage 3 hyperparameters (lr=2e-6, 1200 iter, GAN on at the
DMD2-default 0.03, ``student_update_freq=5``). The docstring notes
that the AnyFlow paper's rank-256 LoRA variant is not reproduced here
because FastGen does not ship a PEFT/LoRA training path; this config
is a full-rank fine-tune of a Stage 2 pretrain checkpoint.

The AnyFlow method README is updated to (a) document the new
``r_embedder_fusion="gated"`` requirement when loading the released
AnyFlow HF checkpoints, (b) replace the stale "multi-step rollout
deferred to a follow-up" note (already landed in ab1174d) with an
explicit acknowledgement that end-to-end convergence-scale validation
on the paper's training corpus is deferred to a follow-up, and (c)
cross-reference both pretrain and on-policy configs.

Tests: all 13 AnyFlow + 3 MeanFlow unit tests pass.
Signed-off-by: Enderfga <qq2639135175@gmail.com>
@Enderfga
Author

Thanks @juliusberner and @cxlcl — pushed commit 1671bb2 that addresses (1) and adds the on-policy config; (2) is scoped explicitly below.

(1) MeanFlow code sharing. Extracted _fuse_r_embedding on the Wan transformer (fastgen/networks/Wan/network.py) so the additive (MeanFlow) and gated (AnyFlow) fusion modes sit side-by-side in one helper instead of being an inline branch inside classify_forward_prepare. Both paths share the same r_embedder.time_embedder / time_proj / act_fn modules — the refactor makes that explicit and shrinks the call site to 3 lines. Forward semantics are bit-identical to the previous commit across all three encoder_depth cases; all 13 AnyFlow + 3 MeanFlow unit tests pass.

(2) Convergence-scale validation. This PR's scope is an algorithm port, not end-to-end retraining: the AnyFlow training corpus and training tooling are not part of the public release, so an independent reproduction would have to train on a different data distribution. Correctness evidence is therefore algorithmic, not convergence-based:

  • forward parity within bf16 noise on the released 1.3B / 14B HF ckpts: 2.8% rel mean diff, 8.7e-2 max abs diff
  • single-step training parity vs AnyFlow's train_bidirection central-difference math at 4.07% (1.3B) / 3.00% (14B) rel loss diff on real weights, same seed and mini-batch through both code paths
  • stub-network central-difference target match to 1.9e-5 max abs in fp32, i.e. the math is reproduced to within fp32 rounding; the bf16 numbers above are the model noise floor, not algorithm drift

The README now states this scope explicitly. Convergence-scale validation on the paper's training corpus is left as a follow-up. Please advise whether that's acceptable for merge or whether you'd prefer to block on end-to-end numbers.

@cxlcl — re: config tuning. fastgen/configs/experiments/WanT2V/config_anyflow.py is the paper's Stage 2 pretrain config 1-for-1 (shift=5, weight_type=beta08, ε=5e-3, lr=5e-5, 6k iter, batch_size_global=32, the 4-step Wan t_list, GAN off in pretrain). config_anyflow_onpolicy.py is added in this commit for Stage 3 (lr=2e-6, 1200 iter, GAN on at the DMD2-default 0.03). Caveat: the paper's Stage 3 uses a rank-256 LoRA adapter, but FastGen does not ship a PEFT/LoRA training path today, so the on-policy config does a full-rank fine-tune on top of a Stage 2 checkpoint — noted in the config docstring.
