Fix: per-frame timestep allocation for video2world (image2world) mode in CosmosPredict2 #26

Open
csy2077 wants to merge 1 commit into NVlabs:main from csy2077:fix/cosmos-predict2-video2world-per-frame-timestep

csy2077 commented May 16, 2026

Problem

In CosmosPredict2.forward(), when running in video2world (image2world) mode, the clean conditioning frames were not assigned a timestep of 0.0. Instead, the model received the same noisy timestep for every frame, including the conditioning frame(s). As a result, the model treated the clean first frame as if it carried the same noise level as the frames being generated, leading to:

  • Dramatic quality degradation in generated videos
  • Loss of temporal coherence with the conditioning frame
  • Effectively broken video2world / image2world distillation

Fix

After replacing the conditioning frames in model_input, expand t to per-frame shape (B, T) and zero out the timestep for conditioning frames (indicated by condition_mask). This tells the transformer that those frames are already clean and require no denoising:

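# t holds one timestep per sample, shape (B,); broadcast it to one per frame.
t_expanded = t.unsqueeze(1).expand(B, T)
# Reduce the mask to (B, T) by taking index 0 along dims 1, 3, and 4.
mask_B_T = condition_mask[:, 0, :, 0, 0]  # (B, T)
# Conditioning frames (mask = 1) get t = 0; all other frames keep their timestep.
t = t_expanded * (1 - mask_B_T)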

The transformer already accepts timesteps_B_T of shape (B, T), so no other changes are needed.
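
For readers who want to try the idea in isolation, here is a minimal standalone sketch of the same masking step on a toy batch. It is not the code in CosmosPredict2.forward(): the helper name per_frame_timesteps, the (B, 1, T, H, W) mask layout, and the toy sizes are assumptions made for illustration. The sketch also casts the mask to the dtype of t, because PyTorch does not allow `1 - mask` on a boolean tensor; whether condition_mask is ever boolean in the real code is not stated in this PR.

```python
import torch


def per_frame_timesteps(t: torch.Tensor, condition_mask: torch.Tensor) -> torch.Tensor:
    """Expand a per-sample timestep to (B, T) and zero it on conditioning frames.

    t:              (B,) one noise level per sample.
    condition_mask: (B, 1, T, H, W), 1 where a frame is a clean conditioning
                    frame (layout assumed for this sketch).
    """
    B, _, T, _, _ = condition_mask.shape
    # Broadcast the per-sample timestep to every frame of that sample.
    t_expanded = t.unsqueeze(1).expand(B, T)
    # The mask is assumed constant over channel and spatial dims, so one
    # entry per frame is enough. The cast keeps `1 - mask` valid for bool masks.
    mask_B_T = condition_mask[:, 0, :, 0, 0].to(t.dtype)
    # Conditioning frames get t = 0; generated frames keep their timestep.
    return t_expanded * (1 - mask_B_T)


# Toy batch: 2 samples, 4 frames, the first frame is the conditioning frame.
t = torch.tensor([0.7, 0.3])
condition_mask = torch.zeros(2, 1, 4, 8, 8)
condition_mask[:, :, 0] = 1.0

timesteps_B_T = per_frame_timesteps(t, condition_mask)
print(timesteps_B_T)
```

In the printed tensor, the first column (the conditioning frame) is zero for both samples, while the remaining frames keep each sample's original timestep.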

Impact

Without this fix, CosmosPredict2 video2world distillation produces incoherent videos that ignore the conditioning frame. With this fix, the model correctly preserves the conditioning frame and generates temporally consistent video.

In video2world (image2world) mode, conditioning frames were receiving the
same noisy timestep as all other frames. This caused the transformer to
treat the clean conditioning frame as a fully-noised input, breaking
temporal coherence and causing severe quality degradation.

Fix: after replacing conditioning frames in model_input, expand t to
shape (B, T) and zero out timesteps for frames where condition_mask=1,
signaling to the model that those frames are already clean (t=0).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>