Fix: per-frame timestep allocation for video2world (image2world) mode in CosmosPredict2 #26
Open
csy2077 wants to merge 1 commit into
In video2world (image2world) mode, conditioning frames were receiving the same noisy timestep as all other frames. This caused the transformer to treat the clean conditioning frame as a fully-noised input, breaking temporal coherence and causing severe quality degradation.
Fix: after replacing conditioning frames in model_input, expand t to shape (B, T) and zero out timesteps for frames where condition_mask=1, signaling to the model that those frames are already clean (t=0).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Problem
In CosmosPredict2.forward(), when running in video2world (image2world) mode, the conditioning (clean) frames were not assigned a special timestep of 0.0. Instead, the model received the same noisy timestep for all frames, including the clean conditioning frame(s). This caused the model to treat the clean first frame as a fully-noised frame, breaking temporal coherence and causing severe quality degradation.
Fix
After replacing the conditioning frames in model_input, expand t to per-frame shape (B, T) and zero out the timestep for conditioning frames (indicated by condition_mask). This tells the transformer that those frames are already clean and require no denoising. The transformer already accepts timesteps_B_T of shape (B, T), so no other changes are needed.
Impact
Without this fix, CosmosPredict2 video2world distillation produces incoherent videos that ignore the conditioning frame. With this fix, the model correctly preserves the conditioning frame and generates temporally consistent video.
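The per-frame timestep step described under Fix can be illustrated with a minimal sketch. This is not the code from the pull request: NumPy stands in for the PyTorch tensors, the helper name per_frame_timesteps is hypothetical, and condition_mask is assumed to already have shape (B, T) with 1 marking a conditioning frame (the real mask may carry extra channel or spatial dimensions that would need to be squeezed first).

```python
import numpy as np


def per_frame_timesteps(t, condition_mask):
    """Expand a per-sample timestep to (B, T) and zero it on conditioning frames.

    Hypothetical helper for illustration only.
    t:              shape (B,), one noise timestep per sample.
    condition_mask: shape (B, T), 1 where a frame is a clean conditioning frame
                    (assumed layout; not taken from the PR).
    """
    t = np.asarray(t, dtype=np.float32)
    mask = np.asarray(condition_mask, dtype=np.float32)
    # Broadcast (B,) -> (B, T) so every frame starts with its sample's timestep.
    t_B_T = np.broadcast_to(t[:, None], mask.shape)
    # Conditioning frames (mask == 1) get t = 0; all other frames keep t.
    return t_B_T * (1.0 - mask)


# Two samples with four frames each; only the first frame is conditioning.
t = np.array([0.8, 0.3])
condition_mask = np.array([[1, 0, 0, 0],
                           [1, 0, 0, 0]])
timesteps_B_T = per_frame_timesteps(t, condition_mask)
print(timesteps_B_T)
```

In this sketch the first column of timesteps_B_T is zero for both samples, while the remaining frames keep their sample's noisy timestep, which is the signal the description says the transformer needs.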