Official Release: AMD Strix Point Native Optimization and Director St…#796

Open
Cyb3rLab5 wants to merge 13 commits into lllyasviel:main from Cyb3rLab5:main

@Cyb3rLab5

…udio

Bobby Jackson and others added 13 commits December 18, 2025 01:07
Co-authored-by: Cyb3rLab5 <224908985+Cyb3rLab5@users.noreply.github.com>
Cached the weight and bias tensors in `vae_decode_fake` based on device and dtype. This prevents re-allocation and conversion on every call, significantly reducing overhead in hot loops.

Co-authored-by: Cyb3rLab5 <224908985+Cyb3rLab5@users.noreply.github.com>
Refactored `HunyuanVideoRotaryPosEmbed` to calculate Rotary Positional Embeddings (RoPE) separately for each 1D component (T, H, W). PyTorch's zero-copy `.expand()` and views are then used to map these vectors into 3D. This eliminates the redundant 3D grid construction (via `torch.meshgrid`) and the sequential loop over batches.

A performance benchmark shows a ~4.3x forward-pass speedup for batched coordinate generation.
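A minimal sketch of the idea, assuming a simplified RoPE that only produces rotation angles: each axis gets its own 1D table, which is broadcast to the 3D grid with `.expand()`. The function names, the `dims` split, and `theta` are illustrative assumptions and are not taken from `HunyuanVideoRotaryPosEmbed`. A `torch.meshgrid` version is included only to check that both paths agree.

```python
import torch


def axis_angles(n: int, dim: int, theta: float = 10000.0) -> torch.Tensor:
    """Rotation angles for one axis, shape (n, dim // 2)."""
    exponents = torch.arange(0, dim, 2, dtype=torch.float32) / dim
    inv_freq = 1.0 / (theta ** exponents)
    pos = torch.arange(n, dtype=torch.float32)
    return torch.outer(pos, inv_freq)


def rope_angles_3d(t: int, h: int, w: int, dims=(4, 6, 6)) -> torch.Tensor:
    """Per-axis 1D tables broadcast to (t, h, w, sum(dims) // 2)."""
    # expand() returns zero-stride views, so no 3D grid is allocated here.
    at = axis_angles(t, dims[0])[:, None, None, :].expand(t, h, w, -1)
    ah = axis_angles(h, dims[1])[None, :, None, :].expand(t, h, w, -1)
    aw = axis_angles(w, dims[2])[None, None, :, :].expand(t, h, w, -1)
    # The only full-size allocation is this final concatenation.
    return torch.cat([at, ah, aw], dim=-1)


def rope_angles_3d_meshgrid(t: int, h: int, w: int,
                            dims=(4, 6, 6), theta: float = 10000.0) -> torch.Tensor:
    """Reference path that materializes three full 3D coordinate grids."""
    grids = torch.meshgrid(
        torch.arange(t, dtype=torch.float32),
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    parts = []
    for grid, dim in zip(grids, dims):
        exponents = torch.arange(0, dim, 2, dtype=torch.float32) / dim
        inv_freq = 1.0 / (theta ** exponents)
        parts.append(grid[..., None] * inv_freq)
    return torch.cat(parts, dim=-1)
```

The two functions compute the same angles; the first does `t + h + w` rows of frequency work instead of `3 * t * h * w`, which is the kind of saving the commit message describes.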

Co-authored-by: Cyb3rLab5 <224908985+Cyb3rLab5@users.noreply.github.com>
…6079084948

⚡ Bolt: [performance improvement] 1D RoPE Expand
…858351843

⚡ Bolt: [performance improvement] Cache tensors in vae_decode_fake
Batch the `torch.cuda.memory_stats` queries so that the check runs only once every 25 modules, significantly reducing CPU-GPU synchronization stalls.
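The throttling described above can be sketched as a small wrapper that refreshes a reading only on every Nth call and returns the cached value in between. `ThrottledPoller` and its `query` callback are hypothetical names for illustration; in the PR the query would be the `torch.cuda.memory_stats` lookup, and the interval would be 25.

```python
from typing import Callable, Optional


class ThrottledPoller:
    """Run an expensive query once every `every` calls; reuse it otherwise."""

    def __init__(self, query: Callable[[], int], every: int = 25) -> None:
        if every < 1:
            raise ValueError("every must be at least 1")
        self._query = query
        self._every = every
        self._calls = 0
        self._last: Optional[int] = None
        self.queries = 0  # number of real queries issued, for inspection

    def read(self) -> int:
        # Refresh on call 0, every, 2 * every, and so on.
        if self._calls % self._every == 0:
            self._last = self._query()
            self.queries += 1
        self._calls += 1
        assert self._last is not None  # set on the very first call
        return self._last
```

The trade-off is staleness: between refreshes the reading can lag by up to `every - 1` modules, so any threshold compared against it probably needs a safety margin.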

Co-authored-by: Cyb3rLab5 <224908985+Cyb3rLab5@users.noreply.github.com>
…head-55594135257129418

⚡ Bolt: Optimize VRAM polling overhead
…27942520

⚡ Bolt: [performance improvement]