Nodes to run Hunyuan Image 3 locally with BF16 and NF4 quantized options in ComfyUI
Stateless LLM runtime that dynamically routes, loads, executes, and unloads models per request with bounded VRAM caching and intelligent model selection.
ComfyUI custom node that controls the order of node execution, linearly routing any data type through unlimited I/O slots, with the option to free VRAM and RAM at any point in a workflow. Its device-agnostic memory-management utilities, managed by ComfyUI, safely unload all models while preserving all connected data and models through to the next node.
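For orientation, a stripped-down sketch of a ComfyUI passthrough node in this spirit: it forwards a value unchanged while unloading models and emptying caches through ComfyUI's comfy.model_management. The node name and wildcard input are illustrative assumptions, the real node exposes far richer I/O, and the code only runs inside a ComfyUI install:

```python
# Minimal ComfyUI custom-node sketch (illustrative, not this repo's node):
# pass any value through while asking ComfyUI to unload models and free caches.
import comfy.model_management as mm

class FreeMemoryPassthrough:
    @classmethod
    def INPUT_TYPES(cls):
        # "*" is the common community wildcard convention for any-type inputs.
        return {"required": {"anything": ("*",)}}

    RETURN_TYPES = ("*",)
    FUNCTION = "run"
    CATEGORY = "utils"

    def run(self, anything):
        mm.unload_all_models()   # device-agnostic model unload
        mm.soft_empty_cache()    # release cached VRAM/RAM back to the system
        return (anything,)       # preserve connected data for the next node

NODE_CLASS_MAPPINGS = {"FreeMemoryPassthrough": FreeMemoryPassthrough}
```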
KeSSie: HUGE-context semantic recall for large language models
One-click installer for ACE-Step 1.5
Ultra-Low Bit KV-Cache Compression optimization layer built on top of llama.cpp for LLM inference. Reduces VRAM overhead by ~75-80% using custom CUDA kernels.
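The project's kernels are CUDA, but the core idea (group-wise low-bit quantization of the KV cache) can be sketched in plain PyTorch. The group size, bit width, and function names below are illustrative assumptions, not the repo's API; real kernels also pack four 2-bit codes per byte, while this sketch stores one code per uint8 for clarity:

```python
import torch

def quantize_kv(kv: torch.Tensor, group: int = 64, bits: int = 2):
    """Quantize the KV cache in groups: int codes plus per-group scale/offset."""
    levels = 2 ** bits - 1                      # 3 quantization steps for 2-bit
    x = kv.reshape(-1, group)
    lo = x.min(dim=1, keepdim=True).values
    hi = x.max(dim=1, keepdim=True).values
    scale = (hi - lo).clamp_min(1e-8) / levels  # per-group step size
    codes = ((x - lo) / scale).round().clamp(0, levels).to(torch.uint8)
    return codes, scale, lo, kv.shape

def dequantize_kv(codes, scale, lo, shape):
    return (codes.float() * scale + lo).reshape(shape)

kv = torch.randn(2, 8, 128, 64)                 # (batch, heads, seq, head_dim)
codes, scale, lo, shape = quantize_kv(kv)
err = (dequantize_kv(codes, scale, lo, shape) - kv).abs().mean()
print(f"mean abs reconstruction error: {err:.4f}")
```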
Predictive VRAM Virtualization Engine
NMOS (Neural Memory OS) is a predictive partial execution engine enabling 70B-level reasoning on 4GB VRAM. It uses the “Zero-Lag” hypothesis, leveraging typing latency as a compute window to mask memory limits via async layer prefetching and speculative decoding.
INT8 Sparse Tensor Core GEMM for PyTorch — built for Windows
LEMA (Layer-wise Efficient Memory Abstraction): A hardware-aware framework for fine-tuning LLMs in VRAM-constrained environments using asynchronous binary pre-fetching and triple-tier memory orchestration.
PyTorch/Hugging Face batching utility that sorts variable-length text by difficulty, then dynamically increases batch size on easier samples using a pre-trained VRAM predictor to improve GPU utilization and throughput while reducing OOM risk with fallback handling.
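A minimal sketch of that idea, assuming token count as the difficulty proxy and a hypothetical linear predict_vram() standing in for the utility's trained predictor:

```python
from typing import Iterator

def predict_vram(batch_size: int, max_len: int) -> float:
    # Hypothetical linear model (made-up coefficients), in GB.
    return batch_size * (0.002 * max_len + 0.05)

def dynamic_batches(texts: list[str], budget_gb: float = 10.0) -> Iterator[list[str]]:
    order = sorted(texts, key=len)        # "difficulty" proxied by length
    i = 0
    while i < len(order):
        bs, max_len = 1, len(order[i])
        # Grow the batch while the predictor says we stay under budget.
        while i + bs < len(order):
            nxt = max(max_len, len(order[i + bs]))
            if predict_vram(bs + 1, nxt) > budget_gb:
                break
            bs, max_len = bs + 1, nxt
        yield order[i : i + bs]
        i += bs

for batch in dynamic_batches(["hi", "a" * 50, "b" * 400, "c" * 5000]):
    print(len(batch), max(len(t) for t in batch))
```

On a real OOM the utility also falls back rather than crashing, e.g. by halving the batch and retrying; that path is omitted here.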
A Proof of Concept for the LEMA (Layer-wise Efficient Memory Abstraction) framework. Enables stable fine-tuning of Llama-2-7B on consumer-grade hardware (16GB VRAM) through layer-wise weight streaming and triple-buffer memory virtualization.
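The triple-tier orchestration is more involved, but the core streaming loop can be sketched with two CUDA streams: copy layer i+1 host-to-GPU while layer i computes. The double buffer and names are simplifying assumptions, and a CUDA device is required:

```python
import torch
import torch.nn as nn

layers = [nn.Linear(4096, 4096) for _ in range(8)]       # weights live on CPU
for layer in layers:
    for p in layer.parameters():
        p.data = p.data.pin_memory()                      # pinned RAM enables async copies

copy_stream = torch.cuda.Stream()

def prefetch(layer):
    with torch.cuda.stream(copy_stream):                  # overlaps with compute
        return [p.data.to("cuda", non_blocking=True) for p in layer.parameters()]

x = torch.randn(16, 4096, device="cuda")
staged = prefetch(layers[0])
for i in range(len(layers)):
    torch.cuda.current_stream().wait_stream(copy_stream)  # weights are ready
    w, b = staged
    if i + 1 < len(layers):
        staged = prefetch(layers[i + 1])                  # stream in the next layer
    x = torch.nn.functional.linear(x, w, b)               # compute the current layer
print(x.shape)
# A production version also needs record_stream() calls so the caching
# allocator does not reuse cross-stream memory too early.
```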
Turn your PC into a private, autonomous AI lab without melting your GPU.
Know before you train — VRAM estimation for LLM fine-tuning.
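A back-of-the-envelope version of what such an estimator computes, using standard rules of thumb (half-precision weights and gradients, two fp32 Adam states plus an fp32 master copy per parameter) rather than this repo's exact model; activation memory varies widely with architecture and checkpointing, so it is a flat assumption here:

```python
def estimate_vram_gb(params_b: float, dtype_bytes: int = 2,
                     optimizer_states: int = 2, activation_gb: float = 4.0) -> float:
    p = params_b * 1e9
    weights   = p * dtype_bytes            # bf16/fp16 weights
    grads     = p * dtype_bytes            # same-precision gradients
    optimizer = p * 4 * optimizer_states   # Adam: fp32 momentum + variance
    master    = p * 4                      # fp32 master copy (mixed precision)
    return (weights + grads + optimizer + master) / 1e9 + activation_gb

# 7B full fine-tune: 14 + 14 + 56 + 28 = 112 GB of states, plus activations.
print(f"{estimate_vram_gb(7):.0f} GB")     # ~116 GB, i.e. not a single-GPU job
```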
Accelerate INT8 sparse inference in PyTorch on Windows with minimal setup. Achieve high performance using Sparse Tensor Cores without Linux dependencies.
Compress the cache. Keep the quality.
ComfyUI Custom Node for running Transformer LLMs with zero dependency conflicts. Provides device-agnostic VRAM/RAM cleanup options post-run & optional dynamic LLM prompt formatting with variable inputs of any type.
Tiered GPU memory architecture for consumer AI inference. VRAM as execution cache, system RAM as passive staging layer.
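That split can be sketched as an LRU cache of GPU tensors backed by pinned host memory; the capacity, eviction policy, and names below are illustrative assumptions, and a CUDA device is required:

```python
from collections import OrderedDict
import torch

class TieredCache:
    """VRAM as execution cache, pinned system RAM as passive staging layer."""
    def __init__(self, capacity_bytes: int):
        self.capacity, self.used = capacity_bytes, 0
        self.gpu = OrderedDict()                # key -> GPU tensor (LRU order)
        self.ram = {}                           # key -> pinned host tensor

    def put(self, key, t):
        self.ram[key] = t.pin_memory()          # stage in host RAM

    def get(self, key):
        if key in self.gpu:                     # VRAM hit
            self.gpu.move_to_end(key)
            return self.gpu[key]
        t = self.ram[key].to("cuda", non_blocking=True)
        size = t.numel() * t.element_size()
        while self.used + size > self.capacity and self.gpu:
            _, old = self.gpu.popitem(last=False)          # evict least recent
            self.used -= old.numel() * old.element_size()  # host copy remains
        self.gpu[key] = t
        self.used += size
        return t

cache = TieredCache(capacity_bytes=256 << 20)   # 256 MiB VRAM budget
cache.put("w0", torch.randn(1024, 1024))
print(cache.get("w0").device)                   # cuda:0
```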
Sticky-block topology lottery scheduler for transformer fine-tuning. Less VRAM, less wall-clock, bigger models.
Predictive large language model inference with memory prefetching and speculative decoding for faster reasoning on low-VRAM hardware
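The acceptance rule at the heart of speculative decoding is easy to show at the distribution level. The toy below uses made-up distributions rather than real draft/target models, but the accept-with-min(1, p/q), resample-from-the-residual rule is the standard one, and the emitted tokens follow the target distribution p exactly:

```python
import torch

torch.manual_seed(0)
p = torch.tensor([0.6, 0.3, 0.1])    # target model's next-token distribution
q = torch.tensor([0.3, 0.5, 0.2])    # cheap draft model's distribution

def speculative_step(p, q):
    tok = torch.multinomial(q, 1).item()              # draft proposes a token
    if torch.rand(1).item() < min(1.0, (p[tok] / q[tok]).item()):
        return tok                                    # accepted
    residual = torch.clamp(p - q, min=0.0)            # rejected: correct the bias
    return torch.multinomial(residual / residual.sum(), 1).item()

counts = torch.zeros(3)
for _ in range(20000):
    counts[speculative_step(p, q)] += 1
print(counts / counts.sum())                          # ~ [0.60, 0.30, 0.10]
```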