The page needs a brush-up. Bugs referenced have been fixed or worked around. Still might be useful info.
Patch Triton VAE from KJNodes - less VRAM and faster decode on 4090.
Wan Chunk FeedForward
WanVideo Mem Eff Sage Attention
Q: bernini sage+chunk ff+mem eff sage is a bit slower than just sage. is it expected?
A: only if it fits memory without them
kornia==0.7.3 works
sageattention never actually utilized its quantization to reduce VRAM, they just focused on speed, so what I’ve done is fill that gap
also without the ffn chunking it wouldn’t matter on sage, because ffn would be the peak
note that you still need to enable sage normally even with this node
GPU? 5090 Then use mxfp8
3090?
int8 (rowwise or conv rot) would be the best one to use, dunno if there’s one yet though; mxfp8 is slower than the simple fp8 scaled but has better quality on that card; fp16 with fp16 fast accumulation might just be fine too if rest of your system can handle the offloading
for portable, make sure to read this part: GH:triton-lang/triton-windows#2-python-environment. the portable is the only catch on this being an easy install
woct0rdho is the reason we even have proper triton supports on windows, trust his stuff above all else imo
woct0rdho’s triton and sage work without cuda-toolkit or msvc
maybe some exceptions like cpu compile still needs libraries, but that’s rare case
Torch/CUDA:
I’m on 2.12+cu132 and it works
and you don’t need cuda-toolkit with those, everything needed is bundled
Don’t use the --fast ComfyUI startup param - locks up VRAM
comfy-aimdo-0.4.4 available.
comfy-kitchen needs to be updated to 0.2.9; this fixes slow LTX execution with LoRAs when running quantized models
rattus made a proper stochastic rounding kernel so the lora application is faster and cheaper
fp8 considered quant?
yes
djbfilmz re RTX 6000 Pro:
when updated my pytorch (???can’t remember) and cuda (13.0) everything got slightly faster … also though… some old nodes that were outdated did break, so you have to be careful
been working on sage optimizations.. not sure how specific this is to my setup (sm89 / rtx 4090, ltx video w/ a frozen audio mask input), but sharing … was using the woct0rdho fork of sage (even tho on linux).. it has some updates over the thu-ml original.
found two things:
packaging omission, woct0rdho fork only: sm80 build gate is 8.0, 8.6, 8.7, no 8.9. so on rtx 40xx / ada, building from source silently skips _qattn_sm80 and sageattn_qk_int8_pv_fp16_cuda (the fp16 fallback) is just missing. one liner fix in setup.py and only matters if you build from source on an ada-only [RTX 40xx] box and explicitly pick the fp16 kernel. auto-dispatch picks fp8++ on sm89 anyway
cuda mask path silently drops masks - the original sage repo thu-ml lineage, every fork, inherent to all sageattention (2.x AND 3.x). the cuda kernels only understand {none, causal} masks.. so if you pass an attn_mask tensor to sageattn_qk_int8_pv_fp16_cuda / sageattn_qk_int8_pv_fp8_cuda / sageattn3_blackwell directly, it gets blackholed into kwargs and never applied. math still runs, just against unmasked scores. output is wrong in proportion to how much of kv you masked. I measured rtol [Relative Tolerance] 0.26 to 0.94 on ltx cross-attn shapes, NaN outright at very short kv. the triton kernel sageattn_qk_int8_pv_fp16_triton has proper mask plumbing and stays at rtol ~0.04.
so what actually matters in practice, at least in what i’ve been working on / researching:
normal video self-attn, no explicit mask - fine. sage still ~2.7x faster than torch flash on ltx shapes (tested)
any workflow with an explicit attn_mask - text prompt padding, audio token masks, controlnets, long-text encoders.. from what i can tell, the cuda path silently returns bad outputs if a node talks to it directly. i havent tested all of these but it would make sense they are impacted.
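To see why a dropped mask matters, here is a toy single-query attention in plain Python. It is only a sketch of the failure mode, not SageAttention code: the vectors are made up, and rel_error is a simple relative L2 error rather than the exact rtol measurement quoted above.

```python
import math

def attention(q, keys, values, mask=None):
    """Single-query softmax attention; mask[i] == False hides key i."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    if mask is not None:
        # Masked keys get -inf so they receive zero softmax weight.
        scores = [s if keep else -math.inf for s, keep in zip(scores, mask)]
    top = max(scores)  # assumes at least one key is left unmasked
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def rel_error(got, ref):
    """Relative L2 error of got against ref."""
    num = math.sqrt(sum((g - r) ** 2 for g, r in zip(got, ref)))
    den = math.sqrt(sum(r * r for r in ref))
    return num / den

# Made-up data: the last two keys stand in for padding tokens.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 5.0]]
mask = [True, True, False, False]

reference = attention(q, keys, values, mask)     # mask applied
mask_dropped = attention(q, keys, values, None)  # mask silently ignored
print("relative error when the mask is dropped:",
      rel_error(mask_dropped, reference))
```

Nothing crashes in the second call. The padding values simply leak into the output, which is the same kind of quiet wrong result described above.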
KJNodes addresses this issue at the node level. KJNodes ships two sage nodes and both solve what they’re designed to do:
PathchSageAttentionKJ (the general dropdown one): on auto it calls sage’s top-level dispatcher, which internally routes masked calls to triton, so you get the safe path by default. you only expose the underlying sage bug if you manually override to sageattn_qk_int8_pv_fp16_cuda / sageattn_qk_int8_pv_fp8_cuda / sageattn_qk_int8_pv_fp8_cuda++ / sageattn3 / sageattn3_per_block_mean … those bypass the dispatcher and talk to the broken cuda wrappers directly
LTX2MemoryEfficientSageAttentionPatch (the ltx-specific one) only patches self-attn (attn1), which doesn’t carry a mask in ltx. tuned for the fp8++ path on sm89 (the fast one). scope alone means the mask bug can’t hit it
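The routing idea can be sketched as a small guard. This is not KJNodes or SageAttention code: pick_backend and MASK_SAFE_BACKENDS are hypothetical names, and only the backend strings come from the notes above. It assumes auto and the triton kernel are the only mask-safe choices.

```python
# Hypothetical helper: not part of KJNodes or SageAttention.
# Assumption: only "auto" and the triton kernel honour an attn_mask.
MASK_SAFE_BACKENDS = {"auto", "sageattn_qk_int8_pv_fp16_triton"}
FALLBACK_BACKEND = "sageattn_qk_int8_pv_fp16_triton"

def pick_backend(requested: str, has_attn_mask: bool) -> str:
    """Return the backend to call, rerouting when a mask would be dropped."""
    if has_attn_mask and requested not in MASK_SAFE_BACKENDS:
        return FALLBACK_BACKEND
    return requested

# A masked call is rerouted; an unmasked call keeps the requested kernel.
print(pick_backend("sageattn_qk_int8_pv_fp8_cuda", has_attn_mask=True))
print(pick_backend("sageattn_qk_int8_pv_fp8_cuda", has_attn_mask=False))
```

A custom node that calls a cuda kernel directly could run a check like this first, trading some speed on masked calls for correct output.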
GH:fblissjr/SageAttention-ada fork: one-liner fix [for setup.py on RTX 40xx] + an ltx-shape regression test + a standalone repro [reproducer] for the mask bug
GH:fblissjr/SageAttention-ada:spike/nanobind-dlpack-boundary
trying out nanobind and dlpack per the comfy PR kijai shared … this is an rtx 4090/sm89/ada sage fork that ive expanded beyond sage now … gave up on triton and went cuda path but triton is needed for certain use cases… so thinking of just targeting their nightly build
As of now Torch 2.11.0 remains too new. Workarounds in Comfy/Wrapper code are not aware of it yet. OOM on decode has been reported, with 3x memory consumption.
Q: Which is a good Numpy version? I updated your wrapper and it installed 2.2.6 which broke multiple other nodes
A: Not sure about that, some older stuff just won’t work on any numpy 2 version
seedvr2 seems to be impacted by this, guess i’ll have to manually downgrade
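Before downgrading anything it can help to list what is actually installed. A minimal sketch using only the standard library; the package list is an assumption (PyPI distribution names for the packages mentioned on this page), so adjust it to your setup. Run it with the same Python that ComfyUI uses, which for the portable build is the embedded one.

```python
from importlib import metadata

# Assumed distribution names; edit to match what you actually use.
PACKAGES = ["torch", "triton", "triton-windows", "sageattention",
            "numpy", "kornia"]

def installed_versions(names):
    """Map each distribution name to its version, or None if not installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found

for name, version in installed_versions(PACKAGES).items():
    print(f"{name:16} {version or 'not installed'}")
```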
Sage 3 really doesn’t seem all that great. for now, Sage 2.1/2.2 are still the mainstream options.
Sage3 quality loss on 2.1 was way too high to be useful; fp8 fast works far better with 2.2 and is even usable, while it never was for 2.1
the reason B200 and onward are so good is that they have a giant L2 cache, so cache issues are reduced
would you recommend migrating to 2.9.1 ? yes
Comfy comes with pytorch 2.9.1 … now
Unfortunately it looks like the fix bringing fp8e4 support to RTX30xx on Windows has not been implemented on Triton mainline; issue.
Sam Hodge suggested a script to install Sage Attention on Ubuntu 24.04 with an RTX 5090 in the following manner
ENV TORCH_CUDA_ARCH_LIST='8.0,8.0+PTX,8.6,8.6+PTX,8.9,8.9+PTX'
RUN git clone https://github.com/thu-ml/SageAttention.git && \
cd SageAttention && \
git reset --hard eb615cf6cf4d221338033340ee2de1c37fbdba4a && \
# sed -i "/compute_capabilities = set()/a compute_capabilities = {\"$TORCH_CUDA_ARCH_LIST\"}" setup.py && \
EXT_PARALLEL=4 NVCC_APPEND_FLAGS="--threads 8" \
MAX_JOBS=32 uv pip install -e . --no-build-isolation --break-system-packages
Python 3.12 is probably a good idea
Possible startup arguments: comfy --here launch -- --reserve-vram 5 --max-upload-size 500 --use-sage-attention --disable-pinned-memory
Note: --async-offload can cause OOMs
Around 16-17 Oct 2025 issues were reported with latest version of Comfy and other packages. Workflows started consuming more VRAM than previously. Among workarounds suggested were
One advice was to use fp32 version of Wan 2.1 VAE safetensors file; possibly a command line option might be needed as well.
A workaround has been committed to ComfyUI.
Kijai 17 Oct 2025 evening:
torch 2.9.0 has a bug that makes some conv3d operations (when using half precision) use 3x more VRAM, including the Wan VAE; it affects 2.10 too currently; both native and wrapper have workarounds for the bug already anyway; they are different workarounds
Another source of higher VRAM usage was traced to triton compilation. It seems one particular reason was that triton, upon seeing too many errors, gave up on compiling. The other was that triton was recompiling too often. Suggestions were:
force_parameter_static_shapes to false in TorchCompileModelWanVideoV2
in the comfy/model_patcher.py file, adding @torch.compiler.disable() one line above class LowVramPatch:
run_every_op() from ops.py - this will undo the “fast cancellation” change
Kijai 18 Oct 2025:
The workarounds for the cancellation call and the torch compile disable on the problematic bit of the code are merged to comfyUI already btw
pytorch 2.8.0 was problematic so sticking with 2.7.0 was fine
2.9.0 has one problematic bit that needed workarounds for Wan VAE,
so that needs latest ComfyUI version to work
I’m on 2.10.0 dev and seems to work too
Triton 3.5 is what should have the e5 compile fix;
For Windows it’s triton-windows 3.5.0.post21 or later.
pytorch 2.8.0 works fine in Linux
and then if you happen to decide to update nvidia drivers on Linux you lose a week of your life
Q: ubuntu has nvidia drivers, doesnt it?
A: I mean latest drivers; the prebuilt stuff is fine, but usually very outdated
from distros with nvidia drivers built in I liked PopOs myself, but then I wanted some newer stuff and went with Debian-testing
SageAttention 2.2.0 latest highly recommended, gets rid of all graph breaks so torch compile works better
https://github.com/woct0rdho/triton-windows - read instructions on the page, not simply pip install sageattention -U
https://github.com/woct0rdho/SageAttention
Youtube tutorial on installing SageAttention 2.2 on Windows - unconfirmed, advice from community: 9APXcBMpbgU.
Q: How does Sage Attention compare against Flash Attention and Sdpa Attention?
A: sage >>>>>>>>>>>>>>> flash > sdpa, slight exaggeration
they all degrade quality? only Sage
“degrade quality” in this context means everything you do to differ from the reference code, like reducing steps etc. the quality loss from sage is so small in most cases that you can more than offset it from the speed gain
Flash Attention is something you have to install if you want to use the actual flash attention, the sdpa flash is different thing