Kijai’s work-in-progress workflow.
This is how the model works for I2V: the first latent is just noise, and it has to be replaced by the actual input image before decoding.
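A minimal sketch of that replacement step, assuming [B, C, T, H, W] video latents; the function and variable names are illustrative, not the actual node code:

```python
import torch

def replace_first_latent_with_image(latents: torch.Tensor,
                                    image_latent: torch.Tensor) -> torch.Tensor:
    """Overwrite the first (noise) latent frame with the encoded input image.

    latents:      [B, C, T, H, W] denoised video latents; frame 0 is noise
    image_latent: [B, C, 1, H, W] VAE-encoded input image
    """
    latents = latents.clone()
    latents[:, :, :1] = image_latent  # frame 0 now holds the real input
    return latents

# After sampling, before VAE decode (hypothetical names):
# latents = replace_first_latent_with_image(latents, vae.encode(input_image))
# video   = vae.decode(latents)
```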
Experiments with running the model in ComfyUI on consumer hardware are at an early stage. There are currently two work-in-progress implementations:
Experiments suggest it is possible to generate clips longer than 10 seconds (e.g. 15 seconds) without looping or obvious quality problems.
Some parts of the model (the norms, embeddings, etc.) need to be kept in fp32. This mixed-precision i2v model keeps those tensors in fp32 with the rest in bf16: https://huggingface.co/maybleMyers/kan/blob/main/diffusion_pytorch_model_i2v_pro_fp32_and_bf16.safetensors
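A generic sketch of the technique, not how that checkpoint was actually produced: cast the whole model to bf16, then restore fp32 on submodules matched by name (the name-keyword heuristic is an assumption):

```python
import torch
import torch.nn as nn

# Assumption: norm/embedding layers are identifiable by these name fragments.
FP32_KEYWORDS = ("norm", "embed")

def cast_mixed_precision(model: nn.Module) -> nn.Module:
    """Cast a model to bf16 while keeping norms and embeddings in fp32."""
    model.to(torch.bfloat16)
    for name, module in model.named_modules():
        if any(k in name.lower() for k in FP32_KEYWORDS):
            module.to(torch.float32)  # cast the sensitive layers back up
    return model
```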
kijai/ComfyUI-KJNodes contains a NABLA Attention KJ node: “only useful if you go 10s or high res”. The docs mention that NABLA dimensions must be divisible by 128.
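A trivial helper for picking dimensions that satisfy that constraint (it is not stated whether the divisible-by-128 rule applies to pixel or latent dimensions; this sketch just rounds up):

```python
def round_up_to_multiple(x: int, base: int = 128) -> int:
    """Round a dimension up to the nearest multiple of `base`."""
    return ((x + base - 1) // base) * base

# e.g. adjust a target resolution before enabling the NABLA node:
width, height = 1280, 704
print(round_up_to_multiple(width), round_up_to_multiple(height))  # 1280 768
```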
“Flex attention” is mentioned as a possible alternative (?) to NABLA.
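For reference, a minimal PyTorch FlexAttention example (requires PyTorch ≥ 2.5 and CUDA here). The sliding-window block mask is a generic stand-in for block-sparse attention, not NABLA’s actual masking scheme:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

WINDOW = 256  # assumption: arbitrary local-attention window

def window_mask(b, h, q_idx, kv_idx):
    # Attend only to keys within WINDOW positions of the query.
    return torch.abs(q_idx - kv_idx) <= WINDOW

B, H, S, D = 1, 8, 1024, 64
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# B=None/H=None broadcasts the mask over batch and heads.
block_mask = create_block_mask(window_mask, B=None, H=None, Q_LEN=S, KV_LEN=S)
out = flex_attention(q, k, v, block_mask=block_mask)
```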