Ovi is an open-source model that generates sound and video together. Video quality is at the 5B level. Output is 24 fps.
Weights for v1.1 capable of 10 sec generation have been uploaded to Hugging Face.
New weights can be used in Kijai’s wrapper.
Kijai has quantised the weights to fp8_e4m3fn.
Set length to 314 frames in WanVideo Empty MMAudio Latents. We end up with 241 video frames.
Note: 10 sec v1.1 model can generate 10 sec clips only. 5 sec clips can be generated via a separate .safetensors v1.1 ovi file.
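The numbers above line up with simple arithmetic: 241 frames at 24 fps is 10 seconds plus one initial frame. A minimal sketch of that relation follows. The function names are hypothetical, the `seconds * fps + 1` pattern is inferred from the single 241-frame data point, and the linear scaling of the 314 audio latent length is an assumption, not something the wrapper documents.

```python
FPS = 24  # frame rate stated in these notes


def video_frames(seconds: int, fps: int = FPS) -> int:
    """Frame count for a clip: seconds * fps plus one initial frame.

    Assumption: inferred from 241 frames for a 10 sec clip at 24 fps.
    """
    return seconds * fps + 1


def audio_latent_length(seconds: float, length_per_10s: int = 314) -> int:
    """Scale the 314-per-10-seconds figure linearly (an assumption)."""
    return round(length_per_10s * seconds / 10)


print(video_frames(10))         # 241
print(audio_latent_length(10))  # 314
```

Values for other durations are extrapolations and should be checked against the sample workflow.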
Q: SLG 11? Is it the best setting?
There is a sample workflow in Kijai’s repository. Wiring:
| Pre Embeds Node | Pre Embeds Inputs -> Output | Embeds Node | Input from Pre / Embeds Inputs -> Output | Model | WanVideo Sampler Input |
|---|---|---|---|---|---|
| - | - | - | Kijai: “5B is different as it puts the image in the noise latent”; Ovi based on WAN 5B, so.. | - | - |
| WanVideo Encode | image, mask -> samples | WanVideo Empty Embeds | extra_latents / width, height, number_of_frames -> image_embeds | Ovi | image_embeds |
| … | … | WanVideo Add Extra Latent | embeds, latent_index = 0 “can be end image too technically” | Ovi | - |
| - | - | WanVideo Empty MMAudio Latents | length -> samples | Ovi | samples |
Extra nodes: Ovi MMAudio Loader, WanVideo Decode Ovi Audio, WanVideo Ovi CFG - adds negative prompt for audio.
WanVideo EasyCache.
WanVideo Controlnet Apply - fed by WanVideo Controlnet Loader and with control_images (video) - can be chained after WanVideo Model Loader.
Kijai: “I don’t see much difference between full 50 steps and when using EasyCache, it’s pretty safe”.
Kijai: “it’s basically around 11B”.
Put MMAudio VAE model weights into either vae or mmaudio folder.
Fastwan 5B lora seems better than 5B turbo.
Kijai: WAN 2.2 VAE is only for 5B model; Ovi is based on Wan 2.2 TI2V 5B so it fits.
Ovi means “door” in Finnish.
Apparently Ovi can do anime style quite well.
Sample prompt:
The man says: <S>Hello, how do you do?<E>
The woman replies: <S>Hello, how do you do?<E>
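The sample prompt wraps each spoken line in `<S>` and `<E>` tags. A small helper for assembling such prompts could look like the sketch below. The function names are hypothetical and this is only string formatting that mirrors the sample above, not part of Ovi or the wrapper.

```python
def speech_line(lead_in: str, text: str) -> str:
    """Wrap spoken text in the <S>...<E> tags used in the sample prompt."""
    return f"{lead_in}: <S>{text}<E>"


def build_prompt(lines: list[tuple[str, str]]) -> str:
    """Join several (lead_in, spoken text) pairs, one per line."""
    return "\n".join(speech_line(lead_in, text) for lead_in, text in lines)


prompt = build_prompt([
    ("The man says", "Hello, how do you do?"),
    ("The woman replies", "Hello, how do you do?"),
])
print(prompt)
```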
Kijai’s baseline testing setup: 50 steps with easycache skipping 14 steps, cfg 4.0.
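As a rough reading of that baseline: skipping 14 of 50 steps leaves 36 steps that are fully computed. The sketch below just does that arithmetic. The dictionary keys are hypothetical labels, not the actual node widget names, and treating a skipped step as free is an assumption.

```python
# Baseline from the notes; key names are illustrative only.
baseline = {"steps": 50, "easycache_skipped_steps": 14, "cfg": 4.0}

# Steps that still run the full model, assuming skipped steps cost nothing.
computed_steps = baseline["steps"] - baseline["easycache_skipped_steps"]
skipped_fraction = baseline["easycache_skipped_steps"] / baseline["steps"]

print(computed_steps)             # 36
print(f"{skipped_fraction:.0%}")  # 28%
```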
Adding zero star seems to change audio slightly (whatever that means).
Possible workflow: put Ovi render into the InfiniteTalk workflow with the audio from the Ovi render and a low denoise in 3 steps for an upscale.
720 might be a good resolution for Ovi.
Ovi has been successfully used with depth control controlnet trained for Wan 2.2 TI2V 5B. It is likely HED and Canny controlnets will also work.

Initially required 24 GB VRAM to run in fp8 and 32 GB VRAM to run in fp16.
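For a sense of scale, the weights alone can be estimated from the "around 11B" parameter figure quoted above. This is a back-of-the-envelope sketch with a hypothetical function name. It counts only weight storage (1 byte per parameter for fp8, 2 for fp16) and ignores activations, the VAEs, the text encoder and any offloading, so it does not reproduce the VRAM requirements listed.

```python
PARAMS = 11e9  # "basically around 11B" per the notes; approximate

BYTES_PER_PARAM = {"fp8": 1, "fp16": 2}


def weight_gb(params: float, precision: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    return params * BYTES_PER_PARAM[precision] / 1e9


for precision in ("fp8", "fp16"):
    print(precision, weight_gb(PARAMS, precision))
# fp8 11.0
# fp16 22.0
```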
Githubs: