AI videos with ComfyUI and HunyuanVideo model

Sometimes it’s really unbelievable how fast that AI stuff progresses. A few weeks ago a new AI video model called HunyuanVideo was released. The Internetz was pretty amazed about that model. Then you read the minimum requirements: The minimum GPU memory required is 60GB. Yeah, sure, everybody has such a card at home 😄 So I forgot about that one. Just a few weeks later end of 2024 ComfyUI announced:

With the release of ComfyUI v0.3.10, we are excited to share that it is now possible to run the HunyuanVideo model on GPUs with only 8GB VRAM.

Wow! Well, I don’t know about 8GByte of VRAM but I can confirm it works with my NVIDIA 4080 SUPER with 16 GByte of VRAM 🎉 This ComfyUI community is really amazing 💪 So lets get started!

I assume you’ve ComfyUI already installed and configured. If not please read my blog post about ComfyUI - private and locally hosted AI image generation.

Also make sure that you have the latest ComfyUI version installed. If you’ve ComfyUI-Manager installed as suggested in one of the previous blog posts it’s just a click on the Manager button on the top right, Update ComfyUI and then Restart.

To make HunyuanVideo work a few files need to be downloaded.

Download the workflow and store the JSON file somewhere temporary.

comfyui-video-hunyuan-workflow-01

Select Workflow (1) and Open (2). Open the workflow file hunyuan_video_text_to_video.json.

comfyui-video-hunyuan-workflow-02

Change

(1) tile_size to 128
(2) overlap to 32
(3) temporal_size to 32
(4) temporal_overlap to 4

Open the hunyuan_video_t2v_720p_bf16.safetensors model webpage and download the file hunyuan_video_t2v_720p_bf16.safetensors. That’s around around 26 GByte. Store the file in $HOME/comfyui/ComfyUI/models/diffusion_models.

Open the clip_l.safetensors webpage and download clip_l.safetensors file. That’s around 246 MByte. Store the file in $HOME/comfyui/ComfyUI/models/clip.

Open the llava_llama3_fp8_scaled.safetensors webpage and download llava_llama3_fp8_scaled.safetensors file. That’s around 9 GByte. Store the file in $HOME/comfyui/ComfyUI/models/text_encoders.

Open the hunyuan_video_vae_bf16.safetensors webpage and download hunyuan_video_vae_bf16.safetensors file. That’s around 500 MByte. Store the file in $HOME/comfyui/ComfyUI/models/vae.

Now use the ComfyUI-Manager to restart ComfyUI. Afterwards reload ComfyUI webpage in the browser. Next a few settings need to be checked:

comfyui-video-hunyuan-workflow-03

(1) unet_name needs to be set to hunyuan_video_t2v_720p_bf16.safetensors
(2) weight_dtype needs to be set to fp8_e4m3fn_fast
(3) clip_name1 needs to be clip_l.safetensors
(4) clip_name2 needs to be llava_llama3_fp8_scaled.safetensors
(5) vae_name needs to be hunyuan_video_vae_bf16.safetensors

For now just keep the rest of the rest parameters and also the video prompt in CLIP Text Encode (Positive Prompt). There are three more parameters you can play around later in EmptyHunyuanLatentVideo node:

(6) width is the width of the video
(7) height is the height of the video
(8) length is the length of the video in frames. So with fps (frames per second) set to 24 in SaveAnimatedWEBP node the resulting video is around 3 seconds. The greater this value is the more VRAM is needed. With my 16 GByte VRAM I was also able to set this value to 145 frames. I didn’t try other values so far. The interesting thing about this parameter is if you set it to 1 the model/workflow is also able to create images 😃 So it’s perfectly possible to create still images as with FLUX.1 model with this video model too!

So with everything setup lets press Queue to generate the video. If you used the same settings it takes around 3-5 minutes to generate the video on my 4080 SUPER. If you’ve a different graphics card the time varies of course. And for me the result looked like this:

comfyui-video-hunyuan-result-03

Next I tried a different prompt (keeping everything else as is). But I forgot the prompt 😄 But it was something like this:

A fly-by of a spacecraft at planet Saturn.

And it looked like this:

comfyui-video-hunyuan-result-01

That’s pretty impressive…

Again with the same settings just a different prompt:

A fly-by of spaceship Galactica at planet Saturn. The planet should be kept centered and should be fully visible.

comfyui-video-hunyuan-result-04

Looks like the model haven’t watched Battlestar Galactica 😄

This is a fun one 😉 In this case I used 145 frames (length parameter):

Inside a space station, the camera tracks a duck walking through a room filled with computer monitors and a big screen showing a planet outside the space station. The monitors showing some alerts.

comfyui-video-hunyuan-result-05

And there’re indeed a lot of monitors 😄 But I’ve the impression the more frames you create the better the video gets esp. for longer prompts.

And my usual test… 😉 In my previous blog posts I asked the Llama 3.2-vision:11b model to describe what’s on a book cover I provided. The answer was like this:

The picture features a number of blue spheres floating above water or another reflective surface, with some spheres casting shadows onto the surface below.

The largest sphere is centered and sits atop what appears to be a dark reflection that extends outward from its base, resembling a shadow. It reflects light in various directions, creating different shades of blue across its surface.

Surrounding this are four smaller spheres, arranged in a square formation with one in each corner. Each one casts a shadow onto the reflective surface below and also reflects light in various directions.

The image is framed by an off-white border that adds some contrast to the rest of the picture. The background color is a medium blue.

I couldn’t resist to ask the Hunyuan video model to create a video out of that description 😄 So after DALL-E, Juggernaut XL and FLUX.1-dev lets see if that description not only works for images but also for a video. To describe what motion I want to have for the video I only added the following text at the end of the description above:

The smaller spheres should turn around the largest, centered sphere.

And that’s the result:

comfyui-video-hunyuan-result-02

Well, that’s somewhat amazing 😎 And that’s also it for today. I guess someone with more knowledge can create really impressive videos. I can also imagine if one creates a few smaller clips and manages it to get some consistent style throughout the videos you can create some nice little videos.

Happy video creation!