🚀 Diffusers
Diffusers is a library that provides tools for diffusion models. AnimateDiff in this library enables video generation from pre-existing Stable Diffusion text-to-image models: by inserting trained motion modules, a static text-to-image model becomes a text-to-video generator.
✨ Features
- Video Generation from Text-to-Image Models: AnimateDiff creates videos by inserting motion module layers into a frozen text-to-image model. These modules are trained on video clips to learn a motion prior, which yields coherent motion across frames.
- SparseControlNetModel: An implementation of ControlNet for AnimateDiff that supports controlled generation in text-to-video diffusion models.
- MotionAdapter and UNetMotionModel: Convenience classes for using motion modules with existing Stable Diffusion models (see the sketch after this list).
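To illustrate the MotionAdapter workflow from the list above, here is a minimal text-to-video sketch without SparseCtrl. It reuses the checkpoint IDs from the Quick Start below; the prompt and output filename are arbitrary examples, and any Stable Diffusion 1.5 checkpoint should work as the base model.
import torch
from diffusers import AnimateDiffPipeline
from diffusers.models import AutoencoderKL, MotionAdapter
from diffusers.utils import export_to_gif

# The motion adapter holds the pretrained motion module weights.
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-3", torch_dtype=torch.float16)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)

# Plug the adapter into an ordinary Stable Diffusion text-to-image checkpoint.
pipe = AnimateDiffPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",
    motion_adapter=adapter,
    vae=vae,
    torch_dtype=torch.float16,
).to("cuda")

frames = pipe(
    prompt="a rocket launching into space, cinematic lighting",
    negative_prompt="low quality, worst quality",
    num_frames=16,
    num_inference_steps=25,
    generator=torch.Generator().manual_seed(42),
).frames[0]
export_to_gif(frames, "rocket.gif")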
🚀 Quick Start
The following example shows how to use the motion modules and sparse ControlNet with an existing Stable Diffusion text-to-image model:
Basic Usage
import torch
from diffusers import AnimateDiffSparseControlNetPipeline
from diffusers.models import AutoencoderKL, MotionAdapter, SparseControlNetModel
from diffusers.schedulers import DPMSolverMultistepScheduler
from diffusers.utils import export_to_gif, load_image
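# Hub checkpoint IDs: a Stable Diffusion 1.5 base model, the AnimateDiff motion adapter,
# the SparseCtrl RGB ControlNet, a MotionLoRA, and a fine-tuned VAE.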
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
motion_adapter_id = "guoyww/animatediff-motion-adapter-v1-5-3"
controlnet_id = "guoyww/animatediff-sparsectrl-rgb"
lora_adapter_id = "guoyww/animatediff-motion-lora-v1-5-3"
vae_id = "stabilityai/sd-vae-ft-mse"
device = "cuda"
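# Load the motion adapter, SparseCtrl ControlNet, and VAE in half precision.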
motion_adapter = MotionAdapter.from_pretrained(motion_adapter_id, torch_dtype=torch.float16).to(device)
controlnet = SparseControlNetModel.from_pretrained(controlnet_id, torch_dtype=torch.float16).to(device)
vae = AutoencoderKL.from_pretrained(vae_id, torch_dtype=torch.float16).to(device)
scheduler = DPMSolverMultistepScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    beta_schedule="linear",
    algorithm_type="dpmsolver++",
    use_karras_sigmas=True,
)
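# Build the AnimateDiff SparseCtrl pipeline on top of the frozen text-to-image model.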
pipe = AnimateDiffSparseControlNetPipeline.from_pretrained(
    model_id,
    motion_adapter=motion_adapter,
    controlnet=controlnet,
    vae=vae,
    scheduler=scheduler,
    torch_dtype=torch.float16,
).to(device)
pipe.load_lora_weights(lora_adapter_id, adapter_name="motion_lora")
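# RGB image used as the sparse conditioning frame.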
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-firework.png")
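# Only frame index 0 is conditioned on the image; the motion modules animate the remaining frames.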
video = pipe(
    prompt="closeup face photo of man in black clothes, night city street, bokeh, fireworks in background",
    negative_prompt="low quality, worst quality",
    num_inference_steps=25,
    conditioning_frames=image,
    controlnet_frame_indices=[0],
    controlnet_conditioning_scale=1.0,
    generator=torch.Generator().manual_seed(42),
).frames[0]
export_to_gif(video, "output.gif")
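On GPUs with limited memory, the usual Diffusers memory helpers, such as pipe.enable_model_cpu_offload() and pipe.enable_vae_slicing(), should also work with this pipeline at the cost of some speed.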
📚 Documentation
AnimateDiff achieves video creation by inserting motion module layers into a frozen text-to-image model. These motion modules are placed after the ResNet and Attention blocks in the Stable Diffusion UNet. Their main function is to introduce coherent motion across image frames.
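To make this concrete, the snippet below is a minimal sketch (checkpoint IDs reused from the Quick Start) of how a frozen Stable Diffusion UNet can be expanded into a UNetMotionModel: the 2D weights are kept as-is and the adapter's motion layers are inserted around them.
import torch
from diffusers.models import MotionAdapter, UNet2DConditionModel, UNetMotionModel

# Load the frozen text-to-image UNet and a pretrained motion adapter.
unet2d = UNet2DConditionModel.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE", subfolder="unet", torch_dtype=torch.float16
)
motion_adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-3", torch_dtype=torch.float16
)

# Copy the 2D weights and insert the adapter's motion modules after the
# ResNet/Attention blocks, yielding a UNet that operates on frame sequences.
unet_motion = UNetMotionModel.from_unet2d(unet2d, motion_adapter=motion_adapter)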
SparseControlNetModel is an implementation of ControlNet for AnimateDiff. ControlNet was first introduced in Adding Conditional Control to Text-to-Image Diffusion Models by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. The SparseCtrl version of ControlNet was introduced in SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models for achieving controlled generation in text-to-video diffusion models.
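The "sparse" part means conditioning images are only needed for a few frame indices rather than every frame. Reusing the pipe object (and torch import) from the Quick Start, conditioning on two keyframes might look like the sketch below; the keyframe file names are hypothetical placeholders.
from diffusers.utils import load_image

# Two keyframes (hypothetical local files) condition the first and last frames;
# the motion modules fill in plausible motion for everything in between.
keyframes = [load_image("first_keyframe.png"), load_image("last_keyframe.png")]

video = pipe(
    prompt="closeup face photo of man in black clothes, night city street, bokeh, fireworks in background",
    negative_prompt="low quality, worst quality",
    num_inference_steps=25,
    num_frames=16,
    conditioning_frames=keyframes,
    controlnet_frame_indices=[0, 15],  # only these frame indices receive conditioning
    controlnet_conditioning_scale=1.0,
    generator=torch.Generator().manual_seed(42),
).frames[0]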
The following table shows a comparison of the input and output in the example:
| Property | Details |
| --- | --- |
| Input Image | The firework photo loaded with load_image and used as the single RGB conditioning frame. |
| Output GIF | The generated animation exported with export_to_gif as output.gif. |