🚀 Diffusers
Diffusers is a library that provides tools for diffusion models. AnimateDiff in this library enables video generation from pre-existing Stable Diffusion text-to-image models: by inserting trained motion modules, a static text-to-image model becomes a text-to-video generator.
✨ Features
- Video Generation from Text-to-Image Models: AnimateDiff creates videos by inserting motion module layers into a frozen text-to-image model. These modules are trained on video clips to learn a motion prior, which yields coherent motion across frames.
- SparseControlNetModel: An implementation of ControlNet for AnimateDiff that supports controlled generation in text-to-video diffusion models.
- MotionAdapter and UNetMotionModel: Convenience classes for using motion modules with existing Stable Diffusion models (see the sketch after this list).
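To illustrate the MotionAdapter workflow from the list above, here is a minimal text-to-video sketch without SparseCtrl. It reuses the checkpoint IDs from the Quick Start below; the prompt and output filename are arbitrary examples, and any Stable Diffusion 1.5 checkpoint should work as the base model.
import torch
from diffusers import AnimateDiffPipeline
from diffusers.models import AutoencoderKL, MotionAdapter
from diffusers.utils import export_to_gif

# The motion adapter holds the pretrained motion module weights.
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-3", torch_dtype=torch.float16)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)

# Plug the adapter into an ordinary Stable Diffusion text-to-image checkpoint.
pipe = AnimateDiffPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",
    motion_adapter=adapter,
    vae=vae,
    torch_dtype=torch.float16,
).to("cuda")

frames = pipe(
    prompt="a rocket launching into space, cinematic lighting",
    negative_prompt="low quality, worst quality",
    num_frames=16,
    num_inference_steps=25,
    generator=torch.Generator().manual_seed(42),
).frames[0]
export_to_gif(frames, "rocket.gif")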
🚀 Quick Start
The following example shows how to use the motion modules and sparse ControlNet with an existing Stable Diffusion text-to-image model:
Basic Usage
import torch
from diffusers import AnimateDiffSparseControlNetPipeline
from diffusers.models import AutoencoderKL, MotionAdapter, SparseControlNetModel
from diffusers.schedulers import DPMSolverMultistepScheduler
from diffusers.utils import export_to_gif, load_image
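# Hub checkpoint IDs: a Stable Diffusion 1.5 base model, the AnimateDiff motion adapter,
# the SparseCtrl RGB ControlNet, a MotionLoRA, and a fine-tuned VAE.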
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
motion_adapter_id = "guoyww/animatediff-motion-adapter-v1-5-3"
controlnet_id = "guoyww/animatediff-sparsectrl-rgb"
lora_adapter_id = "guoyww/animatediff-motion-lora-v1-5-3"
vae_id = "stabilityai/sd-vae-ft-mse"
device = "cuda"
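# Load the motion adapter, SparseCtrl ControlNet, and VAE in half precision.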
motion_adapter = MotionAdapter.from_pretrained(motion_adapter_id, torch_dtype=torch.float16).to(device)
controlnet = SparseControlNetModel.from_pretrained(controlnet_id, torch_dtype=torch.float16).to(device)
vae = AutoencoderKL.from_pretrained(vae_id, torch_dtype=torch.float16).to(device)
scheduler = DPMSolverMultistepScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    beta_schedule="linear",
    algorithm_type="dpmsolver++",
    use_karras_sigmas=True,
)
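# Build the AnimateDiff SparseCtrl pipeline on top of the frozen text-to-image model.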
pipe = AnimateDiffSparseControlNetPipeline.from_pretrained(
    model_id,
    motion_adapter=motion_adapter,
    controlnet=controlnet,
    vae=vae,
    scheduler=scheduler,
    torch_dtype=torch.float16,
).to(device)
pipe.load_lora_weights(lora_adapter_id, adapter_name="motion_lora")
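# RGB image used as the sparse conditioning frame.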
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-firework.png")
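# Only frame index 0 is conditioned on the image; the motion modules animate the remaining frames.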
video = pipe(
    prompt="closeup face photo of man in black clothes, night city street, bokeh, fireworks in background",
    negative_prompt="low quality, worst quality",
    num_inference_steps=25,
    conditioning_frames=image,
    controlnet_frame_indices=[0],
    controlnet_conditioning_scale=1.0,
    generator=torch.Generator().manual_seed(42),
).frames[0]
export_to_gif(video, "output.gif")
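On GPUs with limited memory, the usual Diffusers memory helpers, such as pipe.enable_model_cpu_offload() and pipe.enable_vae_slicing(), should also work with this pipeline at the cost of some speed.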
📚 Documentation
AnimateDiff achieves video creation by inserting motion module layers into a frozen text-to-image model. These motion modules are placed after the ResNet and Attention blocks in the Stable Diffusion UNet. Their main function is to introduce coherent motion across image frames.
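To make this concrete, the snippet below is a minimal sketch (checkpoint IDs reused from the Quick Start) of how a frozen Stable Diffusion UNet can be expanded into a UNetMotionModel: the 2D weights are kept as-is and the adapter's motion layers are inserted around them.
import torch
from diffusers.models import MotionAdapter, UNet2DConditionModel, UNetMotionModel

# Load the frozen text-to-image UNet and a pretrained motion adapter.
unet2d = UNet2DConditionModel.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE", subfolder="unet", torch_dtype=torch.float16
)
motion_adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-3", torch_dtype=torch.float16
)

# Copy the 2D weights and insert the adapter's motion modules after the
# ResNet/Attention blocks, yielding a UNet that operates on frame sequences.
unet_motion = UNetMotionModel.from_unet2d(unet2d, motion_adapter=motion_adapter)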
SparseControlNetModel is an implementation of ControlNet for AnimateDiff. ControlNet was first introduced in Adding Conditional Control to Text-to-Image Diffusion Models by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. The SparseCtrl version of ControlNet was introduced in SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models for achieving controlled generation in text-to-video diffusion models.
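The "sparse" part means conditioning images are only needed for a few frame indices rather than every frame. Reusing the pipe object (and torch import) from the Quick Start, conditioning on two keyframes might look like the sketch below; the keyframe file names are hypothetical placeholders.
from diffusers.utils import load_image

# Two keyframes (hypothetical local files) condition the first and last frames;
# the motion modules fill in plausible motion for everything in between.
keyframes = [load_image("first_keyframe.png"), load_image("last_keyframe.png")]

video = pipe(
    prompt="closeup face photo of man in black clothes, night city street, bokeh, fireworks in background",
    negative_prompt="low quality, worst quality",
    num_inference_steps=25,
    num_frames=16,
    conditioning_frames=keyframes,
    controlnet_frame_indices=[0, 15],  # only these frame indices receive conditioning
    controlnet_conditioning_scale=1.0,
    generator=torch.Generator().manual_seed(42),
).frames[0]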
The following table shows a comparison of the input and output in the example:
| Property | Details |
| --- | --- |
| Input Image | The firework photo loaded with load_image and used as the single RGB conditioning frame. |
| Output GIF | The generated animation exported with export_to_gif as output.gif. |