VidToMe Open-Source Video Editing Solution - Zero-Shot Operation, Improved Coherence, Lower Memory Use

VidToMe

Developed by jadechoghari

A zero-shot video editing solution based on diffusion models, improving temporal coherence and reducing memory consumption by merging self-attention tokens across video frames.

Text-to-Video | Open Source License: MIT | #Zero-shot Video Editing #Cross-frame Token Merging #Self-attention Optimization

Downloads 15

Release Time: 10/7/2024

Model Overview

VidToMe is a video editing technique that requires no model fine-tuning. It achieves harmonious video generation and editing through cross-frame alignment and redundant token compression, ensuring smooth transitions and coherent output.

Model Features

Zero-shot Editing

Directly edit video content via natural language prompts without model fine-tuning.

Cross-frame Token Merging

Significantly enhances temporal coherence by merging self-attention tokens across video frames.

Memory Optimization

Reduces memory consumption by compressing redundant tokens, which makes the method suitable for long videos and complex scenes.
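
As a rough illustration of why fewer tokens means less memory: the self-attention score matrix has one entry per pair of tokens, so its size grows quadratically with the token count. The sketch below is back-of-the-envelope arithmetic, not a measurement of VidToMe. The function names, head count, token count, and merge ratio are all hypothetical values chosen for the example, and it assumes the score matrix is materialized in full.

```python
def attention_matrix_bytes(num_tokens: int, num_heads: int = 8,
                           bytes_per_value: int = 2) -> int:
    """Bytes needed for the (tokens x tokens) score matrix, all heads, one layer."""
    return num_heads * num_tokens * num_tokens * bytes_per_value


def tokens_after_merging(num_tokens: int, merge_ratio: float) -> int:
    """Tokens left after merging away a fraction `merge_ratio` of them."""
    if not 0.0 <= merge_ratio < 1.0:
        raise ValueError("merge_ratio must be in [0, 1)")
    return num_tokens - int(num_tokens * merge_ratio)


# Hypothetical numbers: 16,384 tokens attended together, half merged away.
tokens = 16384
kept = tokens_after_merging(tokens, merge_ratio=0.5)

before = attention_matrix_bytes(tokens)  # fp16 values, 8 heads (assumed)
after = attention_matrix_bytes(kept)

print(f"{before / 2**30:.2f} GiB -> {after / 2**30:.2f} GiB")
# prints "4.00 GiB -> 1.00 GiB"
```

Halving the token count cuts the score matrix to a quarter of its size, which is why merging redundant tokens pays off more as videos get longer.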

Model Capabilities

Video Style Transfer

Prompt-based Video Editing

Temporal Coherence Optimization

Use Cases

Content Creation

Video Style Transfer

Convert original videos into different styles (e.g., origami style) via natural language prompts.

Achieves artistic style transformation while preserving the original content structure.

Film Production

Special Effects Editing

Add or modify elements in videos without complex post-processing.

Significantly lowers the technical barrier for professional video editing.

🚀 VidToMe: Video Token Merging for Zero-Shot Video Editing

Edit videos instantly with just a prompt! This diffusion-based pipeline enhances temporal consistency and reduces memory usage for seamless zero-shot video editing.

🚀 Quick Start

This Diffusers implementation of VidToMe is a diffusion-based pipeline for zero-shot video editing. It improves temporal consistency and reduces memory usage by merging self-attention tokens across video frames, so videos can be generated and edited without model fine-tuning. Aligning and compressing redundant tokens across frames keeps transitions smooth and the output coherent, which the project claims outperforms traditional video editing methods. It is based on the paper "VidToMe: Video Token Merging for Zero-Shot Video Editing".
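
To make the token-merging idea concrete, here is a toy sketch that merges tokens from one frame into their most similar tokens in another frame. It is an illustration in the spirit of token merging, not the VidToMe implementation: the function names, the cosine-similarity matching, the unweighted averaging, and the tiny 2-D tokens are all assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_into_target(src_tokens, dst_tokens, num_merge):
    """Merge the `num_merge` most redundant source tokens into target tokens.

    Returns (merged_dst, kept_src): the target tokens after averaging in
    their matched source tokens, and the source tokens left unmerged.
    """
    # Find the best-matching target token for every source token.
    matches = []
    for i, s in enumerate(src_tokens):
        j, sim = max(
            ((j, cosine(s, d)) for j, d in enumerate(dst_tokens)),
            key=lambda pair: pair[1],
        )
        matches.append((sim, i, j))

    # Only the most similar (most redundant) source tokens get merged.
    matches.sort(reverse=True)
    to_merge = matches[:num_merge]
    merged_ids = {i for _, i, _ in to_merge}

    # Average each target token with the source tokens assigned to it.
    sums = [list(d) for d in dst_tokens]
    counts = [1] * len(dst_tokens)
    for _, i, j in to_merge:
        sums[j] = [a + b for a, b in zip(sums[j], src_tokens[i])]
        counts[j] += 1
    merged_dst = [[v / c for v in s] for s, c in zip(sums, counts)]
    kept_src = [s for i, s in enumerate(src_tokens) if i not in merged_ids]
    return merged_dst, kept_src


frame_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # source frame tokens
frame_b = [[0.9, 0.1], [0.1, 0.9]]              # target frame tokens
merged_b, kept_a = merge_into_target(frame_a, frame_b, num_merge=2)

print(len(frame_a) + len(frame_b), "->", len(merged_b) + len(kept_a))
# prints "5 -> 3"
```

Attention then runs over the 3 remaining tokens instead of the original 5. Token-merging methods typically copy the merged results back to the original token positions afterwards so the output keeps its shape; that step is omitted here.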

💻 Usage Examples

Basic Usage

from diffusers import DiffusionPipeline

# load the pretrained model
pipeline = DiffusionPipeline.from_pretrained(
    "jadechoghari/VidToMe", 
    trust_remote_code=True, 
    custom_pipeline="jadechoghari/VidToMe", 
    sd_version="depth", 
    device="cuda", 
    float_precision="fp16"
)

# set prompts for inversion and generation
inversion_prompt = "flamingos standing in the water near a tree."
generation_prompt = {"origami": "rainbow-colored origami flamingos standing in the water near a tree."}

# additional control and parameters
control_type = "none"  # no extra control; use "depth" if needed
negative_prompt = ""   # defined here but not passed to the call below

# run the video editing pipeline
generated_images = pipeline(
    video_path="path/to/video.mp4",   # path to the input video
    video_prompt=inversion_prompt,    # inversion prompt
    edit_prompt=generation_prompt,    # edit prompt for generation
    control_type=control_type         # control type (e.g., "none", "depth")
)

Note: For more control, consider creating a configuration file and following the instructions in the GitHub repository.

✨ Features

Zero-shot video editing for content creators
Video transformation using natural language prompts
Memory-optimized video generation for longer or complex sequences

📄 License

This project is licensed under the MIT license.

Model Authors:

Xirui Li
Chao Ma
Xiaokang Yang
Ming-Hsuan Yang

For more details, see the GitHub repository.
