Mochi 1
A state-of-the-art video generation model by Genmo, enabling high-fidelity video creation from text prompts.
Blog | Hugging Face | Playground | Careers

Quick Start
Mochi 1 preview is an open state-of-the-art video generation model. It shows high-fidelity motion and strong prompt adherence in preliminary evaluation, significantly narrowing the gap between closed and open video generation systems. The model is released under the permissive Apache 2.0 license. You can try it for free on our playground.
Installation
Install using uv:
git clone https://github.com/genmoai/models
cd models
pip install uv
uv venv .venv
source .venv/bin/activate
uv pip install setuptools
uv pip install -e . --no-build-isolation
If you want to install flash attention, you can use:
uv pip install -e .[flash] --no-build-isolation
You also need to install FFMPEG to convert your outputs into videos.
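If you are unsure whether FFMPEG is visible from your environment, here is a quick sanity check using only the Python standard library (illustrative, not part of this repository):

```python
import shutil

# Fail early if no ffmpeg binary is on PATH; video export will not work without it.
if shutil.which("ffmpeg") is None:
    raise RuntimeError("ffmpeg not found; install it with your system package manager.")
print("ffmpeg found at", shutil.which("ffmpeg"))
```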
Download Weights
Use download_weights.py to download the model + decoder to a local directory:
python3 ./scripts/download_weights.py <path_to_downloaded_directory>
Or, directly download the weights from Hugging Face or via magnet:?xt=urn:btih:441da1af7a16bcaa4f556964f8028d7113d21cbb&dn=weights&tr=udp://tracker.opentrackr.org:1337/announce
to a folder on your computer.
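If you prefer to script the Hugging Face download, a minimal sketch using the huggingface_hub library is shown below. It assumes the genmo/mochi-1-preview repository referenced in the Diffusers examples later in this README and is not part of this repository's tooling:

```python
from huggingface_hub import snapshot_download

# Fetch the published weights into a local folder (the files are tens of gigabytes).
snapshot_download(
    repo_id="genmo/mochi-1-preview",
    local_dir="<path_to_downloaded_directory>",
)
```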
Running
Start the Gradio UI:
python3 ./demos/gradio_ui.py --model_dir "<path_to_downloaded_directory>"
Generate videos from the CLI:
python3 ./demos/cli.py --model_dir "<path_to_downloaded_directory>"
Replace <path_to_downloaded_directory> with the path to your model directory.
API
This repository provides a simple, composable API for calling the model programmatically, which yields the highest-quality results. You can find a full example here; below is a rough example:
from genmo.mochi_preview.pipelines import (
    DecoderModelFactory,
    DitModelFactory,
    MochiSingleGPUPipeline,
    T5ModelFactory,
    linear_quadratic_schedule,
)

# Directory containing the downloaded weights
MOCHI_DIR = "<path_to_downloaded_directory>"

pipeline = MochiSingleGPUPipeline(
    text_encoder_factory=T5ModelFactory(),
    dit_factory=DitModelFactory(
        model_path=f"{MOCHI_DIR}/dit.safetensors", model_dtype="bf16"
    ),
    decoder_factory=DecoderModelFactory(
        model_path=f"{MOCHI_DIR}/vae.safetensors",
    ),
    cpu_offload=True,
    decode_type="tiled_full",
)

video = pipeline(
    height=480,
    width=848,
    num_frames=31,
    num_inference_steps=64,
    sigma_schedule=linear_quadratic_schedule(64, 0.025),
    cfg_schedule=[4.5] * 64,
    batch_cfg=False,
    prompt="your favorite prompt here ...",
    negative_prompt="",
    seed=12345,
)
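How you persist the result is up to you. As one option, assuming the pipeline returns the generated frames as a sequence of arrays with values in [0, 1] (an assumption; see the full example for the repository's own saving code), you could reuse the export_to_video helper from Diffusers:

```python
from diffusers.utils import export_to_video

# Assumes `video` is a sequence of frames in [0, 1]; adjust to the actual return type.
export_to_video(video, "mochi.mp4", fps=30)
```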
Running with Diffusers
Install Diffusers
pip install git+https://github.com/huggingface/diffusers.git
High-quality example (requires 42GB VRAM)
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview")

# Enable memory savings
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k."
with torch.autocast("cuda", torch.bfloat16, cache_enabled=False):
    frames = pipe(prompt, num_frames=84).frames[0]

export_to_video(frames, "mochi.mp4", fps=30)
Lower-precision example (requires 22GB VRAM)
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", variant="bf16", torch_dtype=torch.bfloat16)

# Enable memory savings
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k."
frames = pipe(prompt, num_frames=84).frames[0]
export_to_video(frames, "mochi.mp4", fps=30)
For more details, check out the Diffusers documentation.
Technical Details
Model Architecture
Mochi 1 is a major advancement in open-source video generation, featuring a 10-billion-parameter diffusion model built on the novel Asymmetric Diffusion Transformer (AsymmDiT) architecture. Trained entirely from scratch, it is the largest openly released video generative model. The architecture is simple and hackable, and an inference harness with an efficient context-parallel implementation is released alongside it.
Alongside Mochi, the video AsymmVAE is also open-sourced. It uses an asymmetric encoder-decoder structure to build an efficient, high-quality compression model. The AsymmVAE compresses videos to a 128x smaller size, with 8x8 spatial and 6x temporal compression to a 12-channel latent space.
AsymmVAE Model Specs
| Property | Details |
|----------|---------|
| Params Count | 362M |
| Enc Base Channels | 64 |
| Dec Base Channels | 128 |
| Latent Dim | 12 |
| Spatial Compression | 8x8 |
| Temporal Compression | 6x |
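As a back-of-the-envelope illustration of what these compression factors mean for tensor shapes (not the actual VAE code), here is how the 480x848, 84-frame clip from the examples above maps to a 12-channel latent:

```python
# Illustrative shape arithmetic only; the real AsymmVAE handles temporal
# boundaries and padding internally.
frames, height, width = 84, 480, 848     # clip size from the examples above
latent_channels = 12
latent_frames = frames // 6              # 6x temporal compression
latent_height = height // 8              # 8x8 spatial compression
latent_width = width // 8

print(latent_frames, latent_height, latent_width, latent_channels)  # 14 60 106 12
```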
The AsymmDiT efficiently processes user prompts and compressed video tokens by streamlining text processing and focusing neural network capacity on visual reasoning. It jointly attends to text and visual tokens with multimodal self-attention and learns separate MLP layers for each modality, similar to Stable Diffusion 3. The visual stream has nearly 4 times as many parameters as the text stream via a larger hidden dimension. Non-square QKV and output projection layers are used to unify the modalities in self-attention, reducing inference memory requirements.
In contrast to many modern diffusion models that use multiple pretrained language models, Mochi 1 simply encodes prompts with a single T5-XXL language model.
AsymmDiT Model Specs
| Property | Details |
|----------|---------|
| Params Count | 10B |
| Num Layers | 48 |
| Num Heads | 24 |
| Visual Dim | 3072 |
| Text Dim | 1536 |
| Visual Tokens | 44520 |
| Text Tokens | 256 |
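Below is a minimal, illustrative sketch (assumed shapes, not the actual AsymmDiT code) of how non-square QKV projections let the wider visual stream and the narrower text stream share a single self-attention, using the dimensions from the table above but far fewer tokens:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

visual_dim, text_dim, num_heads = 3072, 1536, 24
head_dim = visual_dim // num_heads                 # width of the joint attention space

# Square projection for the larger visual stream, non-square for the text stream.
qkv_visual = nn.Linear(visual_dim, 3 * visual_dim)
qkv_text = nn.Linear(text_dim, 3 * visual_dim)

visual_tokens = torch.randn(1, 512, visual_dim)    # real model uses 44520 visual tokens
text_tokens = torch.randn(1, 256, text_dim)        # 256 text tokens

def split_qkv(x):
    # (B, N, 3*D) -> three tensors of shape (B, heads, N, head_dim)
    b, n, _ = x.shape
    return x.view(b, n, 3, num_heads, head_dim).permute(2, 0, 3, 1, 4)

q_v, k_v, v_v = split_qkv(qkv_visual(visual_tokens))
q_t, k_t, v_t = split_qkv(qkv_text(text_tokens))

# Joint self-attention over the concatenated visual + text sequence.
q = torch.cat([q_v, q_t], dim=2)
k = torch.cat([k_v, k_t], dim=2)
v = torch.cat([v_v, v_t], dim=2)
out = F.scaled_dot_product_attention(q, k, v)      # (1, 24, 768, 128)
```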
Hardware Requirements
The repository supports both multi-GPU and single-GPU operation. Running on a single GPU requires approximately 60GB of VRAM. ComfyUI can optimize Mochi to run on less than 20GB of VRAM, but this implementation prioritizes flexibility over memory efficiency. Using at least one H100 GPU is recommended.
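To sanity-check available GPU memory before launching, a small snippet (illustrative only):

```python
import torch

# Single-GPU inference needs roughly 60GB; report what GPU 0 actually has.
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1e9:.0f} GB total")
```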
Safety
Genmo video models are general text-to-video diffusion models that may reflect the biases and preconceptions present in their training data. Although steps have been taken to limit NSFW content, organizations should implement additional safety protocols and exercise careful consideration before deploying these model weights in commercial services or products.
Limitations
Under the research preview, Mochi 1 is an evolving checkpoint with some known limitations. The initial release generates videos at 480p. Minor warping and distortions may occur in edge cases with extreme motion. Mochi 1 is optimized for photorealistic styles and does not perform well with animated content. The community is expected to fine-tune the model for various aesthetic preferences.
Related Work
- [ComfyUI-MochiWrapper](https://github.com/kijai/ComfyUI-MochiWrapper) adds ComfyUI support for Mochi. The integration of PyTorch's SDPA attention is from this repository.
- [mochi-xdit](https://github.com/xdit-project/mochi-xdit) is a fork of this repository that improves parallel inference speed with [xDiT](https://github.com/xdit-project/xdit).
License
The model is released under the Apache 2.0 license.

BibTeX
@misc{genmo2024mochi,
  title={Mochi 1},
  author={Genmo Team},
  year={2024},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/genmoai/models}}
}