Mochi 1
A state-of-the-art video generation model by Genmo, enabling high-fidelity video creation from text prompts.
Blog | Hugging Face | Playground | Careers

Quick Start
Mochi 1 preview is an open state-of-the-art video generation model. It shows high-fidelity motion and strong prompt adherence in preliminary evaluation, significantly narrowing the gap between closed and open video generation systems. The model is released under the permissive Apache 2.0 license. You can try it for free on our playground.
Installation
Install using uv:
git clone https://github.com/genmoai/models
cd models
pip install uv
uv venv .venv
source .venv/bin/activate
uv pip install setuptools
uv pip install -e . --no-build-isolation
If you want to install flash attention, you can use:
uv pip install -e .[flash] --no-build-isolation
You also need to install FFMPEG to convert your outputs into videos.
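If you are unsure whether FFMPEG is visible from your environment, here is a quick sanity check using only the Python standard library (illustrative, not part of this repository):

```python
import shutil

# Fail early if no ffmpeg binary is on PATH; video export will not work without it.
if shutil.which("ffmpeg") is None:
    raise RuntimeError("ffmpeg not found; install it with your system package manager.")
print("ffmpeg found at", shutil.which("ffmpeg"))
```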
Download Weights
Use download_weights.py to download the model + decoder to a local directory:
python3 ./scripts/download_weights.py <path_to_downloaded_directory>
Or, directly download the weights from Hugging Face or via magnet:?xt=urn:btih:441da1af7a16bcaa4f556964f8028d7113d21cbb&dn=weights&tr=udp://tracker.opentrackr.org:1337/announce
to a folder on your computer.
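If you prefer to script the Hugging Face download, a minimal sketch using the huggingface_hub library is shown below. It assumes the genmo/mochi-1-preview repository referenced in the Diffusers examples later in this README and is not part of this repository's tooling:

```python
from huggingface_hub import snapshot_download

# Fetch the published weights into a local folder (the files are tens of gigabytes).
snapshot_download(
    repo_id="genmo/mochi-1-preview",
    local_dir="<path_to_downloaded_directory>",
)
```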
Running
Start the Gradio UI:
python3 ./demos/gradio_ui.py --model_dir "<path_to_downloaded_directory>"
Generate videos from the CLI:
python3 ./demos/cli.py --model_dir "<path_to_downloaded_directory>"
Replace <path_to_downloaded_directory> with the path to your model directory.
API
This repository provides a simple, composable API for calling the model programmatically, which yields the highest-quality results. You can find a full example here; below is a rough example:
from genmo.mochi_preview.pipelines import (
    DecoderModelFactory,
    DitModelFactory,
    MochiSingleGPUPipeline,
    T5ModelFactory,
    linear_quadratic_schedule,
)

# Directory containing the downloaded weights
MOCHI_DIR = "<path_to_downloaded_directory>"

pipeline = MochiSingleGPUPipeline(
    text_encoder_factory=T5ModelFactory(),
    dit_factory=DitModelFactory(
        model_path=f"{MOCHI_DIR}/dit.safetensors", model_dtype="bf16"
    ),
    decoder_factory=DecoderModelFactory(
        model_path=f"{MOCHI_DIR}/vae.safetensors",
    ),
    cpu_offload=True,
    decode_type="tiled_full",
)

video = pipeline(
    height=480,
    width=848,
    num_frames=31,
    num_inference_steps=64,
    sigma_schedule=linear_quadratic_schedule(64, 0.025),
    cfg_schedule=[4.5] * 64,
    batch_cfg=False,
    prompt="your favorite prompt here ...",
    negative_prompt="",
    seed=12345,
)
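How you persist the result is up to you. As one option, assuming the pipeline returns the generated frames as a sequence of arrays with values in [0, 1] (an assumption; see the full example for the repository's own saving code), you could reuse the export_to_video helper from Diffusers:

```python
from diffusers.utils import export_to_video

# Assumes `video` is a sequence of frames in [0, 1]; adjust to the actual return type.
export_to_video(video, "mochi.mp4", fps=30)
```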
Running with Diffusers
Install Diffusers
pip install git+https://github.com/huggingface/diffusers.git
High-quality example (requires 42GB VRAM)
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview")

# Enable memory savings
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k."
with torch.autocast("cuda", torch.bfloat16, cache_enabled=False):
    frames = pipe(prompt, num_frames=84).frames[0]

export_to_video(frames, "mochi.mp4", fps=30)
Lower-precision example (requires 22GB VRAM)
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", variant="bf16", torch_dtype=torch.bfloat16)

# Enable memory savings
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k."
frames = pipe(prompt, num_frames=84).frames[0]
export_to_video(frames, "mochi.mp4", fps=30)
For more details, check out the Diffusers documentation.
Technical Details
Model Architecture
Mochi 1 is a major advancement in open-source video generation, featuring a 10-billion-parameter diffusion model built on the novel Asymmetric Diffusion Transformer (AsymmDiT) architecture. Trained entirely from scratch, it is the largest openly released video generative model. The architecture is simple and hackable, and an inference harness with an efficient context-parallel implementation is released alongside it.
Alongside Mochi, the video AsymmVAE is also open-sourced. It uses an asymmetric encoder-decoder structure to build an efficient, high-quality compression model. The AsymmVAE compresses videos to a 128x smaller size, with 8x8 spatial and 6x temporal compression to a 12-channel latent space.
AsymmVAE Model Specs
| Property | Details |
|----------|---------|
| Params Count | 362M |
| Enc Base Channels | 64 |
| Dec Base Channels | 128 |
| Latent Dim | 12 |
| Spatial Compression | 8x8 |
| Temporal Compression | 6x |
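As a back-of-the-envelope illustration of what these compression factors mean for tensor shapes (not the actual VAE code), here is how the 480x848, 84-frame clip from the examples above maps to a 12-channel latent:

```python
# Illustrative shape arithmetic only; the real AsymmVAE handles temporal
# boundaries and padding internally.
frames, height, width = 84, 480, 848     # clip size from the examples above
latent_channels = 12
latent_frames = frames // 6              # 6x temporal compression
latent_height = height // 8              # 8x8 spatial compression
latent_width = width // 8

print(latent_frames, latent_height, latent_width, latent_channels)  # 14 60 106 12
```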
The AsymmDiT efficiently processes user prompts and compressed video tokens by streamlining text processing and focusing neural network capacity on visual reasoning. It jointly attends to text and visual tokens with multimodal self-attention and learns separate MLP layers for each modality, similar to Stable Diffusion 3. The visual stream has nearly 4 times as many parameters as the text stream via a larger hidden dimension. Non-square QKV and output projection layers are used to unify the modalities in self-attention, reducing inference memory requirements.
In contrast to many modern diffusion models that use multiple pretrained language models, Mochi 1 simply encodes prompts with a single T5-XXL language model.
AsymmDiT Model Specs
| Property | Details |
|----------|---------|
| Params Count | 10B |
| Num Layers | 48 |
| Num Heads | 24 |
| Visual Dim | 3072 |
| Text Dim | 1536 |
| Visual Tokens | 44520 |
| Text Tokens | 256 |
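Below is a minimal, illustrative sketch (assumed shapes, not the actual AsymmDiT code) of how non-square QKV projections let the wider visual stream and the narrower text stream share a single self-attention, using the dimensions from the table above but far fewer tokens:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

visual_dim, text_dim, num_heads = 3072, 1536, 24
head_dim = visual_dim // num_heads                 # width of the joint attention space

# Square projection for the larger visual stream, non-square for the text stream.
qkv_visual = nn.Linear(visual_dim, 3 * visual_dim)
qkv_text = nn.Linear(text_dim, 3 * visual_dim)

visual_tokens = torch.randn(1, 512, visual_dim)    # real model uses 44520 visual tokens
text_tokens = torch.randn(1, 256, text_dim)        # 256 text tokens

def split_qkv(x):
    # (B, N, 3*D) -> three tensors of shape (B, heads, N, head_dim)
    b, n, _ = x.shape
    return x.view(b, n, 3, num_heads, head_dim).permute(2, 0, 3, 1, 4)

q_v, k_v, v_v = split_qkv(qkv_visual(visual_tokens))
q_t, k_t, v_t = split_qkv(qkv_text(text_tokens))

# Joint self-attention over the concatenated visual + text sequence.
q = torch.cat([q_v, q_t], dim=2)
k = torch.cat([k_v, k_t], dim=2)
v = torch.cat([v_v, v_t], dim=2)
out = F.scaled_dot_product_attention(q, k, v)      # (1, 24, 768, 128)
```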
Hardware Requirements
The repository supports both multi-GPU and single-GPU operation. Running on a single GPU requires approximately 60GB of VRAM. ComfyUI can optimize Mochi to run on less than 20GB of VRAM, but this implementation prioritizes flexibility over memory efficiency. Using at least one H100 GPU is recommended.
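To sanity-check available GPU memory before launching, a small snippet (illustrative only):

```python
import torch

# Single-GPU inference needs roughly 60GB; report what GPU 0 actually has.
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1e9:.0f} GB total")
```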
Safety
Genmo video models are general text-to-video diffusion models that may reflect the biases and preconceptions present in their training data. Although steps have been taken to limit NSFW content, organizations should implement additional safety protocols and exercise careful consideration before deploying these model weights in commercial services or products.
Limitations
Under the research preview, Mochi 1 is an evolving checkpoint with some known limitations. The initial release generates videos at 480p. Minor warping and distortions may occur in edge cases with extreme motion. Mochi 1 is optimized for photorealistic styles and does not perform well with animated content. The community is expected to fine-tune the model for various aesthetic preferences.
Related Work
- [ComfyUI-MochiWrapper](https://github.com/kijai/ComfyUI-MochiWrapper) adds ComfyUI support for Mochi. The integration of PyTorch's SDPA attention is from this repository.
- [mochi-xdit](https://github.com/xdit-project/mochi-xdit) is a fork of this repository that improves parallel inference speed with [xDiT](https://github.com/xdit-project/xdit).
License
The model is released under the Apache 2.0 license.

BibTeX
@misc{genmo2024mochi,
  title={Mochi 1},
  author={Genmo Team},
  year={2024},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/genmoai/models}}
}