🚀 OmniGen2
OmniGen2 is a powerful and efficient unified multimodal model. It combines a 3B Vision-Language Model (VLM) and a 4B diffusion model. The frozen 3B VLM ([Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)) interprets visual signals and user instructions, and the 4B diffusion model generates high-quality images based on this understanding. This architecture enables strong performance in visual understanding, text-to-image generation, instruction-guided image editing, and in-context generation. As an open-source project, it provides a resource-efficient foundation for exploring controllable and personalized generative AI.
News | Quick Start | Usage Tips | Online Demos | Citation | License
✨ Features
- Visual Understanding: Inherits the robust ability to interpret and analyze image content from its Qwen2.5-VL foundation.
- Text-to-Image Generation: Creates high-fidelity and aesthetically pleasing images from textual prompts.
- Instruction-guided Image Editing: Executes complex, instruction-based image modifications with high precision, achieving state-of-the-art performance among open-source models.
- In-context Generation: A versatile capability to process and flexibly combine diverse inputs, including tasks, reference objects, and scenes, to produce novel and coherent visual outputs.
🚀 Quick Start
Environment Setup
Recommended Setup
git clone git@github.com:VectorSpaceLab/OmniGen2.git
cd OmniGen2
conda create -n omnigen2 python=3.11
conda activate omnigen2
pip install torch==2.6.0 torchvision --extra-index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
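After installation, a quick sanity check (a minimal sketch that relies only on the packages installed above) confirms the expected PyTorch build, CUDA visibility, and that flash-attn imports correctly:

```python
# Minimal post-install sanity check for the environment created above.
import torch

print("PyTorch:", torch.__version__)                 # expect 2.6.0 (+cu124 build)
print("CUDA available:", torch.cuda.is_available())  # expect True on a GPU machine

try:
    import flash_attn  # installed via `pip install flash-attn --no-build-isolation`
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn is not importable; re-run the install step above.")
```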
For users in Mainland China
pip install torch==2.6.0 torchvision --index-url https://mirror.sjtu.edu.cn/pytorch-wheels/cu124
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install flash-attn --no-build-isolation -i https://pypi.tuna.tsinghua.edu.cn/simple
Run Examples
bash example_understanding.sh
bash example_t2i.sh
bash example_edit.sh
bash example_subject_driven_edit.sh
Gradio Demo
💡 Usage Tips
To achieve optimal results with OmniGen2, you can adjust the following key hyperparameters based on your specific use case. A combined usage sketch follows at the end of this section.
num_inference_step: The number of sampling steps per generation. Higher values generally improve quality but increase generation time.
- Recommended Range: 28 to 50
text_guidance_scale: Controls how strictly the output adheres to the text prompt (Classifier-Free Guidance).
- For Text-to-Image: Use a higher value (e.g., 6-7) for simple or less detailed prompts. Use a lower value (e.g., 4) for complex and highly detailed prompts.
- For Editing/Composition: A moderate value around 4-5 is recommended.
image_guidance_scale: Controls how closely the final image resembles the input reference image.
- The Trade-off: A higher value (~2.0) makes the output more faithful to the reference image's structure and style, but it might ignore parts of your text prompt. A lower value (~1.5) gives the text prompt more influence.
- Tip: Start with 1.5 and increase it if you need more consistency with the reference image. For image editing tasks, we recommend setting it between 1.3 and 2.0; for in-context generation tasks, a higher image_guidance_scale preserves more detail from the input images, and we recommend setting it between 2.5 and 3.0.
max_input_image_pixels: To manage processing speed and memory consumption, reference images exceeding this total pixel count are automatically resized.
negative_prompt: Tells the model what you don't want to see in the image.
- Example: blurry, low quality, text, watermark
- Tip: For the best results, try experimenting with different negative prompts. If you're not sure, just leave it blank.
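The sketch below shows how these hyperparameters might be wired together at inference time. It is a hypothetical example assuming a diffusers-style Python pipeline: the import path, the `OmniGen2Pipeline` class, the checkpoint id, and the exact call signature are illustrative assumptions, so treat the bundled example scripts (example_t2i.sh, example_edit.sh) as the authoritative reference.

```python
# Hypothetical sketch: parameter names follow the tips above, but the pipeline
# class, import path, checkpoint id, and call signature are assumptions --
# consult the example scripts shipped with the repository for the actual API.
import torch
from PIL import Image
from omnigen2.pipelines.omnigen2.pipeline_omnigen2 import OmniGen2Pipeline  # assumed import path

pipe = OmniGen2Pipeline.from_pretrained("OmniGen2/OmniGen2", torch_dtype=torch.bfloat16)  # assumed checkpoint id
pipe = pipe.to("cuda")

# Instruction-guided editing: moderate text guidance, image guidance in the 1.3-2.0 range.
result = pipe(
    prompt="Change the background to a snowy mountain landscape",
    input_images=[Image.open("input.jpg")],   # reference image(s)
    num_inference_step=50,                    # 28-50; higher = better quality, slower
    text_guidance_scale=5.0,                  # ~4-5 for editing/composition, 6-7 for simple T2I prompts
    image_guidance_scale=1.8,                 # 1.3-2.0 for editing; 2.5-3.0 for in-context generation
    max_input_image_pixels=1024 * 1024,       # total-pixel cap; value here is illustrative
    negative_prompt="blurry, low quality, text, watermark",
)
result.images[0].save("output.png")
```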
📚 Documentation
News
- 2025-06-16: [Gradio](https://github.com/VectorSpaceLab/OmniGen2?tab=readme-ov-file#-gradio-demo) and Jupyter demos are available.
- 2025-06-16: We released OmniGen2, a multimodal generation model; model weights can be accessed on Hugging Face.
TODO
- [ ] Technical report.
- [ ] In-context generation benchmark: OmniContext.
- [ ] Support CPU offload and improve inference efficiency.
- [ ] Training data and scripts.
- [ ] Data construction pipeline.
- [ ] ComfyUI Demo (community support will be greatly appreciated!).
:heart: Citing Us
If you find this repository or our work useful, please consider giving it a star :star: and a citation :t-rex:, which would be greatly appreciated (the OmniGen2 report will be available as soon as possible):
@article{xiao2024omnigen,
title={Omnigen: Unified image generation},
author={Xiao, Shitao and Wang, Yueze and Zhou, Junjie and Yuan, Huaying and Xing, Xingrun and Yan, Ruiran and Wang, Shuting and Huang, Tiejun and Liu, Zheng},
journal={arXiv preprint arXiv:2409.11340},
year={2024}
}
📄 License
This work is licensed under the Apache 2.0 License.