🚀 OmniGen2
OmniGen2 is a powerful and efficient unified multimodal model. It combines a 3B Vision-Language Model (VLM) and a 4B diffusion model. The frozen 3B VLM ([Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)) interprets visual signals and user instructions, and the 4B diffusion model generates high-quality images based on this understanding. This architecture enables strong performance in visual understanding, text-to-image generation, instruction-guided image editing, and in-context generation. As an open-source project, it provides a resource-efficient foundation for exploring controllable and personalized generative AI.
News | Quick Start | Usage Tips | Online Demos | Citation | License
✨ Features
- Visual Understanding: Inherits the robust ability to interpret and analyze image content from its Qwen2.5-VL foundation.
- Text-to-Image Generation: Creates high-fidelity and aesthetically pleasing images from textual prompts.
- Instruction-guided Image Editing: Executes complex, instruction-based image modifications with high precision, achieving state-of-the-art performance among open-source models.
- In-context Generation: A versatile capability to process and flexibly combine diverse inputs, including tasks, reference objects, and scenes, to produce novel and coherent visual outputs.
🚀 Quick Start
Environment Setup
Recommended Setup
git clone git@github.com:VectorSpaceLab/OmniGen2.git
cd OmniGen2
conda create -n omnigen2 python=3.11
conda activate omnigen2
pip install torch==2.6.0 torchvision --extra-index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
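After installation, a quick sanity check (a minimal sketch that relies only on the packages installed above) confirms the expected PyTorch build, CUDA visibility, and that flash-attn imports correctly:

```python
# Minimal post-install sanity check for the environment created above.
import torch

print("PyTorch:", torch.__version__)                 # expect 2.6.0 (+cu124 build)
print("CUDA available:", torch.cuda.is_available())  # expect True on a GPU machine

try:
    import flash_attn  # installed via `pip install flash-attn --no-build-isolation`
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn is not importable; re-run the install step above.")
```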
For users in Mainland China
pip install torch==2.6.0 torchvision --index-url https://mirror.sjtu.edu.cn/pytorch-wheels/cu124
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install flash-attn --no-build-isolation -i https://pypi.tuna.tsinghua.edu.cn/simple
Run Examples
bash example_understanding.sh
bash example_t2i.sh
bash example_edit.sh
bash example_subject_driven_edit.sh
Gradio Demo
💡 Usage Tips
To achieve optimal results with OmniGen2, you can adjust the following key hyperparameters based on your specific use case. A combined usage sketch follows at the end of this section.
num_inference_step: The number of sampling steps per generation. Higher values generally improve quality but increase generation time.
- Recommended Range: 28 to 50
text_guidance_scale: Controls how strictly the output adheres to the text prompt (Classifier-Free Guidance).
- For Text-to-Image: Use a higher value (e.g., 6-7) for simple or less detailed prompts. Use a lower value (e.g., 4) for complex and highly detailed prompts.
- For Editing/Composition: A moderate value around 4-5 is recommended.
image_guidance_scale: Controls how closely the final image resembles the input reference image.
- The Trade-off: A higher value (~2.0) makes the output more faithful to the reference image's structure and style, but it might ignore parts of your text prompt. A lower value (~1.5) gives the text prompt more influence.
- Tip: Start with 1.5 and increase it if you need more consistency with the reference image. For image editing tasks, we recommend setting it between 1.3 and 2.0; for in-context generation tasks, a higher image_guidance_scale preserves more detail from the input images, and we recommend setting it between 2.5 and 3.0.
max_input_image_pixels: To manage processing speed and memory consumption, reference images exceeding this total pixel count are automatically resized.
negative_prompt: Tells the model what you don't want to see in the image.
- Example: blurry, low quality, text, watermark
- Tip: For the best results, try experimenting with different negative prompts. If you're not sure, just leave it blank.
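The sketch below shows how these hyperparameters might be wired together at inference time. It is a hypothetical example assuming a diffusers-style Python pipeline: the import path, the `OmniGen2Pipeline` class, the checkpoint id, and the exact call signature are illustrative assumptions, so treat the bundled example scripts (example_t2i.sh, example_edit.sh) as the authoritative reference.

```python
# Hypothetical sketch: parameter names follow the tips above, but the pipeline
# class, import path, checkpoint id, and call signature are assumptions --
# consult the example scripts shipped with the repository for the actual API.
import torch
from PIL import Image
from omnigen2.pipelines.omnigen2.pipeline_omnigen2 import OmniGen2Pipeline  # assumed import path

pipe = OmniGen2Pipeline.from_pretrained("OmniGen2/OmniGen2", torch_dtype=torch.bfloat16)  # assumed checkpoint id
pipe = pipe.to("cuda")

# Instruction-guided editing: moderate text guidance, image guidance in the 1.3-2.0 range.
result = pipe(
    prompt="Change the background to a snowy mountain landscape",
    input_images=[Image.open("input.jpg")],   # reference image(s)
    num_inference_step=50,                    # 28-50; higher = better quality, slower
    text_guidance_scale=5.0,                  # ~4-5 for editing/composition, 6-7 for simple T2I prompts
    image_guidance_scale=1.8,                 # 1.3-2.0 for editing; 2.5-3.0 for in-context generation
    max_input_image_pixels=1024 * 1024,       # total-pixel cap; value here is illustrative
    negative_prompt="blurry, low quality, text, watermark",
)
result.images[0].save("output.png")
```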
📚 Documentation
News
- 2025-06-16: [Gradio](https://github.com/VectorSpaceLab/OmniGen2?tab=readme-ov-file#-gradio-demo) and Jupyter demos are available.
- 2025-06-16: We released OmniGen2, a multimodal generation model; model weights can be accessed on Hugging Face.
TODO
- [ ] Technical report.
- [ ] In-context generation benchmark: OmniContext.
- [ ] Support CPU offload and improve inference efficiency.
- [ ] Training data and scripts.
- [ ] Data construction pipeline.
- [ ] ComfyUI Demo (community support will be greatly appreciated!).
:heart: Citing Us
If you find this repository or our work useful, please consider giving it a star :star: and a citation :t-rex:, which would be greatly appreciated (the OmniGen2 report will be available as soon as possible):
@article{xiao2024omnigen,
title={Omnigen: Unified image generation},
author={Xiao, Shitao and Wang, Yueze and Zhou, Junjie and Yuan, Huaying and Xing, Xingrun and Yan, Ruiran and Wang, Shuting and Huang, Tiejun and Liu, Zheng},
journal={arXiv preprint arXiv:2409.11340},
year={2024}
}
📄 License
This work is licensed under the Apache 2.0 License.