Show-o2: Improved Native Unified Multimodal Models
Show-o2 is an enhanced native unified multimodal model that combines autoregressive modeling and flow matching. It can handle a wide range of multimodal understanding and generation tasks across text, images, and videos.
Quick Start
Environment Setup
First, set up the environment:
bash build_env.sh
Log in to your wandb account on your machine or server:
wandb login <your wandb key>
Download the Wan2.1 3D causal VAE model weight and place it in the current directory.
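If the weight is hosted on Hugging Face, it can be fetched with huggingface_hub. This is only a hedged sketch: the repo_id and filename below are assumptions based on the public Wan2.1 release and should be checked against the actual download location referenced above.

from huggingface_hub import hf_hub_download

# Assumed repo_id/filename for the Wan2.1 3D causal VAE weight -- verify before running.
hf_hub_download(
    repo_id="Wan-AI/Wan2.1-T2V-1.3B",
    filename="Wan2.1_VAE.pth",
    local_dir=".",   # place the weight in the current directory
)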
Usage Examples
Multimodal Understanding
Run the multimodal understanding demo; the results can be viewed on wandb.
python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
mmu_image_path=./docs/mmu/pexels-jane-pham-727419-1571673.jpg question='Describe the image in detail.'
python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
mmu_image_path=./docs/mmu/pexels-fotios-photos-2923436.jpg question='Please tell me what is happening in the image.'
python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
mmu_image_path=./docs/mmu/pexels-taryn-elliott-4144459.jpg question='How many avocados (including the halved) are in this image? Tell me how to make an avocado milkshake in detail.'
Text-to-Image Generation
Run the text-to-image generation demo; the results can be viewed on wandb. A hedged sketch of how guidance_scale and num_inference_steps are typically used follows the commands below.
python3 inference_t2i.py config=configs/showo2_1.5b_demo_1024x1024.yaml \
batch_size=4 guidance_scale=7.5 num_inference_steps=50;
python3 inference_t2i.py config=configs/showo2_1.5b_demo_512x512.yaml \
batch_size=4 guidance_scale=7.5 num_inference_steps=50;
python3 inference_t2i.py config=configs/showo2_1.5b_demo_432x432.yaml \
batch_size=4 guidance_scale=7.5 num_inference_steps=50;
python3 inference_t2i.py config=configs/showo2_7b_demo_432x432.yaml \
batch_size=4 guidance_scale=7.5 num_inference_steps=50;
Features
- Unified Learning: Performs unified learning of multimodal understanding and generation in the text token and 3D Causal VAE space, scalable to text, image, and video modalities.
- Dual-Path Fusion: A dual path of spatial(-temporal) fusion accommodates the distinct feature dependencies of multimodal understanding and generation.
- Autoregressive Modeling and Flow Matching: Dedicated heads use autoregressive modeling and flow matching for the unified learning of multimodal understanding, image/video generation, and mixed-modality generation (see the sketch after this list).
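As a concrete illustration of the last point, here is a minimal, self-contained sketch of how a next-token cross-entropy loss (language head) and a flow-matching velocity loss (flow head) can be combined. It is an assumption about the general recipe, not the released Show-o2 training code; all heads and tensors are toy placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim = 1000, 64
language_head = nn.Linear(dim, vocab)   # placeholder for the language head
flow_head = nn.Linear(dim, dim)         # placeholder for the flow head

# Autoregressive text loss: predict token t+1 from the hidden state at position t.
text_ids = torch.randint(0, vocab, (2, 16))   # [batch, seq]
text_hidden = torch.randn(2, 16, dim)         # placeholder backbone features
logits = language_head(text_hidden[:, :-1])
ar_loss = F.cross_entropy(logits.reshape(-1, vocab), text_ids[:, 1:].reshape(-1))

# Flow-matching loss: regress the velocity of a linear noise-to-data path.
latents = torch.randn(2, 16, dim)             # placeholder 3D-VAE latents
noise = torch.randn_like(latents)
t = torch.rand(2, 1, 1)                       # one timestep per sample
noisy = (1 - t) * noise + t * latents         # point on the interpolation path
velocity_target = latents - noise             # constant velocity of that path
velocity_pred = flow_head(noisy)              # timestep conditioning omitted for brevity
fm_loss = F.mse_loss(velocity_pred, velocity_target)

total_loss = ar_loss + fm_loss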
Documentation
Abstract
This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at this https URL.
What is new about Show-o2?
We perform unified learning of multimodal understanding and generation in the text token and 3D Causal VAE space, which is scalable to text, image, and video modalities. A dual path of spatial(-temporal) fusion is proposed to accommodate the distinct feature dependencies of multimodal understanding and generation; a sketch of this idea follows below. We employ dedicated heads with autoregressive modeling and flow matching for the overall unified learning of multimodal understanding, image/video generation, and mixed-modality generation.
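The sketch below gives one possible reading of the dual-path idea: a semantic path serving understanding and a low-level projection path serving generation, fused into a single visual representation over 3D causal VAE latents. It is an assumption for illustration, not the released architecture; all module names and dimensions are hypothetical.

import torch
import torch.nn as nn

class DualPathFusion(nn.Module):
    def __init__(self, latent_dim=16, hidden_dim=256, out_dim=256):
        super().__init__()
        # Understanding-oriented path: richer nonlinear features for semantics.
        self.semantic_path = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, out_dim)
        )
        # Generation-oriented path: a light projection that preserves low-level detail.
        self.projection_path = nn.Linear(latent_dim, out_dim)
        self.fuse = nn.Linear(2 * out_dim, out_dim)

    def forward(self, vae_latents):
        # vae_latents: [batch, frames, tokens, latent_dim] from a 3D causal VAE
        sem = self.semantic_path(vae_latents)
        proj = self.projection_path(vae_latents)
        return self.fuse(torch.cat([sem, proj], dim=-1))

fusion = DualPathFusion()
unified = fusion(torch.randn(1, 4, 64, 16))   # frames=1 for an image, >1 for a video
print(unified.shape)                          # torch.Size([1, 4, 64, 256])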

Pre-trained Model Weights
The Show-o2 checkpoints can be found on Hugging Face:
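Once a repository id is known, a checkpoint can be pulled with huggingface_hub. The snippet below is a hedged example; <org>/<model> is a placeholder, not an actual release name.

from huggingface_hub import snapshot_download

# Replace <org>/<model> with the repository id of the desired Show-o2 checkpoint.
snapshot_download(repo_id="<org>/<model>", local_dir="./checkpoints")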
License
This project is licensed under the Apache-2.0 license.
Citation
To cite the paper and model, please use the following BibTeX entry:
@article{xie2025showo2,
title={Show-o2: Improved Native Unified Multimodal Models},
author={Xie, Jinheng and Yang, Zhenheng and Shou, Mike Zheng},
journal={arXiv preprint},
year={2025}
}
Acknowledgments
This work is heavily based on Show-o.