Show-o2: Improved Native Unified Multimodal Models
Show-o2 is an enhanced native unified multimodal model that combines autoregressive modeling and flow matching. It can handle a wide range of multimodal understanding and generation tasks across text, images, and videos.
Quick Start
Environment Setup
First, set up the environment:
bash build_env.sh
Log in to your wandb account on your machine or server:
wandb login <your wandb key>
Download the Wan2.1 3D causal VAE model weight and place it in the current directory.
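If the weight is hosted on Hugging Face, it can be fetched with huggingface_hub. This is only a hedged sketch: the repo_id and filename below are assumptions based on the public Wan2.1 release and should be checked against the actual download location referenced above.

from huggingface_hub import hf_hub_download

# Assumed repo_id/filename for the Wan2.1 3D causal VAE weight -- verify before running.
hf_hub_download(
    repo_id="Wan-AI/Wan2.1-T2V-1.3B",
    filename="Wan2.1_VAE.pth",
    local_dir=".",   # place the weight in the current directory
)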
Usage Examples
Multimodal Understanding
Run the multimodal understanding demo; the results can be viewed on wandb.
python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
mmu_image_path=./docs/mmu/pexels-jane-pham-727419-1571673.jpg question='Describe the image in detail.'
python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
mmu_image_path=./docs/mmu/pexels-fotios-photos-2923436.jpg question='Please tell me what is happening in the image.'
python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
mmu_image_path=./docs/mmu/pexels-taryn-elliott-4144459.jpg question='How many avocados (including the halved) are in this image? Tell me how to make an avocado milkshake in detail.'
Text-to-Image Generation
Run the text-to-image generation demo; the results can be viewed on wandb. A hedged sketch of how guidance_scale and num_inference_steps are typically used follows the commands below.
python3 inference_t2i.py config=configs/showo2_1.5b_demo_1024x1024.yaml \
batch_size=4 guidance_scale=7.5 num_inference_steps=50;
python3 inference_t2i.py config=configs/showo2_1.5b_demo_512x512.yaml \
batch_size=4 guidance_scale=7.5 num_inference_steps=50;
python3 inference_t2i.py config=configs/showo2_1.5b_demo_432x432.yaml \
batch_size=4 guidance_scale=7.5 num_inference_steps=50;
python3 inference_t2i.py config=configs/showo2_7b_demo_432x432.yaml \
batch_size=4 guidance_scale=7.5 num_inference_steps=50;
Features
- Unified Learning: Performs unified learning of multimodal understanding and generation in the text token and 3D Causal VAE space, scalable to text, image, and video modalities.
- Dual-Path Fusion: A dual path of spatial(-temporal) fusion accommodates the distinct feature dependencies of multimodal understanding and generation.
- Autoregressive Modeling and Flow Matching: Dedicated heads use autoregressive modeling and flow matching for the unified learning of multimodal understanding, image/video generation, and mixed-modality generation (see the sketch after this list).
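As a concrete illustration of the last point, here is a minimal, self-contained sketch of how a next-token cross-entropy loss (language head) and a flow-matching velocity loss (flow head) can be combined. It is an assumption about the general recipe, not the released Show-o2 training code; all heads and tensors are toy placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim = 1000, 64
language_head = nn.Linear(dim, vocab)   # placeholder for the language head
flow_head = nn.Linear(dim, dim)         # placeholder for the flow head

# Autoregressive text loss: predict token t+1 from the hidden state at position t.
text_ids = torch.randint(0, vocab, (2, 16))   # [batch, seq]
text_hidden = torch.randn(2, 16, dim)         # placeholder backbone features
logits = language_head(text_hidden[:, :-1])
ar_loss = F.cross_entropy(logits.reshape(-1, vocab), text_ids[:, 1:].reshape(-1))

# Flow-matching loss: regress the velocity of a linear noise-to-data path.
latents = torch.randn(2, 16, dim)             # placeholder 3D-VAE latents
noise = torch.randn_like(latents)
t = torch.rand(2, 1, 1)                       # one timestep per sample
noisy = (1 - t) * noise + t * latents         # point on the interpolation path
velocity_target = latents - noise             # constant velocity of that path
velocity_pred = flow_head(noisy)              # timestep conditioning omitted for brevity
fm_loss = F.mse_loss(velocity_pred, velocity_target)

total_loss = ar_loss + fm_loss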
Documentation
Abstract
This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at this https URL.
What is new about Show-o2?
We perform unified learning of multimodal understanding and generation in the text token and 3D Causal VAE space, which is scalable to text, image, and video modalities. A dual path of spatial(-temporal) fusion is proposed to accommodate the distinct feature dependencies of multimodal understanding and generation; a sketch of this idea follows below. We employ dedicated heads with autoregressive modeling and flow matching for the overall unified learning of multimodal understanding, image/video generation, and mixed-modality generation.
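The sketch below gives one possible reading of the dual-path idea: a semantic path serving understanding and a low-level projection path serving generation, fused into a single visual representation over 3D causal VAE latents. It is an assumption for illustration, not the released architecture; all module names and dimensions are hypothetical.

import torch
import torch.nn as nn

class DualPathFusion(nn.Module):
    def __init__(self, latent_dim=16, hidden_dim=256, out_dim=256):
        super().__init__()
        # Understanding-oriented path: richer nonlinear features for semantics.
        self.semantic_path = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, out_dim)
        )
        # Generation-oriented path: a light projection that preserves low-level detail.
        self.projection_path = nn.Linear(latent_dim, out_dim)
        self.fuse = nn.Linear(2 * out_dim, out_dim)

    def forward(self, vae_latents):
        # vae_latents: [batch, frames, tokens, latent_dim] from a 3D causal VAE
        sem = self.semantic_path(vae_latents)
        proj = self.projection_path(vae_latents)
        return self.fuse(torch.cat([sem, proj], dim=-1))

fusion = DualPathFusion()
unified = fusion(torch.randn(1, 4, 64, 16))   # frames=1 for an image, >1 for a video
print(unified.shape)                          # torch.Size([1, 4, 64, 256])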

Pre-trained Model Weights
The Show-o2 checkpoints can be found on Hugging Face:
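Once a repository id is known, a checkpoint can be pulled with huggingface_hub. The snippet below is a hedged example; <org>/<model> is a placeholder, not an actual release name.

from huggingface_hub import snapshot_download

# Replace <org>/<model> with the repository id of the desired Show-o2 checkpoint.
snapshot_download(repo_id="<org>/<model>", local_dir="./checkpoints")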
License
This project is licensed under the Apache-2.0 license.
Citation
To cite the paper and model, please use the following BibTeX entry:
@article{xie2025showo2,
title={Show-o2: Improved Native Unified Multimodal Models},
author={Xie, Jinheng and Yang, Zhenheng and Shou, Mike Zheng},
journal={arXiv preprint},
year={2025}
}
Acknowledgments
This work is heavily based on Show-o.