VARGPT-v1.1: An Open-Source Unified Model for Visual Understanding and Image Generation

VARGPT-v1.1

Developed by VARGPT-family

VARGPT-v1.1 is a visual autoregressive unified large model, enhanced through iterative instruction tuning and reinforcement learning, capable of performing both visual understanding and generation tasks.

Text-to-Image

Transformers

Language: English

Open Source License: Apache-2.0

Tags: Visual Autoregressive Unified Model, Multimodal Understanding and Generation, Iterative Instruction Tuning

Downloads 954

Release Time: 4/1/2025

Model Overview

VARGPT-v1.1 is a multimodal large language model that supports visual understanding and generation tasks. It achieves visual understanding by predicting the next token and visual generation by predicting the next scale.
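The difference between the two prediction modes can be illustrated with a toy step count. This is a minimal sketch, not the model's implementation: it assumes that next-token decoding emits one token per step, while next-scale decoding emits one whole token map per step over a coarse-to-fine schedule. The schedule `[1, 2, 4, 8, 16]` and all function names are hypothetical and chosen only for illustration.

```python
# Toy comparison of decoding step counts.
# The scale schedule is an illustrative assumption, not VARGPT-v1.1's real configuration.


def next_token_steps(side: int) -> int:
    """Token-by-token decoding of a side x side token map: one step per token."""
    return side * side


def next_scale_steps(scales: list[int]) -> int:
    """Scale-by-scale decoding: one step per scale, each emitting a full token map."""
    return len(scales)


def tokens_per_scale(scales: list[int]) -> list[int]:
    """Number of tokens emitted at each scale (side * side)."""
    return [side * side for side in scales]


if __name__ == "__main__":
    scales = [1, 2, 4, 8, 16]  # hypothetical coarse-to-fine schedule
    print(next_token_steps(scales[-1]))  # 256
    print(next_scale_steps(scales))      # 5
    print(tokens_per_scale(scales))      # [1, 4, 16, 64, 256]
```

Under these assumptions, the final 16 x 16 map takes 256 sequential steps when decoded token by token, but only 5 sequential steps when decoded scale by scale.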

Model Features

Unified Understanding and Generation

Simultaneously performs visual understanding and generation tasks within a single model.

Iterative Instruction Tuning

Enhances model performance through iterative instruction tuning.

Reinforcement Learning Optimization

Further optimizes model performance using reinforcement learning.

Multimodal Support

Supports both text and image inputs and outputs.

Model Capabilities

Multimodal Understanding

Text-to-Image Generation

Image Caption Generation

Visual Question Answering

Use Cases

Creative Design

Album Cover Design

Generates fantasy-style album covers based on text descriptions.

Produces images that match the descriptions.

Content Understanding

Meme Interpretation

Provides detailed explanations of meme content and meanings.

Generates detailed textual explanations.

🚀 VARGPT-v1.1

Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning

VARGPT-v1.1 (7B+2B) models understanding and generation as two distinct paradigms within a unified model: predicting the next token for visual understanding and predicting the next scale for visual generation.
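Because the usage examples load the weights in `torch.float32`, it can help to estimate how much memory the weights alone need. The sketch below is back-of-the-envelope arithmetic only: it treats "7B+2B" as a nominal 9 billion parameters (the exact count is not stated here), ignores activations and caches, and does not imply that the model has been validated in half precision. The helper name `weight_memory_gib` is hypothetical.

```python
# Rough weight-memory estimate; parameter count is the nominal "7B+2B" figure.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params: float, dtype: str = "float32") -> float:
    """Memory needed to hold the weights only, in GiB (2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    nominal_params = 7e9 + 2e9  # assumption: nominal sizes, not an exact count
    for dtype in ("float32", "bfloat16"):
        print(f"{dtype}: ~{weight_memory_gib(nominal_params, dtype):.1f} GiB")
```

With these nominal numbers, float32 weights come to about 33.5 GiB, so a single GPU needs at least that much free memory before any inference overhead.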


We provide a simple generation process for using our model. For more details, refer to the VARGPT-v1.1 repository on GitHub.

🚀 Quick Start

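The usage examples below cover both tasks: multimodal understanding and text-to-image generation. They import `vargpt_qwen_v1_1` and `patching_utils`, which are not part of transformers and are presumably provided by the project's GitHub repository, so refer to it for setup details.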

✨ Features

Multimodal Understanding: Capable of understanding multimodal data, such as images and text.
Multimodal Generation: Can generate images based on text prompts.

💻 Usage Examples

Basic Usage - Multimodal Understanding

# Multimodal understanding: load the model and ask it to explain an image.
from PIL import Image
import torch
from transformers import AutoTokenizer
from vargpt_qwen_v1_1.modeling_vargpt_qwen2_vl import VARGPTQwen2VLForConditionalGeneration
from vargpt_qwen_v1_1.prepare_vargpt_v1_1 import prepare_vargpt_qwen2vl_v1_1 
from vargpt_qwen_v1_1.processing_vargpt_qwen2_vl import VARGPTQwen2VLProcessor
from patching_utils.patching import patching

model_id = "VARGPT-family/VARGPT-v1.1"

prepare_vargpt_qwen2vl_v1_1(model_id)

model = VARGPTQwen2VLForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float32, 
    low_cpu_mem_usage=True, 
).to(0)

patching(model)

tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = VARGPTQwen2VLProcessor.from_pretrained(model_id)

# Define a chat history and use `apply_chat_template` to get a correctly formatted prompt.
# Each "content" value must be a list of dicts whose "type" is "text" or "image".
conversation = [
    {
      "role": "user",
      "content": [
          {"type": "text", "text": "Please explain the meme in detail."},
          {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
image_file = "./assets/llava_bench_demo.png"
print(prompt)

raw_image = Image.open(image_file)
inputs = processor(images=[raw_image], text=prompt, return_tensors='pt').to(0, torch.float32)

output = model.generate(
    **inputs, 
    max_new_tokens=2048, 
    do_sample=False)

print(processor.decode(output[0], skip_special_tokens=True))
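The `conversation` structure in the example above follows a fixed shape: one user turn whose `content` is a list with a text entry and one `{"type": "image"}` placeholder per image. A small helper can build it. This is a sketch with a hypothetical function name, `build_conversation`; it is not part of the VARGPT code and only reproduces the dict layout shown in the example.

```python
# Hypothetical helper (not part of the VARGPT package) that builds the
# single-turn conversation structure used in the example above.


def build_conversation(text: str, num_images: int = 0) -> list[dict]:
    """Return a one-turn user conversation with text followed by image placeholders."""
    content: list[dict] = [{"type": "text", "text": text}]
    # One placeholder per image; the actual images are passed to the processor separately.
    content.extend({"type": "image"} for _ in range(num_images))
    return [{"role": "user", "content": content}]


if __name__ == "__main__":
    conversation = build_conversation("Please explain the meme in detail.", num_images=1)
    print(conversation)
```

The returned list can be passed to `processor.apply_chat_template(conversation, add_generation_prompt=True)` exactly as in the example; with `num_images=0` it produces the text-only form used for image generation.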

Advanced Usage - Multimodal Generation

# Text-to-image generation: load the model and generate an image from a text prompt.
import torch
from transformers import AutoTokenizer
from vargpt_qwen_v1_1.modeling_vargpt_qwen2_vl import VARGPTQwen2VLForConditionalGeneration
from vargpt_qwen_v1_1.prepare_vargpt_v1_1 import prepare_vargpt_qwen2vl_v1_1 
from vargpt_qwen_v1_1.processing_vargpt_qwen2_vl import VARGPTQwen2VLProcessor
from patching_utils.patching import patching
model_id = "VARGPT-family/VARGPT-v1.1"

prepare_vargpt_qwen2vl_v1_1(model_id)

model = VARGPTQwen2VLForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float32,     
    low_cpu_mem_usage=True, 
).to(0)

patching(model)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = VARGPTQwen2VLProcessor.from_pretrained(model_id)

conversation = [
    {
      "role": "user",
      "content": [
          {"type": "text", "text": "Can you depict a scene of A power metal album cover featuring a fantasy-style illustration with a white falcon."},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)

inputs = processor(text=prompt, return_tensors='pt').to(0, torch.float32)
# Output path for the generated image
model._IMAGE_GEN_PATH = "output.png"
output = model.generate(
    **inputs, 
    max_new_tokens=4096, 
    do_sample=False)

print(processor.decode(output[0][:-1], skip_special_tokens=True))
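The generation example sets `model._IMAGE_GEN_PATH = "output.png"` before calling `generate`, which suggests the image is saved to that path as a side effect. Assuming that behavior, a quick check after generation can confirm that a non-empty file was written. The helper name `image_was_written` is hypothetical, and it only inspects the file system; it does not validate the image contents.

```python
# Hypothetical post-generation check, assuming the model saves the image
# to the path stored in model._IMAGE_GEN_PATH.
from pathlib import Path


def image_was_written(path: str, min_bytes: int = 1) -> bool:
    """Return True if `path` is an existing file of at least `min_bytes` bytes."""
    file_path = Path(path)
    return file_path.is_file() and file_path.stat().st_size >= min_bytes


if __name__ == "__main__":
    if image_was_written("output.png"):
        print("Generated image found at output.png")
    else:
        print("No image found; check the path and the generation log")
```

Setting a distinct path per prompt before each `generate` call avoids overwriting earlier results.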

📚 Documentation

The model is used through the transformers library. Basic information about the model:

Property	Details
Model Type	VARGPT-v1.1
Training Data	VARGPT-family/VARGPT_datasets
Metrics	accuracy, f1
Pipeline Tag	any-to-any
Library Name	transformers

📄 License

This project is licensed under the Apache-2.0 license.

📖 Citation

To cite the datasets and model, please use the following BibTeX entries:

@misc{zhuang2025vargptunifiedunderstandinggeneration,
      title={VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model}, 
      author={Xianwei Zhuang and Yuxin Xie and Yufan Deng and Liming Liang and Jinghan Ru and Yuguo Yin and Yuexian Zou},
      year={2025},
      eprint={2501.12327},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.12327}, 
}
@misc{zhuang2025vargptv11improvevisualautoregressive,
      title={VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning}, 
      author={Xianwei Zhuang and Yuxin Xie and Yufan Deng and Dongchao Yang and Liming Liang and Jinghan Ru and Yuguo Yin and Yuexian Zou},
      year={2025},
      eprint={2504.02949},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.02949}, 
}
