VARGPT-v1.1 open-source model: visual understanding and image generation in one free model

VARGPT V1.1

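Developed by VARGPT-family

VARGPT-v1.1 is a unified visual autoregressive model, improved through iterative instruction tuning and reinforcement learning, that handles both visual understanding and visual generation.

Task: text-to-image

Library: Transformers

Language: English. License: Apache-2.0. Tags: visual autoregressive unified model, multimodal understanding and generation, iterative instruction tuning

Downloads: 954

Released: 4/1/2025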

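Model overview

VARGPT-v1.1 is a multimodal large language model that supports both visual understanding and visual generation. It performs understanding through next-token prediction and generation through next-scale prediction.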

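Model features

Unified understanding and generation: a single model handles both visual understanding and visual generation.

Iterative instruction tuning: model performance is improved through iterative instruction tuning.

Reinforcement learning: reinforcement learning is used to further optimize the model.

Multimodal support: text and images are supported as both input and output.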

Model capabilities

Multimodal understanding

Text-to-image generation

Image captioning

Visual question answering

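Use cases

Creative design

Album cover design: generate a fantasy-style album cover from a text description. Result: an image that matches the description.

Content understanding

Meme explanation: explain the content and meaning of a meme in detail. Result: a detailed text explanation.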

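🚀 VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning

VARGPT-v1.1 is a large unified visual autoregressive model. It treats understanding and generation as two distinct paradigms inside one model: next-token prediction for visual understanding and next-scale prediction for visual generation. This page gives simple usage examples; see the GitHub repository for more details.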
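To make the difference between the two paradigms concrete, the toy sketch below counts decoding steps for each. It is only an illustration, not the VARGPT-v1.1 implementation: the function names and the coarse-to-fine scale schedule `[1, 2, 4, 8, 16]` are assumptions chosen for the example.

```python
# Toy comparison of decoding schedules; NOT the VARGPT-v1.1 implementation.
# The scale schedule below is a hypothetical example chosen for illustration.

def next_token_steps(height: int, width: int) -> int:
    """Raster-scan next-token prediction emits one token per step."""
    return height * width


def next_scale_steps(scales: list[int]) -> list[tuple[int, int]]:
    """Next-scale prediction emits a whole s x s token map per step.

    Returns (step_index, tokens_emitted_in_that_step) pairs.
    """
    return [(step, s * s) for step, s in enumerate(scales, start=1)]


scales = [1, 2, 4, 8, 16]  # hypothetical coarse-to-fine schedule
schedule = next_scale_steps(scales)

for step, n_tokens in schedule:
    side = scales[step - 1]
    print(f"step {step}: predict a {side}x{side} map ({n_tokens} tokens)")

print("next-scale steps:", len(schedule))
print("next-token steps for a 16x16 grid:", next_token_steps(16, 16))
```

Under these assumptions, next-scale prediction needs one step per scale, while token-by-token decoding of the final 16x16 grid alone would need one step per grid cell.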

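🚀 Quick start

VARGPT-v1.1 (7B + 2B) treats understanding and generation as two distinct paradigms inside one model: next-token prediction for visual understanding and next-scale prediction for visual generation.

The examples below show a simple inference workflow for the model. For more details, see GitHub.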
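Because the usage examples load the weights in `torch.float32`, it can help to estimate the memory the weights alone would need. The sketch below is back-of-the-envelope arithmetic only. It assumes that "7B + 2B" means roughly 9 billion parameters in total, and it ignores activations and other runtime overhead. The half-precision rows are shown for comparison; the source does not state whether the model code supports them.

```python
# Back-of-the-envelope weight memory; parameter counts are assumed from "7B + 2B".
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


n_params = 7e9 + 2e9  # assumption: roughly 9 billion parameters in total
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(n_params, dtype):.0f} GB")
```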

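✨ Main features

Multimodal understanding: understands and analyzes multimodal input such as images and text.
Multimodal generation: supports text-to-image generation.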

💻 Usage examples

Basic usage

Multimodal understanding

The following code runs inference for multimodal understanding:

# Multimodal understanding: explain an image with VARGPT-v1.1
from PIL import Image
import torch
from transformers import AutoTokenizer
from vargpt_qwen_v1_1.modeling_vargpt_qwen2_vl import VARGPTQwen2VLForConditionalGeneration
from vargpt_qwen_v1_1.prepare_vargpt_v1_1 import prepare_vargpt_qwen2vl_v1_1
from vargpt_qwen_v1_1.processing_vargpt_qwen2_vl import VARGPTQwen2VLProcessor
from patching_utils.patching import patching

model_id = "VARGPT-family/VARGPT-v1.1"

prepare_vargpt_qwen2vl_v1_1(model_id)

model = VARGPTQwen2VLForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float32, 
    low_cpu_mem_usage=True, 
).to(0)

patching(model)

tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = VARGPTQwen2VLProcessor.from_pretrained(model_id)

# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image") 
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please explain the meme in detail."},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
image_file = "./assets/llava_bench_demo.png"
print(prompt)

raw_image = Image.open(image_file)
inputs = processor(images=[raw_image], text=prompt, return_tensors='pt').to(0, torch.float32)

output = model.generate(
    **inputs, 
    max_new_tokens=2048, 
    do_sample=False)

print(processor.decode(output[0], skip_special_tokens=True))
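The conversation structure used above (a list of turns, each with a `content` list of typed dicts) can be built with a small helper when several prompts are needed. The sketch below is only an illustration: `build_user_turn` is a hypothetical name, not part of the VARGPT code base, and it assumes that each `{"type": "image"}` placeholder corresponds to one image passed to the processor.

```python
from typing import Any


def build_user_turn(text: str, num_images: int = 0) -> list[dict[str, Any]]:
    """Build a single-turn conversation in the format used by the example above.

    Hypothetical helper for illustration; not part of the VARGPT code base.
    """
    content: list[dict[str, str]] = [{"type": "text", "text": text}]
    # Assumption: one {"type": "image"} placeholder per image given to the processor.
    content += [{"type": "image"} for _ in range(num_images)]
    return [{"role": "user", "content": content}]


conversation = build_user_turn("Please explain the meme in detail.", num_images=1)
print(conversation)
```

The result can be passed to `processor.apply_chat_template(conversation, add_generation_prompt=True)` in the same way as the hand-written list.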

Multimodal generation

The following code runs inference for text-to-image generation:

# Text-to-image generation with VARGPT-v1.1
import torch
from transformers import AutoTokenizer
from vargpt_qwen_v1_1.modeling_vargpt_qwen2_vl import VARGPTQwen2VLForConditionalGeneration
from vargpt_qwen_v1_1.prepare_vargpt_v1_1 import prepare_vargpt_qwen2vl_v1_1
from vargpt_qwen_v1_1.processing_vargpt_qwen2_vl import VARGPTQwen2VLProcessor
from patching_utils.patching import patching

model_id = "VARGPT-family/VARGPT-v1.1"

prepare_vargpt_qwen2vl_v1_1(model_id)

model = VARGPTQwen2VLForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float32,     
    low_cpu_mem_usage=True, 
).to(0)

patching(model)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = VARGPTQwen2VLProcessor.from_pretrained(model_id)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Can you depict a scene of A power metal album cover featuring a fantasy-style illustration with a white falcon."},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)

inputs = processor(text=prompt, return_tensors='pt').to(0, torch.float32)
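# Set the image output path before generating; the attribute name suggests the image is saved here.
model._IMAGE_GEN_PATH = "output.png"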
output = model.generate(
    **inputs, 
    max_new_tokens=4096, 
    do_sample=False)

print(processor.decode(output[0][:-1], skip_special_tokens=True))
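When generating images for several prompts, each call would need its own value for `model._IMAGE_GEN_PATH` so that earlier results are not overwritten. The sketch below shows one possible way to derive file names from prompts. `output_path_for`, the `outputs` directory, and the second prompt are hypothetical and not part of the VARGPT code base. The directory may also need to be created beforehand.

```python
import re
from pathlib import Path


def output_path_for(prompt: str, out_dir: str = "outputs", max_len: int = 60) -> Path:
    """Derive a .png file name from a prompt (hypothetical helper, not part of VARGPT)."""
    # Lowercase, replace runs of non-alphanumeric characters with "-", then trim.
    slug = re.sub(r"[^a-z0-9]+", "-", prompt.lower()).strip("-")[:max_len].rstrip("-")
    return Path(out_dir) / f"{slug or 'image'}.png"


prompts = [
    "A power metal album cover with a white falcon.",
    "A watercolor painting of a lighthouse.",
]
paths = [output_path_for(p) for p in prompts]
for path in paths:
    print(path)

# With the model loaded as in the example above, each path would be assigned
# before the corresponding generate() call, for example:
#   model._IMAGE_GEN_PATH = str(paths[0])
```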

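📚 Detailed documentation

Datasets and model information for this project:

Attribute	Details
Model type	VARGPT-v1.1
Training data	VARGPT-family/VARGPT_datasets
Evaluation metrics	Accuracy, F1
Task type	Any-to-any
Library	transformers
License	Apache-2.0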

📄 License

This project is released under the Apache-2.0 license.

📚 Citation

To cite the datasets and models of this project, use the following BibTeX entries:

@misc{zhuang2025vargptunifiedunderstandinggeneration,
      title={VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model}, 
      author={Xianwei Zhuang and Yuxin Xie and Yufan Deng and Liming Liang and Jinghan Ru and Yuguo Yin and Yuexian Zou},
      year={2025},
      eprint={2501.12327},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.12327}, 
}
@misc{zhuang2025vargptv11improvevisualautoregressive,
      title={VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning}, 
      author={Xianwei Zhuang and Yuxin Xie and Yufan Deng and Dongchao Yang and Liming Liang and Jinghan Ru and Yuguo Yin and Yuexian Zou},
      year={2025},
      eprint={2504.02949},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.02949}, 
}