# 🚀 VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
VARGPT (7B + 2B) models understanding and generation as two distinct paradigms within a unified model: next-token prediction for visual understanding and next-scale prediction for visual generation. A single checkpoint therefore handles both multimodal understanding and multimodal generation.

We provide a simple generation pipeline below. For more details, see the GitHub repository: VARGPT-v1.
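As a rough intuition for the two paradigms, the toy sketch below contrasts next-token decoding, which appends one token per step, with next-scale decoding, which emits a whole token map per step at growing resolutions. The function names are hypothetical stand-ins, not the VARGPT API.

```python
import random

def predict_next_token(context):
    # Hypothetical stand-in for the LLM head: one new text token id per step.
    return random.randrange(32000)

def predict_next_scale(prev_maps, side):
    # Hypothetical stand-in for the visual head: a full side x side token map,
    # conditioned on all coarser maps produced so far.
    return [[random.randrange(4096) for _ in range(side)] for _ in range(side)]

# Understanding: autoregression over tokens (one token per step).
text_tokens = [1]  # BOS
for _ in range(8):
    text_tokens.append(predict_next_token(text_tokens))

# Generation: autoregression over scales (one whole map per step, coarse to fine).
token_maps = []
for side in (1, 2, 4, 8):
    token_maps.append(predict_next_scale(token_maps, side))

print(len(text_tokens), [len(m) for m in token_maps])
```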
## 🚀 Quick Start
### Multimodal Understanding

The following demo shows inference for multimodal understanding; you can run this code directly:
```python
import torch
from PIL import Image
from transformers import AutoTokenizer
from vargpt_llava.modeling_vargpt_llava import VARGPTLlavaForConditionalGeneration
from vargpt_llava.prepare_vargpt_llava import prepare_vargpt_llava
from vargpt_llava.processing_vargpt_llava import VARGPTLlavaProcessor
from patching_utils.patching import patching

model_id = "VARGPT_LLaVA-v1"

# Register the VARGPT architecture and prepare the checkpoint.
prepare_vargpt_llava(model_id)

model = VARGPTLlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
).to(0)  # move the model to GPU 0

# Apply VARGPT-specific patches to the loaded model.
patching(model)

tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = VARGPTLlavaProcessor.from_pretrained(model_id)

# Build a chat-style prompt with one text turn and one image slot.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please explain the meme in detail."},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)

raw_image = Image.open("./assets/llava_bench_demo.png")
inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to(0, torch.float32)

output = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=False,
)
print(processor.decode(output[0], skip_special_tokens=True))
```
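Note that `output[0]` contains the prompt tokens followed by the newly generated ones. To print only the model's answer, you can slice off the prompt first, a standard `transformers` pattern:

```python
# Decode only the newly generated tokens, skipping the prompt portion.
prompt_len = inputs["input_ids"].shape[-1]
print(processor.decode(output[0][prompt_len:], skip_special_tokens=True))
```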
### Multimodal Generation

The following demo shows inference for text-to-image generation; you can run this code directly:
```python
import torch
from transformers import AutoTokenizer
from vargpt_llava.modeling_vargpt_llava import VARGPTLlavaForConditionalGeneration
from vargpt_llava.prepare_vargpt_llava import prepare_vargpt_llava
from vargpt_llava.processing_vargpt_llava import VARGPTLlavaProcessor
from patching_utils.patching import patching

model_id = "VARGPT_LLaVA-v1"

# Register the VARGPT architecture and prepare the checkpoint.
prepare_vargpt_llava(model_id)

model = VARGPTLlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
).to(0)  # move the model to GPU 0

# Apply VARGPT-specific patches to the loaded model.
patching(model)

tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = VARGPTLlavaProcessor.from_pretrained(model_id)

# Text-only prompt: the model emits image tokens during generation.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please design a drawing of a butterfly on a flower."},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)

inputs = processor(text=prompt, return_tensors="pt").to(0, torch.float32)

# Path where the generated image will be written.
model._IMAGE_GEN_PATH = "output.png"

output = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=False,
)
print(processor.decode(output[0], skip_special_tokens=True))
```
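After generation completes, the image is written to the path assigned to `model._IMAGE_GEN_PATH`. A quick way to inspect the result (assuming generation succeeded and the file exists):

```python
from PIL import Image

# Load and inspect the image written by the generation step above.
img = Image.open("output.png")
print(img.size)
```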
## 📄 License

This project is released under the Apache-2.0 license.
## 📦 Model Information

| Attribute | Details |
|-----------|---------|
| Model type | VARGPT (7B + 2B) |
| Training data | VARGPT-family/VARGPT_datasets |
| Evaluation metrics | Accuracy, F1 |
| Task type | Any-to-any |
| Library name | transformers |