VARGPT_LLaVA-v1: Open-Source Multimodal Model with Visual Understanding and Generation Capabilities

VARGPT LLaVA V1

Developed by VARGPT-family

VARGPT is a unified multimodal model that combines visual understanding and generation capabilities, achieving understanding by predicting the next token and generation by predicting the next scale.

Text-to-Image

Transformers

Language: English
Open Source License: Apache-2.0
#Multimodal understanding and generation #Visual autoregression #Unified modeling

Downloads 4,291

Release Time: 1/21/2025

Model Overview

VARGPT is a 7B+2B parameter multimodal large language model capable of handling both visual understanding and generation tasks, supporting English interaction.
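
As a rough guide to hardware needs, the sketch below estimates the memory taken by the weights alone. It assumes the nominal "7B+2B" figure is the true parameter count, which may not be exact, and it ignores activations and caches. The usage examples on this page load the model in float32; the half-precision rows are shown only for comparison, and whether the model works in those types is not stated here. The function name `weight_memory_gib` is hypothetical.

```python
# Nominal parameter count taken from the "7B+2B" description (an assumption;
# the real count may differ). Activations and caches are not included.
NOMINAL_PARAMS = 7e9 + 2e9

# Storage size of one parameter for common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params, dtype):
    """Return the memory needed to hold the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gib(NOMINAL_PARAMS, dtype):.1f} GiB")
```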

Model Features

Unified Understanding and Generation

Integrates both visual understanding and generation paradigms in a single model

Multimodal Interaction

Supports joint processing and generation of images and text

Autoregressive Prediction

Achieves continuous generation by predicting the next token/scale

Model Capabilities

Image content understanding

Text-to-image generation

Multimodal dialogue

Visual question answering

Use Cases

Creative Design

Art Creation

Generate artwork based on text descriptions

Produces artistic images matching the description

Content Analysis

Meme Interpretation

Explain the meaning of image memes

Outputs textual explanations of image content

🚀 VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model

VARGPT (7B+2B) models understanding and generation as two distinct paradigms within a unified model: predicting the next token for visual understanding and predicting the next scale for visual generation. This approach enables more efficient and accurate multimodal processing.
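
To make the difference between the two paradigms concrete, the sketch below counts autoregressive steps for a coarse-to-fine scale schedule. The schedule is an assumption: it is the 256x256 configuration commonly associated with the original VAR work, and VARGPT's actual schedule is not given on this page. The helper names are hypothetical.

```python
# Illustrative only: an assumed scale schedule (side length of the token map
# at each scale). VARGPT's real schedule is not stated in this document.
ASSUMED_PATCH_NUMS = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)


def tokens_per_scale(patch_nums):
    """Each scale is a square token map of side p, so it holds p * p tokens."""
    return [p * p for p in patch_nums]


def autoregressive_steps(patch_nums):
    """Compare one step per scale with one step per token of the final map."""
    final_side = patch_nums[-1]
    return {
        "next_scale_steps": len(patch_nums),
        "next_token_steps": final_side * final_side,
        "total_tokens_across_scales": sum(tokens_per_scale(patch_nums)),
    }


if __name__ == "__main__":
    print(tokens_per_scale(ASSUMED_PATCH_NUMS))
    print(autoregressive_steps(ASSUMED_PATCH_NUMS))
```

Under this assumed schedule, next-scale prediction emits a whole token map per step, so the number of sequential steps equals the number of scales rather than the number of tokens in the final map.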

We provide a simple generation procedure for using our model. For more in-depth details, please refer to our GitHub repository: VARGPT-v1.

📦 Dataset and Model Information

Property	Details
License	Apache-2.0
Datasets	VARGPT-family/VARGPT_datasets
Language	English
Metrics	Accuracy, F1
Pipeline Tag	Any-to-any
Library Name	transformers

🚀 Quick Start

✨ Features

VARGPT models visual understanding and generation as two different paradigms in a single model, offering a unified approach for multimodal tasks.

💻 Usage Examples

🔍 Multimodal Understanding

Inference demo for Multimodal Understanding. You can execute the following code:

# Load the model and processor, then ask the model to explain an image
from PIL import Image

import torch
from transformers import AutoTokenizer
from vargpt_llava.modeling_vargpt_llava import VARGPTLlavaForConditionalGeneration
from vargpt_llava.prepare_vargpt_llava import prepare_vargpt_llava 
from vargpt_llava.processing_vargpt_llava import VARGPTLlavaProcessor
from patching_utils.patching import patching

model_id = "VARGPT_LLaVA-v1"
prepare_vargpt_llava(model_id)

model = VARGPTLlavaForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float32, 
    low_cpu_mem_usage=True, 
).to(0)
patching(model)

tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = VARGPTLlavaProcessor.from_pretrained(model_id)

# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image") 
conversation = [
    {
      "role": "user",
      "content": [
          {"type": "text", "text": "Please explain the meme in detail."},
          {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
image_file = "./assets/llava_bench_demo.png"
print(prompt)

raw_image = Image.open(image_file)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, torch.float32)

output = model.generate(
    **inputs, 
    max_new_tokens=2048, 
    do_sample=False)

print(processor.decode(output[0], skip_special_tokens=True))
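
The conversation structure used above can be wrapped in a small helper so the same layout serves both text-only and image prompts. This is a minimal sketch: `build_conversation` is a hypothetical name, not part of the VARGPT code, and it only reproduces the list-of-dicts layout shown in the example.

```python
def build_conversation(text, with_image=False):
    """Return a single-turn user conversation in the layout used above:
    a text entry first, then an optional image placeholder."""
    content = [{"type": "text", "text": text}]
    if with_image:
        # The image itself is passed separately to the processor; this entry
        # only marks that the prompt includes an image.
        content.append({"type": "image"})
    return [{"role": "user", "content": content}]


if __name__ == "__main__":
    print(build_conversation("Please explain the meme in detail.", with_image=True))
```

The returned list can be passed to `processor.apply_chat_template` in place of the hand-written `conversation` literal.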

🎨 Multimodal Generation

Inference demo for Text-to-Image Generation. You can execute the following code:

import torch
from transformers import AutoTokenizer
from vargpt_llava.modeling_vargpt_llava import VARGPTLlavaForConditionalGeneration
from vargpt_llava.prepare_vargpt_llava import prepare_vargpt_llava 
from vargpt_llava.processing_vargpt_llava import VARGPTLlavaProcessor
from patching_utils.patching import patching
model_id = "VARGPT_LLaVA-v1"

prepare_vargpt_llava(model_id)

model = VARGPTLlavaForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float32, 
    low_cpu_mem_usage=True, 
).to(0)

patching(model)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = VARGPTLlavaProcessor.from_pretrained(model_id)

# some instruction examples:
# Please design a drawing of a butterfly on a flower.
# Please create a painting of a black weasel is standing in the grass.
# Can you generate a rendered photo of a rabbit sitting in the grass.
# I need a designed photo of a lighthouse is seen in the distance.
# Please create a rendered drawing of an old photo of an aircraft carrier in the water.
# Please produce a designed photo of a squirrel is standing in the snow.


conversation = [
    {
      "role": "user",
      "content": [
          {"type": "text", "text": "Please design a drawing of a butterfly on a flower."},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)

inputs = processor(text=prompt, return_tensors='pt').to(0, torch.float32)
# Output path for the generated image
model._IMAGE_GEN_PATH = "output.png"
output = model.generate(
    **inputs, 
    max_new_tokens=2048, 
    do_sample=False)

print(processor.decode(output[0], skip_special_tokens=True))
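
After generation, it can be useful to confirm that an image was actually written. The sketch below reads the width and height from a PNG header using only the standard library. It assumes the model saves a PNG file at the path set in `_IMAGE_GEN_PATH`, which is inferred from the example above rather than documented here. `read_png_size` is a hypothetical helper.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_size(path):
    """Return (width, height) read from the IHDR chunk of a PNG file.

    Raises FileNotFoundError if the file is missing and ValueError if the
    file does not start with a PNG signature followed by an IHDR chunk.
    """
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"{path} is not a valid PNG file")
    # Width and height are two big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", header[16:24])
    return width, height


if __name__ == "__main__":
    # Assumed output path, matching the example above.
    try:
        print(read_png_size("output.png"))
    except (FileNotFoundError, ValueError) as err:
        print(f"No usable image found: {err}")
```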

📄 License

This project is licensed under the Apache-2.0 license.
