🚀 UForm-Gen2-dpo: A Generative Vision-Language Model
UForm-Gen2-dpo is a compact generative vision-language model tailored for image captioning and visual question answering. It is fine-tuned on the preference datasets VLFeedback and LLaVA-Human-Preference-10K using Direct Preference Optimization (DPO).
Key Information

| Property | Details |
| --- | --- |
| Library Name | transformers |
| Tags | image-captioning, visual-question-answering |
| License | apache-2.0 |
| Datasets | X2FD/LVIS-Instruct4V, BAAI/SVIT, HuggingFaceH4/ultrachat_200k, MMInstruction/VLFeedback, zhiqings/LLaVA-Human-Preference-10K |
| Pipeline Tag | image-to-text |
Model Widget Examples
- Detailed caption: interior.jpg
- Output: "The image shows a serene and well-lit bedroom with a white bed, a black bed frame, and a white comforter. There's a gray armchair with a white cushion, a black dresser with a mirror and a vase, and a white rug on the floor. The room has a large window with white curtains, and there are several decorative items, including a picture frame, a vase with a flower, and a lamp. The room is well-organized and has a calming atmosphere."
- Short caption: cat.jpg
- Output: "A white and orange cat stands on its hind legs, reaching towards a wooden table with a white teapot and a basket of red raspberries. The table is on a small wooden bench, surrounded by orange flowers. The cat's position and action create a serene, playful scene in a garden."

🚀 Quick Start
Model Composition
The UForm-Gen2-dpo model consists of two main parts:
- CLIP-like ViT-H/14
- [Qwen1.5-0.5B-Chat](https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat)
Training Information
The model was trained in less than one day on a DGX-H100 with 8x H100 GPUs. Thanks to Nebius.ai for providing the compute resources 🤗
✨ Features
The generative model can be used for multiple purposes:
- Generate captions for images.
- Answer questions about images.
- Engage in multimodal chat.
💻 Usage Examples
Basic Usage
```python
from PIL import Image
import torch
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("unum-cloud/uform-gen2-dpo", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("unum-cloud/uform-gen2-dpo", trust_remote_code=True)

prompt = "Question or Instruction"
image = Image.open("image.jpg")

inputs = processor(text=[prompt], images=[image], return_tensors="pt")

with torch.inference_mode():
    output = model.generate(
        **inputs,
        do_sample=False,
        use_cache=True,
        max_new_tokens=256,
        eos_token_id=151645,
        pad_token_id=processor.tokenizer.pad_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
decoded_text = processor.batch_decode(output[:, prompt_len:])[0]
```
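Since the example decodes without skipping special tokens, the output string may end with the model's end-of-turn marker. A minimal post-processing helper, assuming the `eos_token_id=151645` used above corresponds to Qwen's `<|im_end|>` token:

```python
def strip_eos(text: str, eos: str = "<|im_end|>") -> str:
    """Remove a trailing end-of-turn marker and surrounding whitespace."""
    text = text.strip()
    if text.endswith(eos):
        text = text[: -len(eos)].rstrip()
    return text

print(strip_eos("A white and orange cat stands on a table.<|im_end|>"))
```

Alternatively, passing `skip_special_tokens=True` to `batch_decode` achieves the same result.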
You can check examples of different prompts in our demo space.
📚 Documentation
Evaluation Results
The model is evaluated on the MME Benchmark across multiple categories:
| Model | perception | reasoning | OCR | artwork | celebrity | code_reasoning | color | commonsense_reasoning | count | existence | landmark | numerical_calculation | position | posters | scene | text_translation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| uform-gen2-dpo | 1,048.75 | 224.64 | 72.50 | 97.25 | 62.65 | 67.50 | 123.33 | 57.14 | 136.67 | 195.00 | 104.00 | 50.00 | 51.67 | 59.18 | 146.50 | 50.00 |
| uform-gen2-qwen-500m | 863.40 | 236.43 | 57.50 | 93.00 | 67.06 | 57.50 | 78.33 | 81.43 | 53.33 | 150.00 | 98.00 | 50.00 | 50.00 | 62.93 | 153.25 | 47.50 |
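In MME, the aggregate `perception` score is the sum of the ten perception sub-task scores (existence, count, position, color, posters, celebrity, scene, landmark, artwork, OCR), and `reasoning` is the sum of the four cognition sub-tasks. A quick sanity check against the uform-gen2-dpo row above:

```python
perception = {
    "existence": 195.00, "count": 136.67, "position": 51.67, "color": 123.33,
    "posters": 59.18, "celebrity": 62.65, "scene": 146.50, "landmark": 104.00,
    "artwork": 97.25, "OCR": 72.50,
}
reasoning = {
    "commonsense_reasoning": 57.14, "numerical_calculation": 50.00,
    "text_translation": 50.00, "code_reasoning": 67.50,
}

print(round(sum(perception.values()), 2))  # 1048.75
print(round(sum(reasoning.values()), 2))   # 224.64
```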
📄 License
This project is licensed under the Apache-2.0 license.