🚀 UForm-Gen2-dpo: A Generative Vision-Language Model
UForm-Gen2-dpo is a compact generative vision-language model tailored for image captioning and visual question answering. It is fine-tuned on the preference datasets VLFeedback and LLaVA-Human-Preference-10K using Direct Preference Optimization (DPO).
Key Information

| Property | Details |
| --- | --- |
| Library Name | transformers |
| Tags | image-captioning, visual-question-answering |
| License | apache-2.0 |
| Datasets | X2FD/LVIS-Instruct4V, BAAI/SVIT, HuggingFaceH4/ultrachat_200k, MMInstruction/VLFeedback, zhiqings/LLaVA-Human-Preference-10K |
| Pipeline Tag | image-to-text |
Model Widget Examples
- Detailed caption: interior.jpg
- Output: "The image shows a serene and well-lit bedroom with a white bed, a black bed frame, and a white comforter. There's a gray armchair with a white cushion, a black dresser with a mirror and a vase, and a white rug on the floor. The room has a large window with white curtains, and there are several decorative items, including a picture frame, a vase with a flower, and a lamp. The room is well-organized and has a calming atmosphere."
- Short caption: cat.jpg
- Output: "A white and orange cat stands on its hind legs, reaching towards a wooden table with a white teapot and a basket of red raspberries. The table is on a small wooden bench, surrounded by orange flowers. The cat's position and action create a serene, playful scene in a garden."

🚀 Quick Start
Model Composition
The UForm-Gen2-dpo model consists of two main parts:
- CLIP-like ViT-H/14
- [Qwen1.5-0.5B-Chat](https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat)
Training Information
The model was trained in less than one day on a DGX-H100 with 8x H100 GPUs. Thanks to Nebius.ai for providing the compute resources 🤗
✨ Features
The generative model can be used for multiple purposes:
- Generate captions for images.
- Answer questions about images.
- Engage in multimodal chat.
💻 Usage Examples
Basic Usage
```python
from PIL import Image
import torch
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("unum-cloud/uform-gen2-dpo", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("unum-cloud/uform-gen2-dpo", trust_remote_code=True)

prompt = "Question or Instruction"
image = Image.open("image.jpg")

inputs = processor(text=[prompt], images=[image], return_tensors="pt")

with torch.inference_mode():
    output = model.generate(
        **inputs,
        do_sample=False,
        use_cache=True,
        max_new_tokens=256,
        eos_token_id=151645,
        pad_token_id=processor.tokenizer.pad_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
decoded_text = processor.batch_decode(output[:, prompt_len:])[0]
```
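Since the example decodes without skipping special tokens, the output string may end with the model's end-of-turn marker. A minimal post-processing helper, assuming the `eos_token_id=151645` used above corresponds to Qwen's `<|im_end|>` token:

```python
def strip_eos(text: str, eos: str = "<|im_end|>") -> str:
    """Remove a trailing end-of-turn marker and surrounding whitespace."""
    text = text.strip()
    if text.endswith(eos):
        text = text[: -len(eos)].rstrip()
    return text

print(strip_eos("A white and orange cat stands on a table.<|im_end|>"))
```

Alternatively, passing `skip_special_tokens=True` to `batch_decode` achieves the same result.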
You can check examples of different prompts in our demo space.
📚 Documentation
Evaluation Results
The model is evaluated on the MME Benchmark across multiple categories:
| Model | perception | reasoning | OCR | artwork | celebrity | code_reasoning | color | commonsense_reasoning | count | existence | landmark | numerical_calculation | position | posters | scene | text_translation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| uform-gen2-dpo | 1,048.75 | 224.64 | 72.50 | 97.25 | 62.65 | 67.50 | 123.33 | 57.14 | 136.67 | 195.00 | 104.00 | 50.00 | 51.67 | 59.18 | 146.50 | 50.00 |
| uform-gen2-qwen-500m | 863.40 | 236.43 | 57.50 | 93.00 | 67.06 | 57.50 | 78.33 | 81.43 | 53.33 | 150.00 | 98.00 | 50.00 | 50.00 | 62.93 | 153.25 | 47.50 |
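In MME, the aggregate `perception` score is the sum of the ten perception sub-task scores (existence, count, position, color, posters, celebrity, scene, landmark, artwork, OCR), and `reasoning` is the sum of the four cognition sub-tasks. A quick sanity check against the uform-gen2-dpo row above:

```python
perception = {
    "existence": 195.00, "count": 136.67, "position": 51.67, "color": 123.33,
    "posters": 59.18, "celebrity": 62.65, "scene": 146.50, "landmark": 104.00,
    "artwork": 97.25, "OCR": 72.50,
}
reasoning = {
    "commonsense_reasoning": 57.14, "numerical_calculation": 50.00,
    "text_translation": 50.00, "code_reasoning": 67.50,
}

print(round(sum(perception.values()), 2))  # 1048.75
print(round(sum(reasoning.values()), 2))   # 224.64
```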
📄 License
This project is licensed under the Apache-2.0 license.