uform-gen: An Open-Source Vision-Language Model for Free Deployment of Image Description Generation and Visual Question Answering

UForm-Gen

Developed by unum-cloud

UForm-Gen is a small generative vision-language model primarily used for image caption generation and visual question answering.

Image-to-Text

Transformers

English

Open Source License: Apache-2.0

#Lightweight Multimodal #Image Caption Generation #Visual Question Answering

Downloads 152

Release Time: 12/25/2023

Model Overview

UForm-Gen is a pocket-sized multimodal AI model that combines a visual encoder with a language model for content understanding and generation, aimed mainly at image captioning and visual question answering.

Model Features

Lightweight and Efficient

A compact model with only 1.5B parameters. On an RTX 3090 with float16 and greedy decoding it generates about 140 tokens/second, roughly 3.5 times the ~40 tokens/second of the 7B models it is compared against.

Multimodal Understanding

Combines visual and linguistic capabilities to process both image and text inputs simultaneously

Versatile Generation

Can perform various tasks such as image captioning, content summarization, or visual question answering through prompt control

Model Capabilities

Image caption generation

Visual question answering

Content summarization

Multimodal understanding

Use Cases

Content Understanding

Image Captioning

Generate detailed or concise textual descriptions for images

CLIPScore of 0.847 for long captions and 0.842 for short captions

Visual Question Answering

Answer natural language questions about image content

66.5% accuracy on the VQAv2 dataset

Content Creation

Social Media Content Generation

Automatically generate captions for social media images

🚀 UForm

Pocket-Sized Multimodal AI for Content Understanding and Generation

🚀 Quick Start

To get started with UForm, you first need to install it. You can do this using the following command:

pip install uform

The generative model can be used to caption images, summarize their content, or answer questions about them. The exact behavior is controlled by prompts.

import torch
from PIL import Image
from uform.gen_model import VLMForCausalLM, VLMProcessor

model = VLMForCausalLM.from_pretrained("unum-cloud/uform-gen")
processor = VLMProcessor.from_pretrained("unum-cloud/uform-gen")

# The prompt prefix selects the task. Example prompts:
# [cap] Narrate the contents of the image with precision.
# [cap] Summarize the visual content of the image.
# [vqa] What is the main subject of the image?
prompt = "[cap] Summarize the visual content of the image."
image = Image.open("zebra.jpg")

inputs = processor(texts=[prompt], images=[image], return_tensors="pt")
with torch.inference_mode():
    output = model.generate(
        **inputs,
        do_sample=False,  # greedy decoding
        use_cache=True,
        max_new_tokens=128,
        eos_token_id=32001,
        pad_token_id=processor.tokenizer.pad_token_id,
    )

# Skip the prompt tokens and decode only the newly generated text.
prompt_len = inputs["input_ids"].shape[1]
decoded_text = processor.batch_decode(output[:, prompt_len:])[0]
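The example prompts suggest that a task tag (`[cap]` or `[vqa]`) prefixes the instruction. As a minimal sketch under that assumption, the helpers below assemble a tagged prompt and show, on plain Python lists, what the `output[:, prompt_len:]` slice does: it drops the echoed prompt tokens so that only newly generated tokens are decoded. `build_prompt` and `strip_prompt` are hypothetical names used for illustration and are not part of the `uform` package.

```python
from typing import List

# Hypothetical helpers for illustration; they are not part of the uform package.
# The tags are the ones that appear in the example prompts.
TASK_TAGS = {"caption": "[cap]", "vqa": "[vqa]"}


def build_prompt(task: str, instruction: str) -> str:
    """Prefix an instruction with the task tag used in the example prompts."""
    if task not in TASK_TAGS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(TASK_TAGS)}")
    return f"{TASK_TAGS[task]} {instruction.strip()}"


def strip_prompt(sequences: List[List[int]], prompt_len: int) -> List[List[int]]:
    """Drop the first prompt_len token ids of every sequence.

    Mirrors output[:, prompt_len:] for a batch stored as nested lists.
    """
    return [seq[prompt_len:] for seq in sequences]


caption_prompt = build_prompt("caption", "Summarize the visual content of the image.")
question_prompt = build_prompt("vqa", "What is the main subject of the image?")

# Toy batch: three prompt token ids followed by two generated ids (made-up numbers).
toy_output = [[11, 12, 13, 101, 102]]
generated_only = strip_prompt(toy_output, prompt_len=3)
```

Keeping the tag in one place makes it harder to send a question with a captioning tag by accident, which would silently change the model's behavior.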

✨ Features

UForm-Gen is a small generative vision-language model primarily designed for Image Captioning and Visual Question Answering. The model consists of two parts:

- uform-vl-english visual encoder,
- Sheared-LLaMA-1.3B language model tuned on instruction datasets.

The model was pre-trained on MSCOCO, SBU Captions, Visual Genome, VQAv2, GQA, and a few internal datasets.

📚 Documentation

Model Information

Property	Details
Pipeline Tag	image-to-text
Tags	image-captioning, visual-question-answering
Datasets	sbu_captions, visual_genome, HuggingFaceM4/VQAv2, ChristophSchuhmann/MS_COCO_2017_URL_TEXT
Language	en
License	apache-2.0
Base Model	unum-cloud/uform-vl-english

Widget Preview

Image 1: preview-interior.png
- Output: "The living room is cozy, featuring a red leather chair and a white table. The chair is in the center, and the table is on the left side. A lamp on the left side illuminates the space. A large picture hangs on the wall, adding artistic flair. A vase on the table adds a decorative touch. The room is well-lit, creating a warm and inviting atmosphere."
Image 2: preview-girl.png
- Output: "A young girl stands in a grassy field, holding an umbrella to shield herself from the rain. She dons a yellow dress and seems to relish her time outdoors. The umbrella is open, offering protection from the rain. The field is bordered by trees, fostering a tranquil and natural ambiance."

🔧 Technical Details

Evaluation

For captioning evaluation we measure CLIPScore and RefCLIPScore¹.

Model	Size	Caption Length	CLIPScore	RefCLIPScore
`llava-hf/llava-1.5-7b-hf`	7B	Long	0.878	0.529
`llava-hf/llava-1.5-7b-hf`	7B	Short	0.886	0.531
`Salesforce/instructblip-vicuna-7b`	7B	Long	0.902	0.534
`Salesforce/instructblip-vicuna-7b`	7B	Short	0.848	0.523
`unum-cloud/uform-gen`	1.5B	Long	0.847	0.523
`unum-cloud/uform-gen`	1.5B	Short	0.842	0.522
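For readers unfamiliar with these metrics, the sketch below follows the definitions published in the CLIPScore paper (Hessel et al., 2021): CLIPScore is a rescaled, clipped cosine similarity between the caption and image embeddings, and RefCLIPScore is the harmonic mean of that value and the best caption-to-reference similarity. The weight `w = 2.5` is the paper's default; whether the numbers in the table use the same scaling is not stated in the source, so treat it as an assumption. The toy vectors stand in for real CLIP embeddings.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score(caption_emb, image_emb, w: float = 2.5) -> float:
    """CLIPScore: w * max(cos(caption, image), 0). w = 2.5 is an assumed default."""
    return w * max(cosine(caption_emb, image_emb), 0.0)


def ref_clip_score(caption_emb, image_emb, reference_embs, w: float = 2.5) -> float:
    """RefCLIPScore: harmonic mean of CLIPScore and the best reference similarity."""
    image_term = clip_score(caption_emb, image_emb, w)
    ref_term = max(max(cosine(caption_emb, r) for r in reference_embs), 0.0)
    if image_term == 0.0 or ref_term == 0.0:
        return 0.0
    return 2.0 * image_term * ref_term / (image_term + ref_term)


# Toy 2-D "embeddings" (made up): cos(caption, image) = 0.6, cos(caption, ref) = 0.8.
caption = [3.0, 4.0]
image = [1.0, 0.0]
references = [[0.0, 1.0]]

toy_clip = clip_score(caption, image)
toy_ref_clip = ref_clip_score(caption, image, references)
```

In practice the embeddings would come from the CLIP model named in the footnote, with the caption passed through its text encoder and the image through its vision encoder.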

Results for VQAv2 evaluation.

Model	Size	Accuracy (%)
`llava-hf/llava-1.5-7b-hf`	7B	78.5
`unum-cloud/uform-gen`	1.5B	66.5
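VQAv2 accuracy is conventionally computed with the benchmark's soft metric, in which an answer counts as fully correct when at least three of the ten human annotators gave it. The source does not say which evaluation script produced the numbers above, so the following is only a simplified sketch of that standard metric. `vqa_accuracy` is a hypothetical helper; the official script also applies more thorough answer normalization and averages over annotator subsets.

```python
from typing import List


def vqa_accuracy(prediction: str, human_answers: List[str]) -> float:
    """Simplified VQA soft accuracy: min(number of matching annotators / 3, 1)."""
    pred = prediction.strip().lower()
    matches = sum(1 for answer in human_answers if answer.strip().lower() == pred)
    return min(matches / 3.0, 1.0)


# Ten made-up annotator answers for one question.
answers = ["zebra"] * 7 + ["horse"] * 2 + ["donkey"]

scores = [
    vqa_accuracy("Zebra", answers),  # 7 matches, capped at 1.0
    vqa_accuracy("horse", answers),  # 2 matches, partial credit
    vqa_accuracy("cow", answers),    # no matches
]

# Dataset-level accuracy is the mean per-question score, reported in percent.
mean_accuracy = 100.0 * sum(scores) / len(scores)
```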

¹ We used the apple/DFN5B-CLIP-ViT-H-14-378 CLIP model.

Speed

On an RTX 3090, the following text-generation throughput is expected with float16, equivalent PyTorch settings, and greedy decoding.

Model	Size	Speed	Speedup
`llava-hf/llava-1.5-7b-hf`	7B	~ 40 tokens/second	x 1.0 (baseline)
`Salesforce/instructblip-vicuna-7b`	7B	~ 40 tokens/second	x 1.0 (baseline)
`unum-cloud/uform-gen`	1.5B	~ 140 tokens/second	x 3.5
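The speedup figure is the ratio of throughputs, which is easy to check. The snippet below recomputes it from the approximate numbers in the table and estimates how long a full `max_new_tokens=128` generation would take at each rate. The latency estimate ignores image encoding and prompt processing, so it is a rough lower bound rather than a measured value.

```python
# Approximate throughputs (tokens/second) copied from the table above.
throughput = {
    "llava-hf/llava-1.5-7b-hf": 40.0,
    "Salesforce/instructblip-vicuna-7b": 40.0,
    "unum-cloud/uform-gen": 140.0,
}

# Assumption: the 7B models serve as the baseline for the speedup column.
baseline = throughput["llava-hf/llava-1.5-7b-hf"]
max_new_tokens = 128  # same limit as in the Quick Start example

speedups = {name: tps / baseline for name, tps in throughput.items()}
latencies = {name: max_new_tokens / tps for name, tps in throughput.items()}

for name in throughput:
    print(f"{name}: x{speedups[name]:.1f}, ~{latencies[name]:.2f} s for {max_new_tokens} tokens")
```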

📄 License

This project is licensed under the Apache-2.0 license.
