PaliGemma-3B-Chat-v0.2: Open-Source Multimodal Dialogue Model for Multi-turn Conversation

Developed by BUAADreamer

A multimodal dialogue model fine-tuned from google/paligemma-3b-mix-448 and optimized for multi-turn conversation.

Image-Text-to-Text

Transformers

Tags: Supports Multiple Languages, Multimodal Dialogue, Bilingual (Chinese-English), Visual Question Answering

Downloads 80

Release Time: 6/4/2024

Model Overview

This model is a vision-language model capable of understanding and generating natural language descriptions about image content, supporting multi-turn conversations in both English and Chinese.

Model Features

Multimodal Understanding

Capable of processing both image and text inputs, understanding image content, and generating relevant descriptions

Multi-turn Dialogue Optimization

Designed for conversational scenarios, supporting coherent multi-turn interactions

Bilingual Support

Supports both English and Chinese input and output

Efficient Fine-tuning

Updates only the language model and projection layer parameters; the vision encoder stays frozen

Model Capabilities

Image content understanding

Multi-turn dialogue

Bilingual text generation

Visual question answering

Use Cases

Intelligent Customer Service

Product Image Consultation

Users upload product images, and the model answers related questions

Provides accurate product descriptions and relevant information

Educational Assistance

Image Learning Assistant

Helps students understand image content in educational materials

Provides detailed image explanations and related knowledge points

Content Moderation

Image Content Analysis

Automatically identifies and describes the content of uploaded images

Assists manual review, improving efficiency

🚀 PaliGemma-3B-Chat-v0.2

This model is fine-tuned from google/paligemma-3b-mix-448 for multi-turn chat completions. It handles image-text-to-text tasks, so a user can ask questions about an image across several chat turns.

Try our live demo at: https://huggingface.co/spaces/llamafactory/PaliGemma-3B-Chat-v0.2

🚀 Quick Start

The following sections will guide you through the usage, training, and evaluation of the PaliGemma-3B-Chat-v0.2 model.

✨ Features

Multi-turn Chat: Fine-tuned for multi-turn chat completions.
Image-Text Integration: Handles image-text-to-text tasks, using both visual and textual input.

💻 Usage Examples

Basic Usage

import requests
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor, AutoTokenizer, TextStreamer

# Load the tokenizer, image processor, and model from the Hugging Face Hub.
model_id = "BUAADreamer/PaliGemma-3B-Chat-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
# Stream decoded tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Download a sample image and convert it to pixel values on the model's device.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = processor(images=[image], return_tensors="pt").to(model.device)["pixel_values"]

# Build the text prompt with the tokenizer's chat template.
messages = [
    {"role": "user", "content": "What is in this image?"}
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
# Prepend one <image> placeholder token per image embedding position.
image_token_id = tokenizer.convert_tokens_to_ids("<image>")
image_prefix = torch.empty((1, getattr(processor, "image_seq_length")), dtype=input_ids.dtype).fill_(image_token_id)
input_ids = torch.cat((image_prefix, input_ids), dim=-1).to(model.device)

# Generate up to 50 new tokens; the streamer prints the reply as it is produced.
generate_ids = model.generate(input_ids, pixel_values=pixel_values, streamer=streamer, max_new_tokens=50)
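Multi-turn Usage (sketch)

The basic example sends a single user message. For a follow-up question, one plausible approach is to append the model's decoded reply and the next user message to the history before applying the chat template again. The sketch below only manages that history. The add_turn helper is hypothetical (it is not part of transformers or of this model's repository), the assistant reply is a placeholder rather than real model output, and the strict user/assistant alternation is an assumption about what the chat template expects.

```python
from typing import Dict, List

Message = Dict[str, str]


def add_turn(messages: List[Message], role: str, content: str) -> List[Message]:
    """Return a new history with one more turn (hypothetical helper).

    Assumes turns must alternate user / assistant, starting with user.
    """
    expected = "user" if len(messages) % 2 == 0 else "assistant"
    if role != expected:
        raise ValueError(f"expected a '{expected}' turn, got '{role}'")
    return messages + [{"role": role, "content": content}]


history: List[Message] = []
history = add_turn(history, "user", "What is in this image?")
# Placeholder text standing in for the decoded model reply.
history = add_turn(history, "assistant", "<decoded reply from the first turn>")
history = add_turn(history, "user", "What color is it?")

print([m["role"] for m in history])
```

The resulting history would then replace messages in the basic example: pass it to tokenizer.apply_chat_template with add_generation_prompt=True, prepend the image prefix as before, and call model.generate again.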

🔧 Technical Details

Training procedure

We used LLaMA Factory to fine-tune this model. During fine-tuning, we froze the vision tower and updated the parameters of the language model and the projector layer.
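The freezing step described above can be pictured as a filter over parameter names. This is a minimal sketch, not LLaMA Factory's actual implementation: the helper names are hypothetical, the parameter names are illustrative stand-ins, and it assumes the vision encoder's parameters carry "vision_tower" in their names (real names depend on the transformers version).

```python
from types import SimpleNamespace


def is_trainable(param_name: str) -> bool:
    """Assumption: anything under the vision tower stays frozen."""
    return "vision_tower" not in param_name


def apply_freeze(named_parameters) -> int:
    """Set requires_grad on each parameter and return how many stay trainable."""
    trainable = 0
    for name, param in named_parameters:
        param.requires_grad = is_trainable(name)
        trainable += int(param.requires_grad)
    return trainable


# Stand-in objects so the sketch runs without loading the model.
# With a real model, pass model.named_parameters() instead.
fake_params = [
    ("vision_tower.vision_model.embeddings.patch_embedding.weight", SimpleNamespace(requires_grad=True)),
    ("multi_modal_projector.linear.weight", SimpleNamespace(requires_grad=True)),
    ("language_model.model.embed_tokens.weight", SimpleNamespace(requires_grad=True)),
]
n_trainable = apply_freeze(fake_params)
print(n_trainable, [p.requires_grad for _, p in fake_params])
```

Only parameters with requires_grad set to True receive gradient updates, so the projector and language model are tuned while the vision encoder keeps its pretrained weights.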

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.000003
num_train_epochs: 2.0
train_batch_size: 4
gradient_accumulation_steps: 16
total_train_batch_size: 64
seed: 42
lr_scheduler_type: cosine
mixed_precision_training: bf16
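
As a quick check on the batch-size figures above, the total of 64 matches train_batch_size times gradient_accumulation_steps. The config further down lists per_device_train_batch_size: 1, which would give the same total with four data-parallel devices; the device count is not stated in the source, so that second reading is an assumption. The effective_batch_size function is a hypothetical helper for the arithmetic only.

```python
def effective_batch_size(per_step_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Samples contributing to one optimizer update."""
    return per_step_batch * grad_accum_steps * num_devices


# Reading 1: the listed train_batch_size (4) times gradient accumulation (16).
listed = effective_batch_size(4, 16)

# Reading 2 (assumption): per-device batch of 1 on 4 devices, as the config might imply.
per_device = effective_batch_size(1, 16, num_devices=4)

print(listed, per_device)
```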

LLaMA Factory config:

### model
model_name_or_path: google/paligemma-3b-mix-448
visual_inputs: true

### method
stage: sft
do_train: true
finetuning_type: full

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: identity,llava_150k_en,llava_150k_zh
template: gemma
cutoff_len: 1536
overwrite_cache: true
preprocessing_num_workers: 16
tokenized_path: cache/paligemma-identity-llava-zh-en-300k

### output
output_dir: models/paligemma-3b-chat-v0.2
logging_steps: 10
save_steps: 1000
plot_loss: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 0.000003
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_steps: 50
bf16: true
do_eval: false

Framework versions

PyTorch 2.3.0
Transformers 4.41.0

📚 Documentation

Evaluation Results

Model	MMMU_Val	CMMMU_Val
Yi-VL-6B	36.8	32.2
PaliGemma-3B-Chat-v0.2	33.0	29.0

📄 License

This project is released under the gemma license.

Property	Details
Model Type	PaliGemma-3B-Chat-v0.2
Training Data	BUAADreamer/llava-en-zh-300k
Library Name	transformers
Pipeline Tag	image-text-to-text
Base Model	google/paligemma-3b-mix-448
Inference	false
Tags	paligemma, llama-factory, mllm, vlm
Language	en, zh
