PaliGemma-3B-Chat-v0.2: open-source multimodal chat model, free to deploy and suited to multi-turn dialogue

Paligemma 3B Chat V0.2

Developed by BUAADreamer

A multimodal chat model fine-tuned from google/paligemma-3b-mix-448 and optimized for multi-turn dialogue.

Image-text-to-text

Transformers

Supports multiple languages #MultimodalDialogue #ChineseEnglishBilingual #VisualQuestionAnswering

Downloads: 80

Released: 6/4/2024

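Model overview

This is a vision-language model that interprets images and generates natural-language text about their content. It supports multi-turn dialogue in Chinese and English.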

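Model features

Multimodal understanding

Accepts image and text input together, interprets the image content, and generates related descriptions.

Optimized for multi-turn dialogue

Designed for conversational use and supports coherent multi-turn interaction.

Bilingual support

Supports both English and Chinese for input and output.

Efficient fine-tuning

Only the language model and projection layer parameters are updated; the vision encoder stays frozen.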

Model capabilities

Image content understanding

Multi-turn dialogue

Bilingual text generation

Visual question answering

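Use cases

Customer service

Product image inquiries

Users upload a product image and the model answers questions about it.

Provides product descriptions and related information.

Education support

Image learning assistant

Helps students understand the images in teaching materials.

Provides detailed image explanations and related knowledge points.

Content moderation

Image content analysis

Automatically identifies and describes the content of uploaded images.

Assists human reviewers and improves efficiency.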

🚀 PaliGemma-3B-Chat-v0.2

This model is fine-tuned from google/paligemma-3b-mix-448 for multi-turn chat completion.

Try the live demo at: https://huggingface.co/spaces/llamafactory/PaliGemma-3B-Chat-v0.2

[Images: example_en, example_zh]

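🚀 Quick start

✨ Key features

Fine-tuned from the pretrained model google/paligemma-3b-mix-448 for multi-turn chat completion.
Supports image-text-to-text generation.
A live demo is available for trying the model.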

💻 Usage examples

Basic usage

import requests
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor, AutoTokenizer, TextStreamer

# Load the tokenizer, processor, and model; the streamer prints decoded tokens to stdout.
model_id = "BUAADreamer/PaliGemma-3B-Chat-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Download the example image and convert it to pixel values on the model's device.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = processor(images=[image], return_tensors="pt").to(model.device)["pixel_values"]

# Format the conversation with the chat template.
messages = [
    {"role": "user", "content": "What is in this image?"}
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
# Prepend image_seq_length copies of the <image> placeholder token to the prompt.
image_token_id = tokenizer.convert_tokens_to_ids("<image>")
image_prefix = torch.full((1, processor.image_seq_length), image_token_id, dtype=input_ids.dtype)
input_ids = torch.cat((image_prefix, input_ids), dim=-1).to(model.device)

# Generate up to 50 new tokens; the streamer prints them as they are produced.
generate_ids = model.generate(input_ids, pixel_values=pixel_values, streamer=streamer, max_new_tokens=50)
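
The basic example prepends a block of `<image>` placeholder tokens to the chat-formatted prompt before generation. The sketch below isolates that step, plus the bookkeeping for a multi-turn history, using plain Python lists so it runs without downloading the model. The helper names `prepend_image_tokens` and `append_turn`, the token ids, the short sequence length, and the assistant reply are all made-up values for illustration; in real use the ids come from `tokenizer.convert_tokens_to_ids("<image>")` and `processor.image_seq_length`, as in the example above. Strict user/assistant alternation is also an assumption about the chat template.

```python
from typing import Dict, List


def prepend_image_tokens(input_ids: List[int], image_token_id: int, image_seq_length: int) -> List[int]:
    """Return input_ids with image_seq_length copies of image_token_id in front."""
    if image_seq_length < 0:
        raise ValueError("image_seq_length must be non-negative")
    return [image_token_id] * image_seq_length + list(input_ids)


def append_turn(messages: List[Dict[str, str]], role: str, content: str) -> List[Dict[str, str]]:
    """Return a new message list with one more turn, enforcing user/assistant alternation."""
    if role not in ("user", "assistant"):
        raise ValueError(f"unsupported role: {role}")
    if not messages and role != "user":
        raise ValueError("a conversation must start with a user turn")
    if messages and messages[-1]["role"] == role:
        raise ValueError("turns must alternate between user and assistant")
    return messages + [{"role": role, "content": content}]


# Demo with made-up values (hypothetical reply, token ids, and sequence length).
history = append_turn([], "user", "What is in this image?")
history = append_turn(history, "assistant", "A car parked on a street.")
history = append_turn(history, "user", "What color is it?")

fake_prompt_ids = [2, 106, 1645]  # stand-in for tokenizer.apply_chat_template(...)
full_ids = prepend_image_tokens(fake_prompt_ids, image_token_id=9999, image_seq_length=4)
print(len(history), full_ids)
```

For a follow-up question, the extended `history` would presumably be passed to `tokenizer.apply_chat_template` in place of `messages`, with the same `pixel_values` reused.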

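📚 Detailed documentation

Training procedure

We fine-tuned this model with LLaMA Factory. During fine-tuning, the vision tower was frozen and the parameters of the language model and the projection layer were updated.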
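
The freezing scheme described above can be sketched as a filter over parameter names. This is a minimal illustration, not LLaMA Factory's actual code: the helper `freeze_by_prefix` is hypothetical, and the submodule names `vision_tower`, `multi_modal_projector`, and `language_model` are assumed to match the Transformers PaliGemma implementation. The demo uses a stand-in object so it runs without loading weights; any object exposing `named_parameters()`, such as a `torch.nn.Module`, should behave the same way.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def freeze_by_prefix(model, frozen_prefixes: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Set requires_grad=False for parameters under the given prefixes, True otherwise.

    Returns (frozen_names, trainable_names).
    """
    prefixes = tuple(frozen_prefixes)
    frozen, trainable = [], []
    for name, param in model.named_parameters():
        is_frozen = name.startswith(prefixes)
        param.requires_grad = not is_frozen
        (frozen if is_frozen else trainable).append(name)
    return frozen, trainable


class FakeModel:
    """Stand-in exposing named_parameters() like torch.nn.Module (hypothetical names)."""

    def __init__(self):
        self._params = {
            "vision_tower.encoder.weight": SimpleNamespace(requires_grad=True),
            "multi_modal_projector.linear.weight": SimpleNamespace(requires_grad=True),
            "language_model.lm_head.weight": SimpleNamespace(requires_grad=True),
        }

    def named_parameters(self):
        return self._params.items()


model = FakeModel()
frozen, trainable = freeze_by_prefix(model, ["vision_tower"])
print("frozen:", frozen)
print("trainable:", trainable)
```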

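Training hyperparameters

The following hyperparameters were used during training:

Learning rate: 0.000003
Number of epochs: 2.0
Training batch size: 4
Gradient accumulation steps: 16
Total training batch size: 64
Random seed: 42
Learning rate scheduler type: cosine
Mixed-precision training: bf16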
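
The total training batch size is the product of the per-device batch size, the number of devices, and the gradient accumulation steps. The sketch below shows that arithmetic; the function name is hypothetical, and the device count of 4 is inferred rather than stated in the source (it is the value that reconciles a per-device batch size of 1 in the configuration with the listed training batch size of 4 and total of 64).

```python
def total_train_batch_size(per_device_batch_size: int, num_devices: int, gradient_accumulation_steps: int) -> int:
    """Number of samples contributing to one optimizer step."""
    for value in (per_device_batch_size, num_devices, gradient_accumulation_steps):
        if value < 1:
            raise ValueError("all arguments must be positive integers")
    return per_device_batch_size * num_devices * gradient_accumulation_steps


# per_device_batch_size=1 and gradient_accumulation_steps=16 come from the configuration;
# num_devices=4 is an inferred assumption (64 / (1 * 16)), not stated in the source.
print(total_train_batch_size(1, 4, 16))  # 64
```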

LLaMA Factory configuration

### model
model_name_or_path: google/paligemma-3b-mix-448
visual_inputs: true

### method
stage: sft
do_train: true
finetuning_type: full

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: identity,llava_150k_en,llava_150k_zh
template: gemma
cutoff_len: 1536
overwrite_cache: true
preprocessing_num_workers: 16
tokenized_path: cache/paligemma-identity-llava-zh-en-300k

### output
output_dir: models/paligemma-3b-chat-v0.2
logging_steps: 10
save_steps: 1000
plot_loss: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 0.000003
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_steps: 50
bf16: true
do_eval: false
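
The `lr_scheduler_type: cosine` and `warmup_steps: 50` settings correspond to a linear warmup followed by cosine decay. The sketch below reproduces the commonly used formula in plain Python. It is not taken from the training code, and `total = 1000` in the demo is a made-up value because the source does not state the total number of optimizer steps.

```python
import math


def cosine_lr_with_warmup(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step: linear warmup, then half-cosine decay to 0."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(1.0, progress)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


base_lr = 0.000003  # learning_rate from the configuration
warmup = 50         # warmup_steps from the configuration
total = 1000        # hypothetical; not given in the source

for s in (0, 25, 50, 525, 1000):
    print(s, cosine_lr_with_warmup(s, base_lr, warmup, total))
```

The rate rises linearly to the base value at step 50, passes through half the base value midway through the decay phase, and reaches zero at the final step.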

Framework versions

PyTorch 2.3.0
Transformers 4.41.0
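
To see how a local environment compares with the versions listed above, a small helper along these lines may be useful. This is only a sketch: the helper names are hypothetical, the parser handles plain dotted numeric versions, and the source does not say that these exact versions are required.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.3.0+cu121' into (2, 3, 0), ignoring local build tags and non-numeric parts."""
    core = version.split("+", 1)[0]
    parts = []
    for piece in core.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


listed = {"torch": "2.3.0", "transformers": "4.41.0"}  # versions from the list above
for package, wanted in listed.items():
    found = installed_version(package)
    if found is None:
        print(f"{package}: not installed (listed: {wanted})")
    else:
        status = "ok" if parse_version(found) >= parse_version(wanted) else "older than listed"
        print(f"{package}: {found} ({status})")
```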

Evaluation results

Model	MMMU_Val	CMMMU_Val
Yi-VL-6B	36.8	32.2
Paligemma-3B-Chat-v0.2	33.0	29.0

📄 License

License: gemma

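Information table

Property	Details
Model type	Multi-turn chat completion model fine-tuned from `google/paligemma-3b-mix-448`
Training data	BUAADreamer/llava-en-zh-300k
Supported languages	English, Chinese
Library name	transformers
Task type	Image-text-to-text
Base model	google/paligemma-3b-mix-448
Inference	No
Tags	paligemma, llama-factory, mllm, vlm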