Paligemma-3b-ft-cococap-224: open-source vision-language model with multilingual support for a range of vision-language tasks

Paligemma 3b Ft Cococap 224

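Developed by Google

PaliGemma is a versatile, lightweight vision-language model (VLM) that accepts and produces text in multiple languages and suits a range of vision-language tasks.

Image-to-Text

Transformers

#Multimodal vision-language #Lightweight VLM #Multilingual captioning

Downloads: 209

Released: 5/13/2024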

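Model overview

PaliGemma is built from open components, combining the SigLIP vision model with the Gemma language model. It handles image and short-video captioning, visual question answering, text reading, object detection, and segmentation.

Model features

Versatility

Handles multiple vision-language tasks, such as question answering, captioning, and segmentation.

Multilingual support

Accepts input and produces output in multiple languages.

Lightweight design

A relatively small parameter count makes the model practical for research and deployment on a variety of devices.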

Model capabilities

Image captioning

Visual question answering

Text reading

Object detection

Object segmentation

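Use cases

Multimedia processing

Image captioning

Generates multilingual captions for images or short videos.

Outcome: a caption that describes the image content.

Visual question answering

Answers natural-language questions about the content of an image.

Outcome: an answer to the question asked.

Computer vision

Object detection

Detects objects in an image and outputs bounding-box coordinates.

Outcome: objects in the image identified and localized.

Object segmentation

Segments objects in an image at the pixel level.

Outcome: a segmentation mask for each target object.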

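🚀 PaliGemma model card

PaliGemma is a versatile, lightweight vision-language model (VLM) that takes an image and text as input and generates text as output, with support for multiple languages. It is suited to vision-language tasks such as image and short-video captioning, visual question answering, text reading, object detection, and object segmentation.

🚀 Quick start

To use PaliGemma on Hugging Face, you must review and agree to Google's usage license. Make sure you are logged in to Hugging Face, then confirm the license on the model page; requests are processed immediately.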

📦 Installation

To run inference automatically in 4-bit or 8-bit precision, install bitsandbytes (together with accelerate):

pip install bitsandbytes accelerate
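
Before loading a quantized model, it can help to confirm that both packages are importable in the current environment. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of any PaliGemma or Transformers API.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the subset of top-level package names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


# Packages needed for 4-bit / 8-bit loading, per the install command above.
missing = missing_packages(["bitsandbytes", "accelerate"])
if missing:
    print("Install first: pip install " + " ".join(missing))
else:
    print("Quantized loading dependencies are available.")
```
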

💻 Usage examples

Basic usage

Running at the default precision (float32) on CPU:

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

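# Note: this snippet loads the "mix" checkpoint. To try the checkpoint this page
# describes, the id is assumed to be "google/paligemma-3b-ft-cococap-224".
model_id = "google/paligemma-3b-mix-224"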

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Instruct the model to create a caption in Spanish
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt")
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
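
The example above passes the prompt `caption es`, a task prefix followed by a language code. Other tasks are selected the same way through the prompt text. The sketch below builds such prompts; only `caption es` appears in this document, so the `answer`, `detect`, `segment`, and `ocr` prefixes are assumptions based on the conventions commonly documented for the PaliGemma mix checkpoints, and `build_prompt` is a hypothetical helper rather than a library function.

```python
def build_prompt(task: str, text: str = "", lang: str = "en") -> str:
    """Build a PaliGemma-style task prompt (prefix conventions are assumed)."""
    task = task.lower()
    if task == "caption":
        # Matches the "caption es" prompt used in the example above.
        return f"caption {lang}"
    if task == "answer":
        if not text:
            raise ValueError("a question is required for the 'answer' task")
        return f"answer {lang} {text}"
    if task in ("detect", "segment"):
        if not text:
            raise ValueError(f"an object name is required for the '{task}' task")
        return f"{task} {text}"
    if task == "ocr":
        return "ocr"
    raise ValueError(f"unknown task: {task}")


# The result can be passed as `text=` to the processor call shown above.
prompt = build_prompt("caption", lang="es")
```

Keeping prompt construction in one place makes it easier to switch tasks without touching the generation code.
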

Advanced usage

Running at other precisions on CUDA

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/paligemma-3b-mix-224"
device = "cuda:0"
dtype = torch.bfloat16

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map=device,
    revision="bfloat16",
).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Instruct the model to create a caption in Spanish
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)

Loading in 4-bit / 8-bit

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch
from transformers import BitsAndBytesConfig

model_id = "google/paligemma-3b-mix-224"
device = "cuda:0"
dtype = torch.bfloat16

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quantization_config
).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Instruct the model to create a caption in Spanish
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
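
The example above loads the model in 8-bit. For 4-bit loading, a sketch along the following lines should work, assuming a CUDA GPU and the bitsandbytes install described earlier. The NF4 quantization type and bfloat16 compute dtype are illustrative choices, not settings taken from this model card, and `load_paligemma_4bit` is a hypothetical helper name.

```python
def load_paligemma_4bit(model_id="google/paligemma-3b-mix-224"):
    """Load PaliGemma with 4-bit weights (sketch; needs CUDA and bitsandbytes)."""
    # Imports live inside the function so that defining it does not require
    # torch or transformers to be installed.
    import torch
    from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration

    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # assumed choice of 4-bit format
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
    )
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id, quantization_config=quantization_config
    )
    return model.eval()
```

The returned model can replace `model` in the 8-bit example; the processor and generation code stay the same.
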

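📚 Detailed documentation

Model information

Model overview

PaliGemma is inspired by PaLI-3 and built from open components such as the SigLIP vision model and the Gemma language model. It consists of a Transformer decoder and a Vision Transformer image encoder, with 3 billion parameters in total.

Input: an image and a text string, such as a prompt to caption the image or a question.
Output: text generated in response to the input, such as an image caption, an answer to a question, a list of object bounding-box coordinates, or segmentation codewords.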
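
Because bounding boxes come back as text, they have to be parsed before use. The sketch below assumes the detection output uses four `<locNNNN>` tokens per object in the order y_min, x_min, y_max, x_max, each a bin index out of 1024, followed by a label, with objects separated by `;`. That format is an assumption not stated in this document, so check it against real model output before relying on it; `parse_detections` is a hypothetical helper.

```python
import re

# Four consecutive location tokens followed by a label (assumed format).
LOC_BOX = re.compile(r"((?:<loc\d{4}>){4})\s*([^;<]+)")


def parse_detections(text, width, height):
    """Convert assumed <locNNNN> detection text into pixel-space boxes."""
    detections = []
    for match in LOC_BOX.finditer(text):
        # Assumed token order: y_min, x_min, y_max, x_max, as bins out of 1024.
        y0, x0, y1, x1 = (
            int(value) / 1024 for value in re.findall(r"<loc(\d{4})>", match.group(1))
        )
        detections.append(
            {
                "label": match.group(2).strip(),
                # Stored as (x_min, y_min, x_max, y_max) in pixels.
                "box": (
                    round(x0 * width),
                    round(y0 * height),
                    round(x1 * width),
                    round(y1 * height),
                ),
            }
        )
    return detections


sample = "<loc0256><loc0128><loc0768><loc0512> car ; <loc0000><loc0000><loc1023><loc1023> road"
boxes = parse_detections(sample, width=640, height=480)
```

Segmentation codewords are not handled here; decoding them into masks requires a separate decoder model.
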

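Model data

Pre-training datasets: PaliGemma is pre-trained on a mixture of datasets, including WebLI, CC3M-35L, VQ²A-CC3M-35L/VQG-CC3M-35L, OpenImages, and WIT.
Data responsibility filtering: to train the model on clean data, several filters are applied to WebLI, including pornographic image filtering, text safety filtering, text toxicity filtering, and text personal-information filtering.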