amoral-gemma3-12B-vision open-source model: a vision-enabled large language model for multimodal tasks

Amoral Gemma3 12B Vision

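Developed by gghfez

A vision-enabled version of soob3123/amoral-gemma3-12B that combines the Gemma3-12B large language model with a vision encoder to support multimodal tasks.

Image-to-Text

Transformers

English #multimodal-visual-understanding #high-accuracy-image-captioning #natural-language-generation

Downloads: 25

Released: 3/21/2025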

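Model Overview

This is a multimodal model that accepts image and text inputs and generates detailed image descriptions or answers questions about an image. It is reported to outperform the base Gemma3-12B model on visual understanding.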

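Model Features

Multimodal capability

Processes image and text inputs together, enabling cross-modal understanding.

Detailed image descriptions

Generates richer and more accurate image descriptions than the base Gemma3-12B model.

Efficient inference

Supports automatic device placement (device_map) and bfloat16 precision to reduce memory use and speed up inference.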
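
To make the bfloat16 point concrete, the sketch below estimates how much memory the weights alone would occupy at two precisions. The parameter count is an assumption taken from the "12B" in the model name, not a measured value, and the estimate ignores activations, the KV cache, and any framework overhead.

```python
# Rough weight-memory estimate. The parameter count is a nominal assumption
# (12 billion, from the "12B" in the model name), not a measured figure.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

NOMINAL_PARAMS = 12_000_000_000


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the approximate size of the model weights in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: ~{weight_memory_gib(NOMINAL_PARAMS, dtype):.1f} GiB")
```

Under these assumptions bfloat16 needs half the weight memory of float32. With device_map="auto", weights that do not fit on a single device can be spread across the devices that are available.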

Model Capabilities

Image understanding

Image description generation

Visual question answering

Multimodal dialogue

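Use Cases

Content analysis

Image description generation

Generates a detailed text description of an uploaded image.

The output is a rich description covering objects, scene, colors, lighting, and similar elements.

Assistive tools

Visual assistance

Helps visually impaired users understand the content of an image.

Provides accurate, detailed scene descriptions.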
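
Both use cases come down to sending one image plus a text instruction. The sketch below builds such a request in the same message layout as this page's basic usage example. The helper name `build_messages` and the assistive prompt wording are hypothetical and only serve as illustration.

```python
from typing import Any, Dict, List, Optional

# Sample image taken from the basic usage example on this page.
IMAGE_URL = (
    "https://huggingface.co/datasets/huggingface/documentation-images/"
    "resolve/main/bee.jpg"
)


def build_messages(
    image_url: str, prompt: str, system_prompt: Optional[str] = None
) -> List[Dict[str, Any]]:
    """Hypothetical helper: build a one-image, one-instruction chat request."""
    messages: List[Dict[str, Any]] = []
    if system_prompt is not None:
        messages.append(
            {"role": "system", "content": [{"type": "text", "text": system_prompt}]}
        )
    messages.append(
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": prompt},
            ],
        }
    )
    return messages


# Content analysis: a plain captioning request.
caption_request = build_messages(IMAGE_URL, "Describe this image in detail.")

# Visual assistance: the prompt wording here is an illustrative assumption.
assist_request = build_messages(
    IMAGE_URL,
    "Describe the scene for someone who cannot see it, including objects, "
    "colors, and lighting.",
    system_prompt="You are a helpful assistant.",
)
```

Either list can be passed as `messages` to `processor.apply_chat_template` in the usage example.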

🚀 gghfez/amoral-gemma3-12B-vision

This project reattaches the vision encoder to soob3123/amoral-gemma3-12B so that the model can be used for image-related inference tasks.

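🚀 Quick Start

This project is built on the transformers library, uses soob3123/amoral-gemma3-12B as its base model, and is released under the gemma license. The key metadata is summarized in the table below:

Property	Details
Base model	soob3123/amoral-gemma3-12B
Language	en
Library name	transformers
License	gemma
Tags	transformers, gemma3, gemma, google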

💻 Usage Examples

Basic Usage

import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

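model_id = "gghfez/amoral-gemma3-12B-vision"

# Load the weights in bfloat16 and let device_map="auto" place them on the available devices.
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
).eval()

# The processor handles both the chat template and image preprocessing.
processor = AutoProcessor.from_pretrained(model_id)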

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Apply the chat template, load the image, and tokenize in one step.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

# Remember the prompt length so the prompt tokens can be stripped from the output.
input_len = inputs["input_ids"].shape[-1]

# Greedy decoding (do_sample=False) of up to 500 new tokens.
with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=500, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
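
For visual question answering or multi-turn dialogue, one possible approach is to append the model's reply and a follow-up question to the conversation and run the same generation steps again. The sketch below covers only the message bookkeeping. The helper name `add_follow_up` is hypothetical, the assistant reply is a placeholder rather than real model output, and the sketch assumes the chat template accepts an "assistant" turn in the same layout as the other turns.

```python
import copy
from typing import Any, Dict, List


def add_follow_up(
    messages: List[Dict[str, Any]], assistant_reply: str, question: str
) -> List[Dict[str, Any]]:
    """Hypothetical helper: return a copy of the conversation extended with
    the model's previous reply and a new text-only user question."""
    extended = copy.deepcopy(messages)  # leave the caller's list untouched
    extended.append(
        {"role": "assistant", "content": [{"type": "text", "text": assistant_reply}]}
    )
    extended.append(
        {"role": "user", "content": [{"type": "text", "text": question}]}
    )
    return extended


# First turn, in the same layout as the basic usage example.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Placeholder text standing in for the decoded output of the first turn.
previous_reply = "A bee is resting on a flower."

follow_up = add_follow_up(messages, previous_reply, "What color is the flower?")
```

Passing `follow_up` through `processor.apply_chat_template` and `model.generate` as in the basic usage example would then produce the answer to the second question.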