Amoral-gemma3-12B-vision Open-source Model - A Vision-enhanced Large Language Tool for Multimodal Tasks

Amoral Gemma3 12B Vision

Developed by gghfez

A vision-enabled version of soob3123/amoral-gemma3-12B that combines the Gemma3-12B large language model with a vision encoder for multimodal tasks.

Image-to-Text

Transformers

English #Multimodal visual understanding #High-precision image captioning #Natural language generation

Downloads 25

Release Time: 3/21/2025

Model Overview

This is a multimodal model that accepts both image and text inputs and generates detailed image descriptions or answers questions about an image. In the author's informal tests, its descriptions were more detailed than those of google/gemma-3-12b-it.

Model Features

Multimodal capability

Processes both image and text inputs simultaneously for cross-modal understanding

Detailed image captioning

Generates more detailed image descriptions than google/gemma-3-12b-it, according to the author's informal tests

Efficient inference

Supports automatic device placement (device_map="auto") and bfloat16 precision for more efficient inference
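
To give a rough sense of why bfloat16 matters for a model of this size, the sketch below estimates the memory needed for the weights alone. The parameter count is an assumption (the nominal "12B" from the model name; the real count, including the vision encoder, differs somewhat), and activations and the KV cache are ignored.

```python
# Rough weight-memory estimate. The parameter count is an assumed nominal
# value taken from the "12B" in the model name, not a measured figure.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

ASSUMED_PARAMS = 12_000_000_000  # assumption: nominal parameter count


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: ~{weight_memory_gib(ASSUMED_PARAMS, dtype):.1f} GiB")
```

Under these assumptions, bfloat16 weights take half the memory of float32 weights, which is often the difference between fitting on the available devices or not.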

Model Capabilities

Image understanding

Image caption generation

Visual question answering

Multimodal conversation
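
For visual question answering and multi-turn conversation, the message list can follow the same structure as the Basic Usage example on this page. The helpers below are a minimal sketch: user_turn and assistant_turn are hypothetical names introduced here for illustration, and the assistant reply is a placeholder rather than real model output.

```python
from typing import Optional

# Same example image URL as in the Basic Usage section.
IMAGE_URL = (
    "https://huggingface.co/datasets/huggingface/"
    "documentation-images/resolve/main/bee.jpg"
)


def user_turn(text: str, image_url: Optional[str] = None) -> dict:
    """Build a user message, optionally with an image placed before the text."""
    content = []
    if image_url is not None:
        content.append({"type": "image", "image": image_url})
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


def assistant_turn(text: str) -> dict:
    """Build an assistant message holding an earlier reply."""
    return {"role": "assistant", "content": [{"type": "text", "text": text}]}


messages = [
    user_turn("Describe this image in detail.", image_url=IMAGE_URL),
    assistant_turn("A bumblebee on a pink cosmos flower."),  # placeholder reply
    user_turn("What color is the flower's center?"),  # follow-up question
]
```

A list built this way can be passed to processor.apply_chat_template in place of the messages list from the Basic Usage example.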

Use Cases

Content analysis

Image caption generation

Generates detailed textual descriptions for uploaded images

Outputs rich descriptions including objects, scenes, colors, lighting and other elements

Assistive tools

Visual assistance

Helps visually impaired individuals understand image content

Provides accurate, detailed scene descriptions

🚀 gghfez/amoral-gemma3-12B-vision

This project reattaches the vision encoder to soob3123/amoral-gemma3-12B, enabling it to handle visual information.
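
The card does not describe how the vision encoder was reattached. One plausible approach, shown here purely as a conceptual sketch and not as the author's confirmed procedure, is to start from a multimodal checkpoint's weights and overwrite only the language-model weights with the fine-tuned ones. The key names and the "language_model." prefix below are hypothetical, and plain strings stand in for tensors.

```python
def reattach_vision(multimodal_sd: dict, finetuned_text_sd: dict,
                    text_prefix: str = "language_model.") -> dict:
    """Keep vision weights from the multimodal checkpoint and replace the
    language-model weights with those of the text-only fine-tune.

    text_prefix is a hypothetical naming convention used for illustration.
    """
    merged = dict(multimodal_sd)  # shallow copy; vision entries stay as-is
    for name, weight in finetuned_text_sd.items():
        key = text_prefix + name
        if key not in merged:
            raise KeyError(f"no matching entry for {key!r}")
        merged[key] = weight
    return merged


# Toy state dicts: strings stand in for real tensors.
multimodal = {
    "vision_tower.patch_embed": "vision-weights",
    "language_model.layer0": "base-text-weights",
}
finetuned = {"layer0": "finetuned-text-weights"}

merged = reattach_vision(multimodal, finetuned)
print(merged)
```

The explicit KeyError guards against silently skipping weights when the two checkpoints do not share the same layer names.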

Model Information

Property	Details
Base Model	soob3123/amoral-gemma3-12B
Language	en
Library Name	transformers
License	gemma
Tags	transformers, gemma3, gemma, google

🚀 Quick Start

💻 Usage Examples

Basic Usage

from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image
import requests
import torch

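model_id = "gghfez/amoral-gemma3-12B-vision"

# device_map="auto" places the weights on the available devices automatically.
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto"
).eval()

# The processor handles both the chat template and image preprocessing.
processor = AutoProcessor.from_pretrained(model_id)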

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

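# Apply the chat template, tokenize, and move the tensors to the model's device,
# casting floating-point tensors (the pixel values) to bfloat16.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

# Remember the prompt length so the prompt can be stripped from the output.
input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    # Greedy decoding (do_sample=False), limited to 500 new tokens.
    generation = model.generate(**inputs, max_new_tokens=500, do_sample=False)
    # Keep only the newly generated tokens.
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)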

Output

Here's a detailed description of the image:

Overall Impression: The image is a close-up shot of a vibrant garden scene, focusing on pink cosmos flowers and a busy bumblebee. The composition is well-balanced, with the flowers and bee as the main subjects against a backdrop of greenery and other plants.

Flowers: The primary focus is on the pink cosmos flowers. They have delicate, slightly ruffled petals in shades of pink, with a bright yellow center. Some of the flowers are in full bloom, while others appear to be past their prime, with dried or wilted petals. The flowers are clustered together, creating a sense of abundance and natural beauty.

Bumblebee: A bumblebee is prominently featured on one of the cosmos flowers. It's positioned in the center of the frame, actively collecting nectar or pollen. The bee has a fuzzy, black and yellow body, and its wings are slightly blurred due to its movement.

Background: The background consists of a mix of green foliage, including large leaves and smaller plants. There are also some dried or faded flowers in the background, adding texture and depth to the image. A few red flowers are visible in the lower right corner, providing a pop of color.

Lighting and Color: The image is well-lit, with natural light illuminating the scene. The colors are vibrant and saturated, particularly the pink of the cosmos flowers and the yellow of the bumblebee. The overall effect is one of warmth and vitality.

I tested it with other images as well, and I like the results! They are a lot more detailed than those from google/gemma-3-12b-it.
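
In the Basic Usage code, the line generation[0][input_len:] is there because generate returns the prompt tokens followed by the new tokens, so the prompt has to be sliced off before decoding. The sketch below illustrates the same slicing with plain lists; the token IDs are made up for illustration.

```python
def strip_prompt(sequence: list, input_len: int) -> list:
    """Return only the tokens generated after the prompt."""
    return sequence[input_len:]


prompt_ids = [2, 105, 2364]      # made-up prompt token IDs
new_ids = [8291, 1217, 1]        # made-up generated token IDs
full_sequence = prompt_ids + new_ids  # shape of what generate returns per sample

generated_only = strip_prompt(full_sequence, len(prompt_ids))
print(generated_only)
```

Without this step, the decoded text would start with the system and user messages instead of the model's answer.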

📄 License

This project is under the gemma license.
