# MedGemma Model
MedGemma is a collection of models trained for medical text and image comprehension. It comes in two variants, 4B and 27B, that help developers accelerate the development of healthcare-based AI applications.
## 🚀 Quick Start
### Prerequisites
First, install the Transformers library. Gemma 3 is supported starting from transformers 4.50.0.
```bash
$ pip install -U transformers
```
### Run the model with the `pipeline` API
```python
from transformers import pipeline
from PIL import Image
import requests
import torch

pipe = pipeline(
    "image-text-to-text",
    model="google/medgemma-4b-it",
    torch_dtype=torch.bfloat16,
    device="cuda",
)

# Download an example chest X-ray from Wikimedia Commons
image_url = "https://upload.wikimedia.org/wikipedia/commons/c/c8/Chest_Xray_PA_3-8-2010.png"
image = Image.open(requests.get(image_url, headers={"User-Agent": "example"}, stream=True).raw)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are an expert radiologist."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this X-ray"},
            {"type": "image", "image": image}
        ]
    }
]

output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
```
### Run the model directly
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import requests
import torch

model_id = "google/medgemma-4b-it"

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Download an example chest X-ray from Wikimedia Commons
image_url = "https://upload.wikimedia.org/wikipedia/commons/c/c8/Chest_Xray_PA_3-8-2010.png"
image = Image.open(requests.get(image_url, headers={"User-Agent": "example"}, stream=True).raw)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are an expert radiologist."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this X-ray"},
            {"type": "image", "image": image}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]

# Greedy decoding; decode only the newly generated tokens
with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
```
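The 27B variant is text-only, so it runs through the standard text-generation pipeline instead. Here is a minimal sketch, assuming the instruction-tuned 27B checkpoint is published as `google/medgemma-27b-text-it` (verify the exact id on the collection page):

```python
from transformers import pipeline
import torch

# Assumed checkpoint id for the text-only 27B variant; confirm it on the
# MedGemma collection page before use.
pipe = pipeline(
    "text-generation",
    model="google/medgemma-27b-text-it",
    torch_dtype=torch.bfloat16,
    device="cuda",
)

messages = [
    {"role": "system", "content": "You are a helpful medical assistant."},
    {"role": "user", "content": "What are the first-line treatments for hypertension?"},
]

output = pipe(messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
```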
## ✨ Features
- Multimodal Capability: The 4B version supports both text and vision modalities, while the 27B version focuses on text.
- High Performance: Outperforms the base Gemma 3 models across various multimodal and text-only health benchmarks.
- Long Context Support: Can handle a context length of at least 128K tokens.
## 📦 Installation

```bash
$ pip install -U transformers
```
## 💻 Usage Examples
### Basic Usage
The quick-start snippets above show basic usage: running the model with the `pipeline` API and running the model directly.
### Advanced Usage
For more advanced usage, such as fine-tuning the model, refer to the following Colab notebooks (a minimal fine-tuning sketch follows the list):

- [Quick start notebook in Colab](https://colab.research.google.com/github/google-health/medgemma/blob/main/notebooks/quick_start_with_hugging_face.ipynb)
- [Fine-tuning notebook in Colab](https://colab.research.google.com/github/google-health/medgemma/blob/main/notebooks/fine_tune_with_hugging_face.ipynb)
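As a rough illustration of what parameter-efficient fine-tuning looks like, here is a minimal sketch that attaches LoRA adapters with the `peft` library; the rank, alpha, and target modules below are illustrative assumptions, not the settings used in the official notebook:

```python
from transformers import AutoModelForImageTextToText
from peft import LoraConfig, get_peft_model
import torch

model = AutoModelForImageTextToText.from_pretrained(
    "google/medgemma-4b-it", torch_dtype=torch.bfloat16, device_map="auto"
)

# Illustrative LoRA settings on the attention projections; tune these for
# your task rather than treating them as recommended values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

From here, a standard training loop (for example `transformers.Trainer` or TRL's `SFTTrainer`) can be run on a prepared medical dataset; the notebook above contains the full recipe.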
## 📚 Documentation
### Model information
MedGemma is a collection of Gemma 3 variants. The 4B version utilizes a SigLIP image encoder pre-trained on medical data and is available in both pre-trained and instruction-tuned versions. The 27B version is trained only on medical text and is optimized for inference-time computation.
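For example, switching between the instruction-tuned and pre-trained 4B checkpoints only changes the model id; the pre-trained id below assumes the collection's naming pattern and should be verified on the collection page:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
import torch

# Assumed id for the pre-trained (non-instruction-tuned) 4B checkpoint;
# confirm the exact name on the MedGemma collection page.
model_id = "google/medgemma-4b-pt"

model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
```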
### Model architecture overview

The MedGemma model is based on Gemma 3 and uses the same decoder-only transformer architecture. For more details, refer to the Gemma 3 model card.
### Technical specifications

| Property | Details |
| --- | --- |
| Model type | Decoder-only Transformer architecture; see the [Gemma 3 technical report](https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf) |
| Modalities | 4B: text, vision; 27B: text only |
| Attention mechanism | Grouped-query attention (GQA) |
| Context length | Supports long context, at least 128K tokens |
| Key publication | Coming soon |
| Model created | May 20, 2025 |
| Model version | 1.0.0 |
### Inputs and outputs

Input:

- Text string, such as a question or prompt
- Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
- Total input length of 128K tokens (see the budgeting sketch below)

Output:

- Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
- Total output length of 8192 tokens
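A quick back-of-the-envelope check of these budgets: at 256 tokens per image, even image-heavy prompts leave most of the window for text. This sketch is illustrative arithmetic only and assumes 128K means 128 × 1024 tokens:

```python
# Illustrative arithmetic only; assumes 128K = 128 * 1024 tokens.
CONTEXT_TOKENS = 128 * 1024
TOKENS_PER_IMAGE = 256  # per the input spec above (896 x 896 images)

def remaining_text_budget(num_images: int) -> int:
    """Tokens left for text after encoding num_images images."""
    return CONTEXT_TOKENS - num_images * TOKENS_PER_IMAGE

print(remaining_text_budget(1))    # 130816
print(remaining_text_budget(100))  # 105472
```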
### Performance and validation

MedGemma was evaluated on various multimodal classification, report generation, visual question answering, and text-based tasks.
#### Imaging evaluations

| Task and metric | MedGemma 4B | Gemma 3 4B |
| --- | --- | --- |
| **Medical image classification** | | |
| MIMIC CXR - Average F1 for top 5 conditions | 88.9 | 81.1 |
| CheXpert CXR - Average F1 for top 5 conditions | 48.1 | 31.2 |
| DermMCQA* - Accuracy | 71.8 | 42.6 |
| **Visual question answering** | | |
| SlakeVQA (radiology) - Tokenized F1 | 62.3 | 38.6 |
| VQA-Rad** (radiology) - Tokenized F1 | 49.9 | 38.6 |
| PathMCQA (histopathology, internal***) - Accuracy | 69.8 | 37.1 |
| **Knowledge and reasoning** | | |
| MedXpertQA (text + multimodal questions) - Accuracy | 18.8 | 16.4 |
*Based on [ref](https://www.nature.com/articles/s41591-020-0842-3), presented as a 4-way MCQ per example for skin condition classification.

**On balanced split, see ref.

***Based on multiple datasets, presented as 3-9 way MCQ per example for identification, grading, and subtype for breast, cervical, and prostate cancer.
#### Chest X-ray report generation

| Metric | MedGemma 4B (pre-trained) | PaliGemma 2 3B (tuned for CXR) | PaliGemma 2 10B (tuned for CXR) |
| --- | --- | --- | --- |
| MIMIC CXR - RadGraph F1 | 29.5 | 28.8 | 29.5 |
#### Text evaluations

| Metric | MedGemma 27B | Gemma 3 27B | MedGemma 4B | Gemma 3 4B |
| --- | --- | --- | --- | --- |
| MedQA (4-op) | 89.8 (best-of-5), 87.7 (0-shot) | 74.9 | 64.4 | 50.7 |
| MedMCQA | 74.2 | 62.6 | 55.7 | 45.4 |
| PubMedQA | 76.8 | 73.4 | 73.4 | 68.4 |
| MMLU Med (text only) | 87.0 | 83.3 | 70.0 | 67.2 |
| MedXpertQA (text only) | 26.7 | 15.7 | 14.2 | 11.6 |
| AfriMed-QA | 84.0 | 72.0 | 52.0 | 48.0 |
### Citation

```bibtex
@misc{medgemma-hf,
    author = {Google},
    title = {MedGemma Hugging Face},
    howpublished = {\url{https://huggingface.co/collections/google/medgemma-release-680aade845f90bec6a3f60c4}},
    year = {2025},
    note = {Accessed: [Insert Date Accessed, e.g., 2025-05-20]}
}
```
## 🔧 Technical Details

- The model uses grouped-query attention (GQA) in its attention mechanism; a small sketch follows this list.
- It supports a long context length of at least 128K tokens.
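For intuition, here is a minimal PyTorch sketch of grouped-query attention, in which several query heads share one key/value head to shrink the KV cache; the head counts and dimensions are illustrative, not MedGemma's actual configuration:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only, not MedGemma's real configuration.
batch, seq, head_dim = 2, 16, 64
num_q_heads, num_kv_heads = 8, 2          # 4 query heads share each KV head
group = num_q_heads // num_kv_heads

q = torch.randn(batch, num_q_heads, seq, head_dim)
k = torch.randn(batch, num_kv_heads, seq, head_dim)   # fewer KV heads => smaller KV cache
v = torch.randn(batch, num_kv_heads, seq, head_dim)

# Expand each KV head so its group of query heads can attend to it.
k = k.repeat_interleave(group, dim=1)     # -> (batch, num_q_heads, seq, head_dim)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)                          # torch.Size([2, 8, 16, 64])
```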
## 📄 License

The use of MedGemma is governed by the [Health AI Developer Foundations terms of use](https://developers.google.com/health-ai-developer-foundations/terms).

To access MedGemma on Hugging Face, you must review and agree to the [Health AI Developer Foundations terms of use](https://developers.google.com/health-ai-developer-foundations/terms). To do this, make sure you're logged in to Hugging Face and click below. Requests are processed immediately.

[Acknowledge license](https://huggingface.co/collections/google/medgemma-release-680aade845f90bec6a3f60c4)
## Resources

- [Model on Google Cloud Model Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/medgemma)
- [Model on Hugging Face](https://huggingface.co/collections/google/medgemma-release-680aade845f90bec6a3f60c4)
- [GitHub repository](https://github.com/google-health/medgemma)
- [Quick start notebook](https://github.com/google-health/medgemma/blob/main/notebooks/quick_start_with_hugging_face.ipynb)
- [Fine-tuning notebook](https://github.com/google-health/medgemma/blob/main/notebooks/fine_tune_with_hugging_face.ipynb)
- Patient Education Demo
- [Contact](https://developers.google.com/health-ai-developer-foundations/medgemma/get-started.md#contact)