Gemma 3n E2B open model: a free-to-deploy, lightweight model with multimodal input and text output

Gemma 3n E2B

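Developed by Google

Gemma 3n is a family of lightweight, state-of-the-art open models from Google that accepts multimodal input and generates text output.

Image-to-Text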

Transformers

#multimodal-processing #lightweight-open-source #parameter-efficient-architecture

Downloads: 206

Released: 6/12/2025

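Model overview

Gemma 3n is a lightweight open model built from the same research and technology as the Gemini models. It accepts text, audio, and visual (image and video) input and suits a wide range of tasks and data formats.

Model features

Multimodal support

Processes text, image, video, and audio input and generates text output.

Architectural innovation

Uses the MatFormer architecture, which allows sub-models to be nested within the E4B model.

Resource efficiency

By offloading low-utilization matrices from the accelerator, the model runs with a memory footprint comparable to that of a traditional 2B model.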
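To give a rough sense of what "a memory footprint comparable to a traditional 2B model" means, the sketch below estimates weight memory from a parameter count and a data type. The helper name `weight_memory_gib` is hypothetical, the 2-billion figure is taken from the "2B" label above, and the estimate covers weights only (activations, caches, and framework overhead are ignored).

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory needed to hold the weights, in GiB (weights only)."""
    return num_params * bytes_per_param / 2**30


# Assumption: a "traditional 2B model" has about 2 billion parameters.
params_2b = 2_000_000_000

# bfloat16 stores each parameter in 2 bytes, float32 in 4 bytes.
print(round(weight_memory_gib(params_2b, 2), 2))  # bfloat16
print(round(weight_memory_gib(params_2b, 4), 2))  # float32
```

Halving the bytes per parameter halves the weight memory, which is one reason the usage examples on this page load the model in bfloat16.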

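Model capabilities

Text generation

Image analysis

Video analysis

Audio analysis

Multimodal reasoning

Use cases

Content creation

Image captioning

Generates a detailed text description from an input image.

Produces accurate, detailed image descriptions.

Research and education

Multimodal learning

Uses multimodal input for education and research tasks.

Improves the efficiency of learning and research.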

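🚀 Introduction to Gemma 3n

Gemma 3n is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology as the Gemini models. The models accept text, audio, and visual (image and video) input and suit a wide range of tasks and data formats.

🚀 Quick start

This repository corresponds to the release of Gemma 3n E2B. It is meant to be used with the Hugging Face transformers library and supports text, audio, and visual (image and video) input.

✨ Key features

Multimodal support: processes text, image, video, and audio input and generates text output.
Architectural innovation: available in two sizes based on effective parameters; uses the MatFormer architecture, which allows sub-models to be nested within the E4B model.
Resource efficiency: by offloading low-utilization matrices from the accelerator, the model runs with a memory footprint comparable to that of a traditional 2B model.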
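The idea of nesting a sub-model inside a larger one can be illustrated with a toy feed-forward layer. This is a minimal sketch of the general nesting principle only, assuming the smaller model reuses a prefix of the larger model's hidden units; the page does not describe how Gemma 3n implements MatFormer, and the function name, weights, and sizes below are all hypothetical.

```python
def nested_ffn(x, w_in, w_out, hidden_units):
    """Feed-forward pass that uses only the first `hidden_units` hidden neurons.

    w_in has one row per hidden neuron; w_out has one row per output feature.
    """
    # Hidden activations: ReLU of the dot product, for a prefix of the neurons.
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(w_in[j], x)))
        for j in range(hidden_units)
    ]
    # Each output feature reads the same prefix of hidden neurons.
    return [sum(row[j] * hidden[j] for j in range(hidden_units)) for row in w_out]


# Toy weights: 2 input features, 4 hidden neurons, 2 output features.
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
w_out = [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
x = [1.0, 2.0]

full = nested_ffn(x, w_in, w_out, hidden_units=4)   # "large" model
small = nested_ffn(x, w_in, w_out, hidden_units=2)  # nested sub-model, same weights
print(full, small)
```

Both calls share one set of weights; the smaller configuration simply ignores the trailing hidden neurons, so no separate copy of the sub-model has to be stored.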

📦 Installation

First, install the transformers library. Gemma 3n is supported starting from transformers 4.53.0.

$ pip install -U transformers
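Because the minimum version matters, it can help to confirm what is actually installed before loading the model. The sketch below uses only the standard library; the helper names `parse_version` and `supports_gemma3n` are hypothetical, and the parser is deliberately simple (it reads leading digits only, so a pre-release such as `4.53.0rc1` is treated like `4.53.0`).

```python
from importlib.metadata import PackageNotFoundError, version

MIN_VERSION = (4, 53, 0)  # first transformers release with Gemma 3n support


def parse_version(text):
    """Turn '4.53.0' or '4.54.0.dev0' into a 3-tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:  # pad '4.53' to (4, 53, 0)
        parts.append(0)
    return tuple(parts[:3])


def supports_gemma3n(installed):
    """True if the given transformers version string meets the minimum."""
    return parse_version(installed) >= MIN_VERSION


if __name__ == "__main__":
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if supports_gemma3n(installed) else "too old for Gemma 3n"
        print(installed, status)
```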

💻 Usage examples

Basic usage

Run inference with the pipeline API:

from transformers import pipeline
import torch

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3n-e2b",
    device="cuda",
    torch_dtype=torch.bfloat16,
)
output = pipe(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
    text="<image_soft_token> in this image, there is"
)

print(output)
# [{'input_text': '<image_soft_token> in this image, there is',
# 'generated_text': '<image_soft_token> in this image, there is a beautiful flower and a bee is sucking nectar and pollen from the flower.'}]
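As the printed result shows, `generated_text` repeats the prompt before the model's continuation. A small post-processing step can strip that prefix. This is a sketch that assumes the result has the dictionary shape shown above; the helper name `continuation` is hypothetical.

```python
def continuation(result):
    """Return only the text the model added after the prompt."""
    prompt = result["input_text"]
    full = result["generated_text"]
    # Strip the prompt only if the output really starts with it.
    return full[len(prompt):] if full.startswith(prompt) else full


# Sample copied from the output shown above.
sample = [{
    "input_text": "<image_soft_token> in this image, there is",
    "generated_text": "<image_soft_token> in this image, there is a beautiful "
                      "flower and a bee is sucking nectar and pollen from the flower.",
}]

print(continuation(sample[0]).strip())
```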

Advanced usage

Run the model on a single GPU:

from transformers import AutoProcessor, Gemma3nForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/gemma-3n-e2b"

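# `device` is not a documented from_pretrained() argument; load the weights, then move the model to the GPU.
model = Gemma3nForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda").eval()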

processor = AutoProcessor.from_pretrained(model_id)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "<image_soft_token> in this image, there is"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=10)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
# one picture of flowers which shows that the flower is
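The slice `generation[0][input_len:]` in the example works because the generated sequence begins with the prompt tokens, followed by the newly generated ones. The sketch below shows the same slicing on plain Python lists; the token ids are made up for illustration and the helper name `new_tokens` is hypothetical.

```python
def new_tokens(sequence, input_len):
    """Drop the first `input_len` tokens (the prompt) and keep the rest."""
    return sequence[input_len:]


prompt_ids = [2, 101, 102, 103]      # hypothetical prompt token ids
generated = prompt_ids + [7, 8, 9]   # prompt followed by three new tokens

print(new_tokens(generated, len(prompt_ids)))
```

Decoding only this tail is why the printed result above contains the continuation without the prompt.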

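📚 Detailed documentation

Model information

Property	Details
Model type	Gemma 3n is a family of lightweight, state-of-the-art open models from Google that accepts multimodal input and generates text output.
Training data	The model was trained on a dataset of roughly 11 trillion tokens with a knowledge cutoff of June 2024. The training data draws on many sources, including web documents, code, mathematics, images, and audio.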

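Model data

Training dataset: these models were trained on a dataset drawn from a wide variety of sources, totaling roughly 11 trillion tokens. The knowledge cutoff of the training data is June 2024. Sources include web documents, code, mathematics, images, and audio.
Data preprocessing: rigorous CSAM (child sexual abuse material) filtering, sensitive-data filtering, and additional filtering based on content quality and safety were applied to the training data.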

Implementation information

Hardware: Gemma was trained on Tensor Processing Unit (TPU) hardware (TPUv4p, TPUv5p, and TPUv5e).
Software: training was done with JAX and ML Pathways.

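Evaluation

These models were evaluated at full precision (float32) against a large collection of datasets and metrics covering different aspects of content generation. Results are reported separately for pre-trained (PT) and instruction-tuned (IT) models.

Reasoning and factuality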

Benchmark	Metric	n-shot	E2B PT	E4B PT
HellaSwag	Accuracy	10-shot	72.2	78.6
BoolQ	Accuracy	0-shot	76.4	81.6
PIQA	Accuracy	0-shot	78.9	81.0
SocialIQA	Accuracy	0-shot	48.8	50.0
TriviaQA	Accuracy	5-shot	60.8	70.2
Natural Questions	Accuracy	5-shot	15.5	20.9
ARC-c	Accuracy	25-shot	51.7	61.6
ARC-e	Accuracy	0-shot	75.8	81.6
WinoGrande	Accuracy	5-shot	66.8	71.7
BIG-Bench Hard	Accuracy	few-shot	44.3	52.9
DROP	Token F1 score	1-shot	53.9	60.8
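The DROP row reports a token F1 score rather than accuracy. As a rough illustration of how such a score rewards partial overlap, here is a minimal token-level F1 sketch. It assumes simple lowercasing and whitespace tokenization; the official DROP scorer applies additional normalization that is not reproduced here, and the function name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("four touchdowns", "four"))   # partial credit
print(token_f1("Same Answer", "same answer"))  # exact match after lowercasing
```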

Multilingual

Benchmark	Metric	n-shot	E2B IT	E4B IT
MGSM	Accuracy	0-shot	53.1	60.7
WMT24++ (ChrF)	Character-level F-score	0-shot	42.7	50.1
Include	Accuracy	0-shot	38.6	57.2
MMLU (ProX)	Accuracy	0-shot	8.1	19.9
OpenAI MMLU	Accuracy	0-shot	22.3	35.6
Global-MMLU	Accuracy	0-shot	55.1	60.3
ECLeKTic	ECLeKTic score	0-shot	2.5	1.9

STEM and code

Benchmark	Metric	n-shot	E2B IT	E4B IT
GPQA Diamond	Relaxed accuracy/accuracy	0-shot	24.8	23.7
LiveCodeBench v5	pass@1	0-shot	18.6	25.7
Codegolf v2.2	pass@1	0-shot	11.0	16.8
AIME 2025	Accuracy	0-shot	6.7	11.6
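Several code benchmarks above report pass@1. The page does not say how the metric was computed for these results; the sketch below shows the commonly used unbiased pass@k estimator, where `n` samples are drawn per problem and `c` of them pass the tests. The function name `pass_at_k` and the sample counts are illustrative assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn from n, is correct.

    n: total samples generated for a problem
    c: number of those samples that pass the tests
    k: number of samples the metric is allowed to draw
    """
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimator reduces to the fraction of passing samples, c / n.
print(pass_at_k(n=10, c=3, k=1))
```

A benchmark score is then the mean of this value over all problems, usually expressed as a percentage.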

Other benchmarks

Benchmark	Metric	n-shot	E2B IT	E4B IT
MMLU	Accuracy	0-shot	60.1	64.9
MBPP	pass@1	3-shot	56.6	63.6
HumanEval	pass@1	0-shot	66.5	75.0
LiveCodeBench	pass@1	0-shot	13.2	13.2
HiddenMath	Accuracy	0-shot	27.7	37.7
Global-MMLU-Lite	Accuracy	0-shot	59.0	64.5
MMLU (Pro)	Accuracy	0-shot	40.5	50.6
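To compare the two sizes at a glance, the tab-separated rows can be parsed and the E4B minus E2B gap computed per benchmark. The sketch below uses three rows copied from the table above (benchmark name and the two score columns only); the function name `score_gaps` is hypothetical.

```python
# Three rows from the table above: benchmark, E2B IT score, E4B IT score.
ROWS = """MMLU\t60.1\t64.9
HumanEval\t66.5\t75.0
LiveCodeBench\t13.2\t13.2"""


def score_gaps(table_text):
    """Map each benchmark name to (E4B score - E2B score), rounded to 0.1."""
    result = {}
    for line in table_text.splitlines():
        name, e2b, e4b = line.split("\t")
        # Round to one decimal to hide floating-point noise in the difference.
        result[name] = round(float(e4b) - float(e2b), 1)
    return result


for benchmark, gap in score_gaps(ROWS).items():
    print(f"{benchmark}: {gap:+.1f}")
```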

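Ethics and safety

Evaluation approach: structured evaluations and internal red-teaming, covering child safety, content safety, and representational harms.
Evaluation results: across all areas of safety testing, the model showed safe levels of performance in the child safety, content safety, and representational harms categories, a significant improvement over previous Gemma models.

Usage and limitations

Intended use: the model can be used in many areas, including content creation and communication, and research and education.
Limitations: model performance depends on factors such as the quality and diversity of the training data, the context provided, and the complexity of the task.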

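🔧 Technical details

To learn more about these techniques, see the technical blog post and the Gemma documentation.

📄 License

The model is released under the Gemma license.

Citation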

@article{gemma_3n_2025,
    title={Gemma 3n},
    url={https://ai.google.dev/gemma/docs/gemma-3n},
    publisher={Google DeepMind},
    author={Gemma Team},
    year={2025}
}