Gemma 3n-E4B-it-4bit-MLX Open-source Multimodal Model - Supports multiple forms of input and is compatible with low-resource devices

Developed by NexaAI

Gemma 3n is a lightweight, open-source multimodal model based on Google's Gemma family. It accepts text, image, video, and audio inputs and is optimized for low-resource devices.

Downloads 122

Release Time: 7/13/2025

Model Overview

Gemma 3n is a lightweight open-source model from Google, built from the same research and technology used to create the Gemini models. It accepts multimodal inputs, generates text outputs, and is suited to low-resource devices.

Model Features

Multimodal support

Capable of processing text, image, audio, and video inputs and generating text outputs.

Low-resource optimization

Uses selective parameter activation to reduce resource requirements, which makes it suitable for running on low-resource devices.

Efficient parameter management

Operates at an effective size of 2B (E2B) or 4B (E4B) parameters, which is lower than the total number of parameters the model contains.

Multilingual support

Trained with data in over 140 spoken languages, with strong multilingual processing capabilities.

Model Capabilities

Text generation

Image content analysis

Audio data processing

Video content understanding

Multilingual text processing

Use Cases

Content generation

Document summarization

Input a long document and generate a concise summary.

Produces a text summary of the input document, within the 32K-token context limit.

Question answering

Input a question and generate a detailed answer.

Scores on question-answering benchmarks such as BoolQ, TriviaQA, and Natural Questions are listed under Benchmark Results.

Multimodal analysis

Image description generation

Input an image and generate a detailed text description.

Images are normalized to 256x256, 512x512, or 768x768 resolution and encoded to 256 tokens each.

Audio transcription

Input audio data and generate a text transcription.

Audio is encoded at 6.25 tokens per second from a single (mono) channel.

🚀 NexaAI/gemma-3n-E4B-it-4bit-MLX

This project provides a model based on Google's Gemma series, enabling multimodal input and text output, suitable for various NLP and multimodal tasks.

🚀 Quick Start

With nexa-sdk installed, you can run the model directly. In the nexa-sdk CLI, use the model identifier:

NexaAI/gemma-3n-E4B-it-4bit-MLX

✨ Features

Overview

Summary description and brief definition of inputs and outputs.

Description

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3n models are designed for efficient execution on low-resource devices. They are capable of multimodal input, handling text, image, video, and audio input, and generating text outputs, with open weights for pre-trained and instruction-tuned variants. These models were trained with data in over 140 spoken languages.

Gemma 3n models use selective parameter activation technology to reduce resource requirements. This technique allows the models to operate at an effective size of 2B and 4B parameters, which is lower than the total number of parameters they contain. For more information on Gemma 3n's efficient parameter management technology, see the Gemma 3n page.

Inputs and outputs

Input:
- Text string, such as a question, a prompt, or a document to be summarized
- Images, normalized to 256x256, 512x512, or 768x768 resolution and encoded to 256 tokens each
- Audio data encoded to 6.25 tokens per second from a single channel
- Total input context of 32K tokens

Output:
- Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
- Total output length of up to 32K tokens, minus the tokens used by the request input
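
The token figures above can be turned into a rough budget estimate. The following is a minimal sketch, not part of the model's tooling: the function names are hypothetical, "32K" is assumed to mean 32,768 tokens, audio tokens are assumed to round up to a whole number, and the text token count is assumed to come from the model's own tokenizer.

```python
import math

# Figures taken from the input/output description above.
CONTEXT_TOKENS = 32 * 1024        # assumption: "32K" means 32,768 tokens
TOKENS_PER_IMAGE = 256            # each image is encoded to 256 tokens
AUDIO_TOKENS_PER_SECOND = 6.25    # single-channel audio


def estimate_input_tokens(text_tokens, num_images=0, audio_seconds=0.0):
    """Estimate the input tokens used by a mixed text/image/audio request.

    text_tokens must be counted with the model's tokenizer; this helper
    (a hypothetical name) only adds the image and audio contributions.
    """
    # Assumption: a partial audio token still occupies a full token slot.
    audio_tokens = math.ceil(audio_seconds * AUDIO_TOKENS_PER_SECOND)
    return text_tokens + num_images * TOKENS_PER_IMAGE + audio_tokens


def remaining_output_tokens(input_tokens):
    """Tokens left for generation once the input is subtracted."""
    return max(CONTEXT_TOKENS - input_tokens, 0)


if __name__ == "__main__":
    # 500 text tokens, 2 images, 60 seconds of audio:
    # 500 + 2 * 256 + 375 = 1387 input tokens
    used = estimate_input_tokens(text_tokens=500, num_images=2, audio_seconds=60)
    print(used, remaining_output_tokens(used))  # prints "1387 31381"
```

Under these assumptions, one minute of audio costs 375 tokens and each image costs a fixed 256 tokens regardless of which of the three resolutions it is normalized to, so long audio clips consume the context faster than a handful of images.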

📚 Documentation

Benchmark Results

These models were evaluated at full precision (float32) against a large collection of different datasets and metrics to cover different aspects of content generation. Evaluation results marked with IT are for instruction-tuned models. Evaluation results marked with PT are for pre-trained models.

Property	Details
Model Type	`NexaAI/gemma-3n-E4B-it-4bit-MLX`
Training Data	Data in over 140 spoken languages

Reasoning and factuality

Benchmark	Metric	n-shot	E2B PT	E4B PT
HellaSwag	Accuracy	10-shot	72.2	78.6
BoolQ	Accuracy	0-shot	76.4	81.6
PIQA	Accuracy	0-shot	78.9	81.0
SocialIQA	Accuracy	0-shot	48.8	50.0
TriviaQA	Accuracy	5-shot	60.8	70.2
Natural Questions	Accuracy	5-shot	15.5	20.9
ARC-c	Accuracy	25-shot	51.7	61.6
ARC-e	Accuracy	0-shot	75.8	81.6
WinoGrande	Accuracy	5-shot	66.8	71.7
BIG-Bench Hard	Accuracy	few-shot	44.3	52.9
DROP	Token F1 score	1-shot	53.9	60.8

Multilingual

Benchmark	Metric	n-shot	E2B IT	E4B IT
MGSM	Accuracy	0-shot	53.1	60.7
WMT24++ (ChrF)	Character-level F-score	0-shot	42.7	50.1
Include	Accuracy	0-shot	38.6	57.2
MMLU (ProX)	Accuracy	0-shot	8.1	19.9
OpenAI MMLU	Accuracy	0-shot	22.3	35.6
Global-MMLU	Accuracy	0-shot	55.1	60.3
ECLeKTic	ECLeKTic score	0-shot	2.5	1.9

STEM and code

Benchmark	Metric	n-shot	E2B IT	E4B IT
GPQA Diamond	RelaxedAccuracy/accuracy	0-shot	24.8	23.7
LiveCodeBench v5	pass@1	0-shot	18.6	25.7
Codegolf v2.2	pass@1	0-shot	11.0	16.8
AIME 2025	Accuracy	0-shot	6.7	11.6

Additional benchmarks

Benchmark	Metric	n-shot	E2B IT	E4B IT
MMLU	Accuracy	0-shot	60.1	64.9
MBPP	pass@1	3-shot	56.6	63.6
HumanEval	pass@1	0-shot	66.5	75.0
LiveCodeBench	pass@1	0-shot	13.2	13.2
HiddenMath	Accuracy	0-shot	27.7	37.7
Global-MMLU-Lite	Accuracy	0-shot	59.0	64.5
MMLU (Pro)	Accuracy	0-shot	40.5	50.6

Reference

Original model card: google/gemma-3n-E4B-it

⚠️ Important Note

To access Gemma on Hugging Face, you must review and agree to Google's usage license. Make sure you are logged in to Hugging Face before acknowledging the license; requests are processed immediately.
