Gemma 3 4B IT Quantized (W4A16)
Gemma 3 is a lightweight family of open models developed by Google. This repository provides the 4B-parameter instruction-tuned version with W4A16 quantization, which significantly reduces hardware requirements.
Downloads: 592
Release date: 2025-03-17
Model Overview
A 4-bit weight-quantized version of the Gemma 3 instruction-tuned model. It is suitable for deployment on consumer-grade hardware, retaining good performance while substantially reducing memory usage.
Model Features
Efficient Quantization
Uses W4A16 quantization: weights are stored at 4-bit precision while activations remain at 16-bit precision, significantly reducing memory requirements.
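To illustrate the W4A16 idea, here is a toy sketch (an illustration only, not the kernel this repository actually uses): weights are mapped to 4-bit integers in [-8, 7] with a per-tensor scale, then dequantized back to floating point at compute time, where they meet 16-bit activations.

```python
def quantize_w4(weights):
    """Symmetric 4-bit quantization with a single per-tensor scale.

    The int4 range is [-8, 7]; the scale maps the largest-magnitude
    weight onto 7 so no value overflows the 4-bit range.
    """
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate weights at compute time (16-bit in practice)."""
    return [v * scale for v in q]

# Hypothetical weight values, purely for demonstration.
weights = [0.5, -1.2, 3.3, -0.05]
q, scale = quantize_w4(weights)
restored = dequantize(q, scale)
```

Each stored weight now needs only 4 bits plus a shared scale; the rounding error per weight is bounded by half the scale, which is why accuracy loss stays small when the weight distribution is well-behaved.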
Instruction Tuning
Optimized through instruction tuning, enabling better understanding and execution of natural language instructions.
Consumer-grade Hardware Adaptation
The quantized model runs more readily on consumer-grade GPUs and CPUs, lowering the deployment barrier.
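As a back-of-envelope check on why 4-bit weights fit consumer hardware, the weights-only footprint of a 4B-parameter model can be estimated as follows (a rough sketch that ignores activations, KV cache, and runtime overhead):

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Weights-only memory estimate in GB (decimal gigabytes).

    Deliberately ignores activations, KV cache, and framework overhead,
    so real usage will be somewhat higher.
    """
    return n_params * bits_per_weight / 8 / 1e9

fp16_gb = weight_memory_gb(4e9, 16)  # 16-bit weights: about 8 GB
w4_gb = weight_memory_gb(4e9, 4)     # 4-bit weights: about 2 GB
```

Going from 16-bit to 4-bit weights cuts the weight footprint roughly 4x, which is what brings a 4B model within reach of a single consumer GPU.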
Model Capabilities
Natural Language Understanding
Text Generation
Instruction Execution
Conversational Interaction
Use Cases
Intelligent Assistant
Chatbot
Build responsive dialogue systems with strong language understanding
Delivers a smooth, natural conversational experience
Content Generation
Text Creation
Assists with writing, content summarization, and similar tasks
Produces high-quality text output
© 2025 AIbase