GLM-4-32B-0414.w4a16-gptq
A 4-bit GPTQ quantization of GLM-4-32B-0414, suitable for consumer-grade hardware.
Release Time: 5/4/2025
Model Overview
This model quantizes GLM-4-32B-0414 to 4-bit weights while keeping 16-bit activations (W4A16) using asymmetric GPTQ quantization, enabling it to run on consumer-grade hardware.
Model Features
4-bit quantization
Quantizes the weights to 4 bits using asymmetric GPTQ, significantly reducing GPU memory usage.
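The core idea of asymmetric 4-bit weight quantization can be sketched as follows. This is a minimal illustration of the scale/zero-point scheme only; real GPTQ additionally uses Hessian-based error compensation and per-group parameters, which are omitted here.

```python
def quantize_asymmetric_4bit(weights):
    """Map floats to 4-bit codes (0..15) with an asymmetric scale and
    zero-point, as in W4A16 weight-only schemes. Illustrative sketch,
    not the actual GPTQ algorithm (no Hessian-based error correction)."""
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / 15 or 1.0  # guard against constant input
    zero_point = round(-w_min / scale)
    q = [max(0, min(15, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from 4-bit codes for 16-bit compute."""
    return [(v - zero_point) * scale for v in q]
```

Because the grid is asymmetric, the representable range hugs the actual min/max of each weight group instead of being forced symmetric around zero, which reduces rounding error for skewed weight distributions.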
Consumer-grade hardware adaptation
The quantized model can run on a single GPU with 32 GB of VRAM.
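A back-of-envelope calculation shows why 4-bit weights make the 32 GB budget feasible (assuming roughly 32B parameters; the exact count and the remaining headroom for activations and KV cache will vary):

```python
params = 32e9               # approximate parameter count (assumption)
bytes_fp16 = params * 2     # 16-bit weights: 2 bytes each
bytes_w4 = params * 0.5     # 4-bit weights: 0.5 bytes each

gib = 1024 ** 3
print(f"fp16 weights:  {bytes_fp16 / gib:.1f} GiB")  # ~59.6 GiB, exceeds 32 GB
print(f"4-bit weights: {bytes_w4 / gib:.1f} GiB")    # ~14.9 GiB, leaves headroom
```

The remaining VRAM after loading the 4-bit weights is what holds the 16-bit activations and the KV cache during inference.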
High-quality calibration
Calibrated on 2048 samples with a maximum sequence length of 4096 to minimize the risk of overfitting to the calibration data.
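The quantization settings stated on this card can be summarized as a config sketch. The key names below are illustrative and not tied to any specific quantization library's schema:

```python
# Hypothetical recipe mirroring the card's stated settings; key names
# are illustrative, not a real library's configuration schema.
quant_recipe = {
    "method": "gptq",
    "bits": 4,                        # 4-bit weights
    "symmetric": False,               # asymmetric quantization
    "activation_bits": 16,            # W4A16: activations stay 16-bit
    "num_calibration_samples": 2048,
    "max_sequence_length": 4096,
}
```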
Model Capabilities
Text generation
Long sequence processing
Use Cases
Text generation
Long text generation
Supports long text generation with a maximum of 130,000 tokens.
© 2025 AIbase