🚀 Llamacpp imatrix Quantizations of remnant-glm4-32b by allura-org
This project provides quantized versions of the allura-org/remnant-glm4-32b model using llama.cpp. It offers a range of quantization types that balance model quality against file size, suitable for different hardware and usage scenarios.
🚀 Quick Start
Running the Model
- You can run the quantized models in LM Studio.
- Or run them directly with llama.cpp, or any other llama.cpp-based project.
Prompt Format
[gMASK]<sop><|system|>
{system_prompt}<|user|>
{prompt}<|assistant|>
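For a quick local test, the sketch below runs one of the downloaded quants with llama.cpp's llama-cli. The filename and the context/offload settings are illustrative, and it assumes a recent llama.cpp build where conversation mode applies the chat template embedded in the GGUF (i.e., the [gMASK]<sop> format shown above).
```bash
# Minimal sketch: interactive chat with a downloaded quant (filename and flags are illustrative)
# -cnv : conversation mode, applies the GGUF's embedded chat template
# -c   : context size; -ngl : layers to offload to GPU (use 0 for CPU-only)
./llama-cli -m allura-org_remnant-glm4-32b-Q4_K_M.gguf -cnv -c 8192 -ngl 99
```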
✨ Features
- Multiple Quantization Types: Offers a wide range of quantization types (e.g., bf16, Q8_0, Q6_K_L) to meet different requirements for quality and resource usage.
- Online Repacking: Some quantization types support online repacking of weights, which can improve performance on ARM and AVX machines.
📦 Installation
Prerequisites
First, make sure you have huggingface-cli installed:
pip install -U "huggingface_hub[cli]"
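As an optional sanity check (not part of the original instructions), you can confirm the CLI is on your PATH; logging in is not required for this public repository but can help with rate limits:
```bash
# Confirm the CLI is installed and available
huggingface-cli --help

# Optional: authenticate with a Hugging Face token (not needed for public repos)
huggingface-cli login
```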
Downloading a Specific File
You can target the specific file you want:
huggingface-cli download bartowski/allura-org_remnant-glm4-32b-GGUF --include "allura-org_remnant-glm4-32b-Q4_K_M.gguf" --local-dir ./
Downloading Split Files
If the model is bigger than 50GB, it will have been split into multiple files. To download them all to a local folder, run:
huggingface-cli download bartowski/allura-org_remnant-glm4-32b-GGUF --include "allura-org_remnant-glm4-32b-Q8_0/*" --local-dir ./
You can either specify a new local-dir (e.g., allura-org_remnant-glm4-32b-Q8_0) or download them all in place (./).
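For example, to keep the Q8_0 shards in their own folder (the folder name below is just an illustration), you could run:
```bash
# Download all Q8_0 split files into a dedicated local folder
huggingface-cli download bartowski/allura-org_remnant-glm4-32b-GGUF \
  --include "allura-org_remnant-glm4-32b-Q8_0/*" \
  --local-dir allura-org_remnant-glm4-32b-Q8_0
```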
💻 Usage Examples
Downloading Files
huggingface-cli download bartowski/allura-org_remnant-glm4-32b-GGUF --include "allura-org_remnant-glm4-32b-Q4_K_M.gguf" --local-dir ./
huggingface-cli download bartowski/allura-org_remnant-glm4-32b-GGUF --include "allura-org_remnant-glm4-32b-Q8_0/*" --local-dir ./
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Quantized By | bartowski |
| Pipeline Tag | text-generation |
| Base Model | allura-org/remnant-glm4-32b |
| License | apache-2.0 |
| Tags | roleplay, conversational, axolotl |
| Base Model Relation | quantized |
Downloadable Files
| Filename | Quant type | File Size | Split | Description |
|----------|------------|-----------|-------|-------------|
| [remnant-glm4-32b-bf16.gguf](https://huggingface.co/bartowski/allura - org_remnant - glm4 - 32b - GGUF/tree/main/allura - org_remnant - glm4 - 32b - bf16) | bf16 | 65.14GB | true | Full BF16 weights. |
| [remnant-glm4-32b-Q8_0.gguf](https://huggingface.co/bartowski/allura - org_remnant - glm4 - 32b - GGUF/blob/main/allura - org_remnant - glm4 - 32b - Q8_0.gguf) | Q8_0 | 34.62GB | false | Extremely high quality, generally unneeded but max available quant. |
| [remnant-glm4-32b-Q6_K_L.gguf](https://huggingface.co/bartowski/allura - org_remnant - glm4 - 32b - GGUF/blob/main/allura - org_remnant - glm4 - 32b - Q6_K_L.gguf) | Q6_K_L | 27.18GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended. |
| ... (other files are omitted here for brevity, but should be listed completely in the actual README) | | | | |
Embed/Output Weights
Some of these quants (Q3_K_XL, Q4_K_L, etc.) use the standard quantization method, but with the embeddings and output weights quantized to Q8_0 instead of their usual defaults.
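For reference, a variant like this could be produced with llama.cpp's llama-quantize by overriding the embedding and output tensor types. The command below is only a sketch of the general approach; file names are illustrative and the exact commands used for this repo are not published here.
```bash
# Sketch: quantize to Q4_K_M while forcing embed/output tensors to Q8_0 (an "_L"-style quant)
./llama-quantize \
  --token-embedding-type q8_0 \
  --output-tensor-type q8_0 \
  remnant-glm4-32b-bf16.gguf remnant-glm4-32b-Q4_K_L.gguf Q4_K_M
```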
ARM/AVX Information
Previously, you would download Q4_0_4_4/4_8/8_8, and these would have their weights interleaved in memory to improve performance on ARM and AVX machines. Now, there is "online repacking" for weights. Details can be found in this PR. If you use Q4_0 and your hardware would benefit from repacking weights, it will do it automatically on the fly.
As of llama.cpp build b4282, you will not be able to run the Q4_0_X_X files and will instead need to use Q4_0.
Additionally, if you want to get slightly better quality, you can use IQ4_NL thanks to this PR, which will also repack the weights for ARM, though only the 4_4 for now. The loading time may be slower but it will result in an overall speed increase.
Which File to Choose
A great write-up with charts comparing the performance of various quant types is provided by Artefact2 here. The first thing to figure out is how big your available memory is and what level of quality you need.
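As a rough rule of thumb, pick a quant whose file size is a couple of GB smaller than the memory you can dedicate to it (VRAM for full GPU offload, system RAM for CPU inference). A quick way to check what you have on Linux, purely as an illustration:
```bash
# Free system RAM (relevant for CPU inference)
free -h

# Total VRAM per NVIDIA GPU (relevant for full GPU offload)
nvidia-smi --query-gpu=memory.total --format=csv
```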
🔧 Technical Details
Quantization Process
The quantizations are done using llama.cpp release b5270. All quants are made using the imatrix option with a dataset from here.
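The general imatrix workflow in llama.cpp looks roughly like the sketch below; the calibration file and model names are placeholders, and this is not the exact command line used for this repo.
```bash
# 1) Compute an importance matrix from a calibration text file
./llama-imatrix -m remnant-glm4-32b-bf16.gguf -f calibration.txt -o imatrix.dat

# 2) Quantize using that importance matrix
./llama-quantize --imatrix imatrix.dat \
  remnant-glm4-32b-bf16.gguf remnant-glm4-32b-Q4_K_M.gguf Q4_K_M
```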
Performance Benchmark
Click to view benchmarks on an AVX2 system (EPYC7702)
| model | size | params | backend | threads | test | t/s | % (vs Q4_0) |
|-------|------|--------|---------|---------|------|-----|-------------|
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
| ... (other benchmark data are omitted here for brevity, but should be listed completely in the actual README) | | | | | | | |
Q4_0_8_8 offers a nice bump to prompt processing and a small bump to text generation.
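Benchmarks like the table above can be reproduced with llama.cpp's llama-bench; a minimal sketch follows (the model path and thread count are illustrative, and pp512/pp1024 correspond to the -p values).
```bash
# Prompt-processing benchmark at 512 and 1024 tokens on 64 CPU threads, skipping generation tests
./llama-bench -m qwen2-3b-q4_0.gguf -t 64 -p 512,1024 -n 0
```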
📄 License
This project is licensed under the Apache-2.0 license.