GLM-4-32B-0414 Neon v2
An RP-finetuned version of GLM-4-32B-0414, offering a rich personality, diverse responses, and intelligent interactions.

🚀 Quick Start
This model is an RP finetune of GLM-4-32B-0414. It has a nice personality, offers plenty of variety, and is quite smart, though it can be a bit quirky at times. It generates nice prose and doesn't mimic Claude or Gemini too much. Some structural repetition occurs, but that's common among modern LLMs. It seems to prefer JSON-formatted system prompts.
✨ Features
- Rich Personality: Provides diverse and engaging responses.
- Smart Interactions: Capable of intelligent conversations, though it may play dumb occasionally.
- Nice Prose: Generates high-quality text output.
- JSON Preference: Responds well to JSON-formatted system prompts (an illustrative sketch follows this list).
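Since the model seems to prefer JSON-formatted system prompts, one option is to serialize the system prompt as JSON before dropping it into the template. A minimal sketch follows; the field names are purely illustrative, not a schema the model card specifies:

```python
import json

# Hypothetical JSON-formatted system prompt; every key below is made up
# for illustration rather than taken from the model card.
system_prompt = json.dumps(
    {
        "role": "narrator",
        "style": "vivid third-person prose",
        "constraints": ["stay in character", "avoid structural repetition"],
    },
    indent=2,
)
print(system_prompt)
```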
💻 Usage Examples
Basic Usage
The model responds to GLM4 instruct formatting, exactly like its base model. Since backends struggle to add the BOS token automatically, you'll need to add it yourself. A Jinja template should work for chat completions.
```
[gMASK]<sop><|system|>
{system_prompt}<|user|>
{prompt}<|assistant|>
```
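For text-completion backends, a minimal sketch of assembling this prompt by hand, with the `[gMASK]<sop>` BOS sequence prepended manually, might look like this (the example strings are placeholders):

```python
def build_glm4_prompt(system_prompt: str, user_message: str) -> str:
    # Mirrors the GLM4 instruct template above: each role tag is followed
    # by a newline, then the message content, then the next tag. The BOS
    # sequence [gMASK]<sop> is added here since most backends won't.
    return (
        "[gMASK]<sop>"
        f"<|system|>\n{system_prompt}"
        f"<|user|>\n{user_message}"
        "<|assistant|>"
    )

prompt = build_glm4_prompt("You are a helpful roleplay partner.", "Hello!")
print(prompt)
```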
📚 Documentation
Training Notes
The model was trained on a dataset of 77M tokens of synthetic RP and short-story generation data for one epoch. Training took around 28 hours on a 4xRTX 3090 workstation, generously provided by OwenArli. Sane defaults were used for the training config. QLoRA plus CCE and sequence parallelism allowed a 16k-token context to fit on 96GB of VRAM. It trained more smoothly than the 9B model. However, the NaN Eval/Loss issue persists, and its cause is still unknown.
Huge thanks to ArliAI for providing compute and collaborating on this run!
Format
As mentioned before, the model follows the GLM4 instruct formatting. You need to manually add the BOS token as backends can't do it automatically.
Recommended Samplers
- Temperature: 1
- Min-P: 0.1
- Repetition Penalty: 1.03
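As a sketch of wiring these samplers into a request, the following assumes a KoboldCPP-style `/api/v1/generate` endpoint on the default port; the URL and payload keys come from KoboldCPP's API, not from the model card itself:

```python
import requests

payload = {
    # GLM4 instruct-formatted prompt with the BOS sequence added manually.
    "prompt": "[gMASK]<sop><|system|>\nYou are a storyteller.<|user|>\nHi!<|assistant|>",
    "max_length": 512,
    "temperature": 1.0,  # Temperature: 1
    "min_p": 0.1,        # Min-P: 0.1
    "rep_pen": 1.03,     # Repetition Penalty: 1.03
}
resp = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=300)
print(resp.json()["results"][0]["text"])
```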
Running on KoboldCPP and other backends
To run GGUFs correctly, you need the most recent version of KoboldCPP. Pass `--overridekv glm4.rope.dimension_count=int:64` to the CLI command, or put `glm4.rope.dimension_count=int:64` into the overridekv box in the GUI (under the Tokens tab at the very bottom).
Thanks to DaringDuck and tofumagnate for the information on how to apply this fix.
It should work out of the box on vLLM >=0.8.5. ExLLaMAv2 currently doesn't properly support GLM-4-32B, unlike the 9B model. EXL3 should work, but it's untested. The latest versions of the llama.cpp server should also run GGUFs out of the box.
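For vLLM, a minimal sketch might look like the following; the model path is a placeholder, and mapping the recommended samplers onto `SamplingParams` is an assumption:

```python
from vllm import LLM, SamplingParams

# Placeholder path; point this at the actual merged checkpoint.
llm = LLM(model="/path/to/GLM-4-32B-0414-Neon-v2", dtype="bfloat16")

params = SamplingParams(
    temperature=1.0,          # Temperature: 1
    min_p=0.1,                # Min-P: 0.1
    repetition_penalty=1.03,  # Repetition Penalty: 1.03
    max_tokens=512,
)

prompt = "[gMASK]<sop><|system|>\nYou are a storyteller.<|user|>\nHi!<|assistant|>"
print(llm.generate([prompt], params)[0].outputs[0].text)
```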
Training Config
See the full Axolotl config below.
```yaml
# Model
base_model: /home/owen/models/GLM-4-32B-0414
strict: false
model_type: AutoModelForCausalLM

# Liger Kernels and CCE (optimization)
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: false
liger_rms_norm: false
liger_glu_activation: false
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true

# Output and HuggingFace
output_dir: ./GLM-32B-Neon-v2
hub_model_id: AuriAetherwiing/GLM-32B-Neon-v2-LoRA
hf_use_auth_token: true
hub_strategy: "all_checkpoints"

# WandB
wandb_project: allura-org
wandb_entity:
wandb_name: GLM-32B-Neon-v2

# Data
#chat_template: chatml
#train_on_inputs: false
group_by_length: false
datasets:
  - path: ./Neon/neon.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value
    train_on_eos: all
  - path: ./Neon/S2.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value
    train_on_eos: all
  - path: ./Neon/SystemChat_subset_filtered_sharegpt_utf8fix.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value
    train_on_eos: all
dataset_prepared_path: ./lora_last_run_prepared
chat_template: jinja
chat_template_jinja: |
  [gMASK]<sop>{%- for msg in messages %}{%- if msg.role == 'system' %}<|system|>
  {{ msg.content }}{%- elif msg.role == 'user' %}<|user|>
  {{ msg.content }}{%- elif msg.role == 'assistant' %}<|assistant|>
  {{ msg.content }}{%- endif %}{%- endfor %}{% if add_generation_prompt %}<|assistant|>{% endif %}

## Evaluation
val_set_size: 0.005
evals_per_epoch: 8
eval_table_size:
eval_max_new_tokens: 128

# Technical aspects
sequence_len: 16384
save_safetensors: true
saves_per_epoch: 4
logging_steps: 1
#special_tokens:
#  pad_token: <pad>

# Quantization
bf16: auto
fp16:
tf32: false

## For LoRA
load_in_8bit: false
load_in_4bit: true

# LoRA
peft_use_rslora: false
peft_use_dora: false # better but slower
adapter: qlora # lora or qlora
lora_model_dir:
lora_r: 64 # 64 is optimal for most trains on instruct
lora_alpha: 64
lora_dropout: 0.1
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
#loraplus_lr_ratio: 8 # works to converge faster but is kinda cancer bc makes model unstable
#loraplus_lr_embedding:

# Training hyperparameters
# max_steps:
num_epochs: 1

# Anti Overfit and Stability
weight_decay: 0.01
max_grad_norm: 1.0

## Learning Rate
warmup_ratio: 0.05
learning_rate: 1e-5
lr_scheduler: rex
#lr_scheduler_kwargs:
#  min_lr: 0.0000024
optimizer: adamw_torch # usually adamw_torch or paged_adamw_8bit

## Batch Size
gradient_accumulation_steps: 32 # More effective batch size - stabler train, usually. MBS also speeds it up.
micro_batch_size: 1 # Batch size per gpu = micro_batch_size * gradient_accumulation_steps
eval_batch_size: 1

# Optimizations
pad_to_sequence_len: true
sample_packing: true
eval_sample_packing: false
flash_attention: true
xformers_attention:
gradient_checkpointing:
gradient_checkpointing_kwargs:
  use_reentrant: false
# Set to a divisor (> 1) of the number of GPUs available
sequence_parallel_degree: 4 # Split sequences across 4 GPUs
# Optional; strides across the key dimension. Larger values use more memory but should make training faster.
heads_k_stride: 1
# Optional; one of "varlen_llama3", "batch_ring", "batch_zigzag", "batch_stripe". Defaults to
# "varlen_llama3" when `sample_packing: true`, and "batch_ring" otherwise.
ring_attn_func:

# deepspeed: /home/owen/axolotl/deepspeed_configs/zero3_bf16_cpuoffload_all.json
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: false
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Glm4DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_activation_checkpointing: true
```
🔧 Technical Details
The model was trained using QLoRA plus CCE and sequence parallelism, which allowed a 16k-token context to fit on 96GB of VRAM. Training ran for one epoch on a dataset of 77M tokens of synthetic RP and short-story generation data. A NaN Eval/Loss issue occurred during training, and its cause is still unknown.
📄 License
The model is released under the MIT license.
| Property | Details |
|---|---|
| Model Type | RP finetune of GLM-4-32B-0414 |
| Training Data | Synthetic RP and short-story generation data (77M tokens) |
| Training Time | Around 28 hours on a 4xRTX 3090 workstation |
| Training Config | QLoRA, CCE, sequence parallelism |
| Recommended Samplers | Temperature 1, Min-P 0.1, Repetition Penalty 1.03 |
⚠️ Important Note
Backends struggle to add the BOS token automatically, so you need to add it yourself.
💡 Usage Tip
Use the Jinja template for chat completions. For running GGUFs on KoboldCPP, pass `--overridekv glm4.rope.dimension_count=int:64` to the CLI command or put it in the overridekv box in the GUI.

