GLM4-9B-Neon-v2 Open-source Role-playing Model - Enjoy a Smooth Experience with Beautiful Writing Style for Free

GLM4 9B Neon V2

Developed by allura-org

A role-playing fine-tuned model based on GLM-4-9B-0414, offering smooth role-playing experiences and elegant writing.

Large Language Model

Transformers

English | Open Source License: MIT | #Role-Playing Optimization #Long Context Support (16k) #QLoRA Efficient Fine-Tuning

Downloads 39

Release Time: 4/26/2025

Model Overview

This is a role-playing model fine-tuned from the GLM-4-9B-0414 large language model, featuring distinct personalities and fluent conversational abilities, especially suitable for role-playing and short story generation.

Model Features

Smooth Role-Playing Experience

The model is fine-tuned on synthetic role-play data to produce smooth, natural role-playing dialogue.

Elegant Writing Style

Generates high-quality text with a unique literary style.

Long Context Support

Trained at a sequence length of 16k (16,384) tokens.

Efficient Training

Trained with QLoRA and CCE (Cut Cross Entropy) to reduce memory usage.

Model Capabilities

Role-playing dialogue generation

Short story creation

Long text generation

JSON format system prompt processing

Use Cases

Entertainment

Role-Playing Games

Used for NPC dialogue generation in games

Provides distinctive character interaction experiences

Creative Writing

Short Story Creation

Assists writers in creative writing

Generates high-quality story segments

🚀 GLM-4-9B-0414 Neon v2

This is an RP finetune of GLM-4-9B-0414. It has a pleasant feel and plenty of personality, though it can be a bit quirky at times. It writes nice prose that doesn't read too much like Claude or Gemini. It doesn't seem to like overly long system prompts or character cards (charcards), and it appears to favor JSON-formatted system prompts. The model was trained by Auri.

Image by CalamitousFelicitousness

✨ Features

Personality: Has a lot of personality and generates text with a distinctive style.
Prose Quality: Produces nice prose, distinct from Claude and Gemini.
Prompt Preference: Prefers JSON-formatted system prompts; does less well with overly long system prompts or charcards.
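
Since the model appears to favor JSON-formatted system prompts, a compact character description could be serialized as in the sketch below. The source does not specify any schema, so the function name and the keys `character`, `persona`, and `scenario` are hypothetical and only illustrate the idea.

```python
import json


def make_json_system_prompt(name: str, persona: str, scenario: str) -> str:
    """Serialize a short character card as a JSON system prompt.

    The key names are illustrative assumptions, not a format the model requires.
    """
    card = {
        "character": name,
        "persona": persona,
        "scenario": scenario,
    }
    # ensure_ascii=False keeps non-ASCII names readable in the prompt.
    return json.dumps(card, ensure_ascii=False, indent=2)


system_prompt = make_json_system_prompt(
    "Mira", "A dry-witted innkeeper.", "A roadside inn at dusk."
)
print(system_prompt)
```

Keeping the card to a few short fields also fits the note that the model does not seem to like overly long system prompts.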

📚 Documentation

Training notes

The model was trained for one epoch on 77M tokens of synthetic RP and short story generation data. Training took around 11 hours on a 2xRTX 3090 workstation, generously provided by OwenArli. The training configuration used reasonable defaults. QLoRA plus CCE were employed to cut memory usage significantly, and 16k-token sequences fit on 48GB with some room to spare. The eval loss appears to be broken for reasons unknown; otherwise, training went smoothly.
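
As a rough back-of-the-envelope check, the figures above imply the training throughput computed below. This is only a sketch: "around 11 hours" is approximate, and the calculation ignores evaluation time and the small held-out validation split.

```python
# Figures taken from the training notes; the run time is approximate.
TOTAL_TOKENS = 77_000_000
TRAIN_HOURS = 11
NUM_GPUS = 2  # 2x RTX 3090

tokens_per_second = TOTAL_TOKENS / (TRAIN_HOURS * 3600)
tokens_per_second_per_gpu = tokens_per_second / NUM_GPUS

print(f"{tokens_per_second:.0f} tokens/s overall")        # about 1944
print(f"{tokens_per_second_per_gpu:.0f} tokens/s per GPU")  # about 972
```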

A huge thanks to ArliAI for providing compute resources and collaborating on this run!

Format

The model responds to GLM4 instruct formatting, exactly like its base model. Backends struggle to add the BOS token automatically, so you'll need to do it yourself. A Jinja template should work for chat completions.

[gMASK]<sop><|system|>
{system_prompt}<|user|>
{prompt}<|assistant|>
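
For backends that take a raw text prompt, the template above can be filled in by hand. The helper below is a minimal sketch for a single system prompt and a single user turn; the function and constant names are made up for illustration, and multi-turn chats would need the user/assistant segments repeated.

```python
# Prefix from the template above; added manually because backends
# may not insert the BOS token on their own.
BOS_PREFIX = "[gMASK]<sop>"


def build_glm4_prompt(system_prompt: str, prompt: str) -> str:
    """Fill the single-turn GLM4 instruct template with the given texts."""
    return (
        f"{BOS_PREFIX}<|system|>\n"
        f"{system_prompt}<|user|>\n"
        f"{prompt}<|assistant|>"
    )


print(build_glm4_prompt("You are Mira, an innkeeper.", "Any rooms left tonight?"))
```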

Recommended Samplers

There's nothing special here, just classic settings.

Temperature - 1
Min-P - 0.1
Repetition Penalty - 1.03
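
To illustrate what these three settings do to the next-token distribution, here is a minimal pure-Python sketch. It is not any backend's actual implementation: the function names are made up, the order in which samplers are applied differs between backends, and the repetition penalty shown is the common divide/multiply variant.

```python
import math

TEMPERATURE = 1.0
MIN_P = 0.1
REPETITION_PENALTY = 1.03


def apply_repetition_penalty(logits, seen_token_ids, penalty=REPETITION_PENALTY):
    """Make already-seen tokens slightly less likely."""
    out = list(logits)
    for tid in set(seen_token_ids):
        # Positive logits shrink, negative logits move further from zero.
        out[tid] = out[tid] / penalty if out[tid] > 0 else out[tid] * penalty
    return out


def softmax(logits, temperature=TEMPERATURE):
    """Convert logits to probabilities; temperature 1 leaves them unscaled."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p=MIN_P):
    """Drop tokens whose probability is below min_p times the top probability."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


logits = apply_repetition_penalty([2.0, 1.0, -1.0, -3.0], seen_token_ids=[0])
print(min_p_filter(softmax(logits)))
```

With Min-P at 0.1, any token less than one tenth as likely as the top candidate is removed before sampling, which is why a temperature of 1 stays coherent.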

Example master import for SillyTavern (using Shingane-v1 system prompt by Steelskull)

Running on KoboldCPP and other backends

To run GGUFs correctly, you need the most recent version of KoboldCPP. You'll need to pass --overridekv glm4.rope.dimension_count=int:64 to the CLI command or put glm4.rope.dimension_count=int:64 into the overridekv box in the GUI (under the Tokens tab at the very bottom).
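
If KoboldCPP is launched from a script, the override can be assembled as in the sketch below. The `--overridekv` value comes from the instructions above; the launcher name `koboldcpp`, the use of `--model` to select the file, and the GGUF file name are assumptions that may need adjusting for a given install.

```python
import shlex

# Key/value override quoted from the instructions above.
ROPE_OVERRIDE = "glm4.rope.dimension_count=int:64"


def koboldcpp_command(model_path: str, launcher: str = "koboldcpp") -> list[str]:
    """Build an argument list that applies the GLM4 rope override.

    `launcher` and `--model` are assumptions about the local setup.
    """
    return [launcher, "--model", model_path, "--overridekv", ROPE_OVERRIDE]


# Hypothetical file name, shown only to print a copy-pasteable command line.
print(shlex.join(koboldcpp_command("GLM4-9B-Neon-v2.Q6_K.gguf")))
```

The returned list can be passed directly to `subprocess.run` if the launcher is on the PATH.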

Thanks to DaringDuck and tofumagnate for the information on how to apply this fix.

To run this model on vLLM, you'll need to build it from source from the git repo, as full GLM4 support hasn't reached the release version yet.

Backends based on ExLLaMAv2 and v3, such as TabbyAPI, should support the model out of the box.

The latest versions of the llama.cpp server should also allow running GGUFs out-of-the-box.

Training config

See Axolotl config

# Model
base_model: /home/owen/models/GLM-4-9B-0414
strict: false
model_type: AutoModelForCausalLM

# Liger Kernels and CCE (optimization)
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: false
liger_rms_norm: false
liger_glu_activation: false
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true

# Output and HuggingFace
output_dir: ./GLM-9B-Neon-v2
hub_model_id: AuriAetherwiing/GLM-9B-Neon-v2-LoRA
hf_use_auth_token: true
hub_strategy: "all_checkpoints"

# WandB
wandb_project: allura-org
wandb_entity:
wandb_name: GLM-9B-Neon-v2

# === Data Configuration ===

# Data
#chat_template: chatml
#train_on_inputs: false
group_by_length: false
datasets:
  - path: ./Neon/neon.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value
  - path: ./Neon/S2.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value
  - path: ./Neon/SystemChat_subset_filtered_sharegpt_utf8fix.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value

dataset_prepared_path: ./lora_last_run_prepared

## Evaluation
val_set_size: 0.01
evals_per_epoch: 2
eval_table_size:
eval_max_new_tokens: 128

# Technical aspects
sequence_len: 16384
save_safetensors: true
saves_per_epoch: 2
logging_steps: 1
#special_tokens:
#  pad_token: <pad>
# Quantization
bf16: auto
fp16:
tf32: false
## For LoRA
load_in_8bit: false
load_in_4bit: true

# LoRA
peft_use_rslora: false
peft_use_dora: false # better but slower
adapter: qlora # lora or qlora
lora_model_dir:
lora_r: 64 # 64 is optimal for most trains on instruct
lora_alpha: 64
lora_dropout: 0.1
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:

# loraplus_lr_ratio: 8 # works to converge faster but is kinda cancer bc makes model unstable
#loraplus_lr_embedding:

# Training hyperparameters
# max_steps:
num_epochs: 1

# Anti Overfit and Stability
weight_decay: 0.01
max_grad_norm: 1.0

## Learning Rate
warmup_ratio: 0.05
learning_rate: 1e-5
lr_scheduler: rex
#lr_scheduler_kwargs:
#    min_lr: 0.0000024
optimizer: adamw_torch # usually adamw_torch or paged_adamw_8bit

## Batch Size
gradient_accumulation_steps: 32      # More effective batch size - stabler train, usually. MBS also speeds it up.
micro_batch_size: 1          # Batch size per gpu = micro_batch_size * gradient_accumulation_steps
eval_batch_size: 1

# Optimizations
pad_to_sequence_len: true
sample_packing: true
eval_sample_packing: false
flash_attention: true
xformers_attention:
gradient_checkpointing:
gradient_checkpointing_kwargs:
   use_reentrant: false
   
# Set to a divisor (> 1) of the number of GPUs available
#sequence_parallel_degree: 2  # Split sequences across 2 GPUs
# Optional; strides across the key dimension. Larger values use more memory but should make training faster.
#heads_k_stride: 1
# Optional; one of "varlen_llama3", "batch_ring", "batch_zigzag", "batch_stripe". Defaults to
# "varlen_llama3" when `sample_packing: true`, and "batch_ring" otherwise.
#ring_attn_func:
   
# deepspeed: /home/owen/axolotl/deepspeed_configs/zero3_bf16_cpuoffload_all.json

fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: false
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Glm4DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_activation_checkpointing: true
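
A few derived quantities help when reading the config above. The sketch below is an estimate under stated assumptions: both GPUs act as data-parallel ranks under FSDP, every packed sequence is completely full, and the 1% validation split is treated as 1% of tokens (the split is actually made by samples).

```python
# Values copied from the config and training notes.
SEQUENCE_LEN = 16384
MICRO_BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 32
NUM_GPUS = 2
VAL_SET_SIZE = 0.01
TOTAL_TOKENS = 77_000_000
WARMUP_RATIO = 0.05
LORA_R = 64
LORA_ALPHA = 64

# Sequences and tokens consumed per optimizer step across both GPUs.
sequences_per_step = MICRO_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_GPUS
tokens_per_step = sequences_per_step * SEQUENCE_LEN

# Rough optimizer step count for one epoch, and the warmup share of it.
train_tokens = TOTAL_TOKENS * (1 - VAL_SET_SIZE)
approx_steps = train_tokens / tokens_per_step
approx_warmup_steps = WARMUP_RATIO * approx_steps

# With peft_use_rslora: false, the LoRA update is scaled by alpha / r.
lora_scaling = LORA_ALPHA / LORA_R

print(sequences_per_step, tokens_per_step)
print(f"about {approx_steps:.0f} optimizer steps, "
      f"about {approx_warmup_steps:.1f} of them warmup")
print(f"LoRA scaling: {lora_scaling}")
```

Under these assumptions the whole epoch amounts to well under a hundred optimizer steps, each covering roughly a million tokens.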

📄 License

This model is released under the MIT license.

Property	Details
Model Type	RP finetune of GLM-4-9B-0414
Training Data	Datasets: allura-org/Celeste-Filtered, allura-org/neon-41k, EVA-UNIT-01/Lilith-v0.2. 77M tokens of synthetic RP and short story gen data for one epoch.
Base Model	THUDM/GLM-4-9B-0414
Library Name	transformers
