L3-8B-Stheno-v3.3-32K Open-Source Model - Free Deployment to Support Role-Playing and Creative Writing

L3 8B Stheno V3.3 32K

Developed by Sao10K

A 32K long-context model fine-tuned from Llama-3-8B. It extends the context length through PoSE training and specializes in role-playing and creative writing tasks.

Large Language Model

Transformers

English #PoSE extended 32K context #Role-playing optimization #Creative writing enhancement

Downloads 541

Release Time: 6/22/2024

Model Overview

This model is an optimized version of Llama-3-8B that extends the context from 8K to 32K using PoSE training, with enhanced capabilities for role-playing and creative writing while maintaining fundamental language understanding.

Model Features

Extended context processing

Extends context length from 8K to 32K through PoSE training, which basic tests suggest handles long context better than conventional RoPE scaling

High-quality role-playing

Deeply cleaned and manually curated role-playing samples provide excellent interactive experiences

Creative writing enhancement

Doubled creative writing training samples significantly improve generation quality

Optimized training configuration

Uses a RoPE theta of 2 million, which gave the best training loss of the values tried while keeping grad_norm stable

Model Capabilities

Long text generation

Role-playing dialogue

Creative content generation

Instruction following

Context understanding

Use Cases

Entertainment & creation

Interactive role-playing

Immersive role-playing dialogues with AI

Subjective experience reports show excellent interaction quality

Creative writing assistance

Generating creative texts like novels and poetry

Training data shows a 2x increase in creative writing samples

Long document processing

Long document summarization

Handling summarization tasks for documents that fit within the 32K-token context

Basic tests show superiority over conventional rope scaling solutions

🚀 L3-8B-Stheno-v3.3-32K Model README

This README provides detailed information about the training, features, and configurations of the L3-8B-Stheno-v3.3-32K model.

🚀 Quick Start

This section offers a high-level overview of the model and its training details. The model was trained with compute resources from Backyard.ai. Special thanks go to them and @dynafire for their assistance.

✨ Features

Training Details

The model was initially trained at an 8K context and then expanded to a 32K context using PoSE training.
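
The idea behind PoSE (Positional Skip-wisE training) can be sketched as follows: the model still sees only 8K tokens per sample, but their position ids are spread over the 32K target range by skipping positions between chunks. The function name, the two-chunk split, and the uniform sampling are illustrative assumptions, not Axolotl's actual implementation.

```python
import random

def pose_position_ids(train_len, target_len, rng=None):
    """Return train_len position ids drawn from [0, target_len) as two contiguous chunks.

    Hypothetical helper for illustration only.
    """
    rng = rng or random.Random()
    # Split the training window into two chunks at a random point.
    split = rng.randint(1, train_len - 1)
    # Shift the second chunk by a random skip so ids can reach the end of the target window.
    skip = rng.randint(0, target_len - train_len)
    first = list(range(0, split))
    second = list(range(split + skip, train_len + skip))
    return first + second

# Values taken from the config: sequence_len = 8192, pose_max_context_len = 32768.
ids = pose_position_ids(8192, 32768, random.Random(0))
print(len(ids), ids[0], ids[-1])
```

Every sample still contains 8192 tokens, but across many samples the model is exposed to relative distances spanning the full 32K range.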

Dataset Modifications

Roleplaying samples were further cleaned up and underwent a quality check.
Low-quality samples identified through manual checks were removed, raising the baseline quality floor.
The number of creative writing samples was doubled.
The detailed instruct data was remade and refined.

Model Notes

The training run is less aggressive compared to previous Stheno versions.
The model functions well when tested in bf16 with the same configurations as in the file.
The effects of quantization on the model are unknown.
It performs well in role-playing scenarios.
There are some issues with long-context understanding and reasoning, but it handles them better than plain RoPE scaling.
Note that this is not a native 32K model; it has its problems but is coherent and works well.

Sanity Check // Needle in a Haystack Results

This is a basic evaluator, less complex than RULER or NIAN. Some improperly configured training runs had Haystack scores ranging from red to orange across most of the extended context range.
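
For readers unfamiliar with this kind of check, a needle-in-a-haystack test hides one fact at a chosen depth inside long filler text and asks the model to retrieve it. The sketch below is a generic illustration with hypothetical helper names and a made-up needle; the source does not describe how the evaluator used here was built.

```python
def build_haystack(filler_sentence, needle, n_sentences, depth):
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) inside repeated filler."""
    sentences = [filler_sentence] * n_sentences
    index = round(depth * n_sentences)
    sentences.insert(index, needle)
    return " ".join(sentences)

def retrieved(answer, expected):
    """Crude pass/fail: does the model's answer contain the expected fact?"""
    return expected.lower() in answer.lower()

# Hypothetical needle and filler; a real run would sweep depth and context length.
needle = "The secret passphrase is 'amber lantern'."
prompt = build_haystack("The sky was grey that day.", needle, n_sentences=1000, depth=0.5)
question = "What is the secret passphrase?"
# `prompt + " " + question` would be sent to the model and the reply scored with retrieved().
```

Sweeping `depth` and `n_sentences` and scoring each reply yields a grid of results over depth and context length.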

📚 Documentation

Relevant Axolotl Configurations

The configurations were taken from winglian/Llama-3-8b-64k-PoSE. After hours of tinkering, the configurations used by the author of that model worked best, so they were adopted.
A RoPE theta of 2M gave the best training loss of the values tried.
Leaving it at 500K was not significantly worse, but 4M and 8M made the grad_norm values worse even though the loss dropped quickly.
Mixing in pretraining data was difficult and made the formatting worse.
Pretraining data or added noise also made the Haystack results worse; the scores were mainly orange instead of all green.
An improper RoPE theta shows up as grad_norm exploding into the thousands. It does fall back to low values, but the drop is very fast and is worth treating as a warning sign even with gradient clipping.
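
A rough way to see what changing the theta value does: in standard RoPE, each pair of head dimensions rotates at a frequency derived from theta, and a larger theta stretches the longest wavelengths. The sketch below assumes a head dimension of 128 (Llama-3-8B uses a hidden size of 4096 with 32 attention heads). It only illustrates the frequency arithmetic and is not meant to explain the grad_norm behavior reported above.

```python
import math

def rope_wavelengths(theta, head_dim=128):
    """Wavelength, in token positions, of each RoPE frequency pair."""
    # Standard RoPE: inv_freq_i = theta ** (-2i / head_dim), for i = 0 .. head_dim/2 - 1,
    # so the wavelength of pair i is 2 * pi * theta ** (2i / head_dim).
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]

for theta in (500_000.0, 2_000_000.0, 8_000_000.0):
    longest = rope_wavelengths(theta)[-1]
    print(f"theta={theta:>12,.0f}  longest wavelength ~ {longest:,.0f} positions")
```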

sequence_len: 8192
use_pose: true
pose_max_context_len: 32768

overrides_of_model_config:
  rope_theta: 2000000.0
  max_position_embeddings: 32768

# peft_use_dora: true
adapter: lora
peft_use_rslora: true
lora_model_dir:
lora_r: 256
lora_alpha: 256
lora_dropout: 0.1
lora_target_linear: true
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

warmup_steps: 80
gradient_accumulation_steps: 6
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine_with_min_lr
learning_rate: 0.00004
lr_scheduler_kwargs:
    min_lr: 0.000004
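
Two quantities implied by the values above may be easier to read as arithmetic: the adapter scaling that rank-stabilized LoRA applies, and the general shape of a warmup plus cosine schedule that decays to a minimum learning rate. This is a minimal sketch, not the trainer's actual scheduler code; `TOTAL_STEPS` is a hypothetical value, since the real step count is not given.

```python
import math

LEARNING_RATE = 0.00004  # learning_rate
MIN_LR = 0.000004        # lr_scheduler_kwargs.min_lr
WARMUP_STEPS = 80        # warmup_steps
LORA_R, LORA_ALPHA = 256, 256

def lr_at(step, total_steps):
    """Linear warmup, then cosine decay from LEARNING_RATE down to MIN_LR."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (LEARNING_RATE - MIN_LR) * cosine

# Classic LoRA scales the update by alpha / r; rank-stabilized LoRA uses alpha / sqrt(r).
classic_scale = LORA_ALPHA / LORA_R
rslora_scale = LORA_ALPHA / math.sqrt(LORA_R)

TOTAL_STEPS = 1000  # hypothetical; the real step count is not given in the source
for step in (0, 40, 80, 540, 1000):
    print(step, lr_at(step, TOTAL_STEPS))
print("classic LoRA scale:", classic_scale, "rsLoRA scale:", rslora_scale)
```

With `lora_r` and `lora_alpha` both at 256, the rank-stabilized variant scales the adapter update by 16 rather than 1, so enabling `peft_use_rslora` is a meaningful choice at this rank.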

📄 License

This model is licensed under the CC-BY-NC-4.0 license.
