QwQ-32B-ArliAI-RpR-v4
QwQ-32B-ArliAI-RpR-v4 is a fine-tuned model from ArliAI. It focuses on role-playing and creative writing, with features like reduced repetition, an increased training sequence length, and better performance in long multi-turn chats.
Image generated using Arli AI Image Generation https://www.arliai.com/image-generation
Features
RpR v4 Changes
- Reduced repetitions and impersonation: To enhance the creativity and out-of-the-box thinking of RpR v3, a more advanced filtering method was employed to eliminate examples where the LLM repeated similar phrases or impersonated the user (a toy illustration of the idea follows this list). Any remaining repetition or impersonation is due to the training of the base QwQ model, not the RpR dataset.
- Increased training sequence length: The training sequence length was extended to 16K to improve awareness and memory, even in longer chats.
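The exact filtering method used for v4 is not published. Purely as a toy illustration of the idea, a heuristic pass over conversation examples could flag impersonation and repeated phrasing along these lines (the function names and the 6-gram threshold are assumptions, not the actual RpR tooling):

```python
# Toy illustration only; the real RpR v4 filtering pipeline is not published.
# Flags assistant turns that appear to speak for the user ("impersonation")
# or that reuse long word n-grams from earlier assistant turns ("repetition").

def ngrams(text: str, n: int = 6) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 0))}

def should_drop(assistant_text: str, user_name: str, earlier_assistant_texts: list[str]) -> bool:
    # Impersonation heuristic: the assistant writes lines as the user.
    if f"{user_name}:" in assistant_text:
        return True
    # Repetition heuristic: a long n-gram was already used by the assistant earlier.
    seen: set[tuple[str, ...]] = set()
    for prev in earlier_assistant_texts:
        seen |= ngrams(prev)
    return bool(ngrams(assistant_text) & seen)
```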
RpR Series Overview: Building on RPMax with Reasoning
RpR (RolePlay with Reasoning) is a new model series from ArliAI, building directly on the successful dataset curation and training methods of the RPMax series. These models use the same curated, deduplicated RP and creative writing dataset as RPMax, emphasizing variety to ensure high creativity and minimize cross-context repetition.
With the release of QwQ, the first high-performing open-source reasoning model that can be easily trained, it was found that existing instruct and creative writing reasoning datasets had only one response per example. This single-response data led to degraded output quality in long multi-turn chats. So, Arli AI created a real RP model capable of long multi-turn chat with reasoning.
To create RpR, the existing RPMax dataset was re-processed into a reasoning dataset. The base QwQ Instruct model was used to create the reasoning process for each turn in the RPMax dataset conversation examples, which were then refined. The training run was completed using axolotl with a manual template-free segments dataset to ensure the model was never trained to see the reasoning block in the context, just like during inference.
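For illustration only (the actual RpR training files are not published here), a template-free "segments"-style example of this kind might look roughly like the following, with loss masked on everything except the current turn's reasoning and reply; the ChatML markers and `<think>` delimiters follow QwQ's chat format:

```python
# Hypothetical shape of one template-free "segments" training example.
# Earlier turns are stored without their <think> blocks and are loss-masked
# (label=False); only the current turn's reasoning + reply is trained on.
example = {
    "segments": [
        {"label": False, "text": "<|im_start|>system\nYou are Mira, a wandering bard.<|im_end|>\n"},
        {"label": False, "text": "<|im_start|>user\nThe innkeeper waves you over.<|im_end|>\n"},
        {"label": False, "text": "<|im_start|>assistant\nMira picks up her lute and crosses the room.<|im_end|>\n"},
        {"label": False, "text": "<|im_start|>user\nHe asks for a song about the northern war.<|im_end|>\n<|im_start|>assistant\n"},
        {"label": True,  "text": "<think>\n...reasoning generated for this turn...\n</think>\n\nMira's fingers hesitate over the strings...<|im_end|>\n"},
    ]
}
```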
The result is consistently coherent and interesting outputs, even in long multi-turn RP chats. As far as we know, this is the first correctly trained reasoning model for RP and creative writing.
You can access the model at https://arliai.com, and there is also a models ranking page at https://www.arliai.com/models-ranking. You can ask questions on the Discord server https://discord.com/invite/t75KbPgwhk or the subreddit https://www.reddit.com/r/ArliAI/.
Installation
No installation steps were provided in the original README.
Usage Examples
No code examples were provided in the original README.
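As a minimal sketch (assuming the standard Hugging Face transformers stack, which the card does not specify), loading and querying the model could look like this:

```python
# Minimal sketch, assuming transformers + torch (pip install transformers accelerate torch).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ArliAI/QwQ-32B-ArliAI-RpR-v4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a creative roleplay partner."},
    {"role": "user", "content": "The tavern door creaks open..."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Use a high max_new_tokens so the <think> reasoning block has room to finish.
output = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=1.0)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```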
Documentation
Model Description
QwQ-32B-ArliAI-RpR-v4 is the fourth release in the RpR series. It is a 32-billion-parameter model fine-tuned using the RpR dataset, based on the curated RPMax dataset, combined with techniques to maintain reasoning abilities in long multi-turn chats.
Recommended Samplers
- RpR models do not work well with repetition-penalty-style samplers, even more advanced ones such as XTC or DRY.
- They work best with simple sampler settings and a high max tokens value to allow for longer reasoning.
- You can download the ST master export uploaded in the files section of this repo.
It is recommended to start with the following values (a sketch mapping them to generation parameters follows this list):
- Temperature: 1.0
- MinP: 0.02
- TopK: 40
- Response Tokens: 2048+
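As a sketch, these starting values map onto Hugging Face transformers generation arguments roughly as follows (parameter names are transformers'; backends such as SillyTavern or llama.cpp expose equivalents under similar names, and `min_p` requires a recent transformers version):

```python
# Sketch: the recommended starting samplers expressed as generation arguments.
from transformers import GenerationConfig

gen_config = GenerationConfig(
    do_sample=True,
    temperature=1.0,         # Temperature
    min_p=0.02,              # MinP
    top_k=40,                # TopK
    max_new_tokens=2048,     # Response Tokens: 2048+ so reasoning can finish
    repetition_penalty=1.0,  # leave repetition-penalty-style samplers off
)
# output = model.generate(inputs, generation_config=gen_config)
```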
Specs
| Property | Details |
|----------|---------|
| Model Type | QwQ-32B-ArliAI-RpR-v4 |
| Base Model | QwQ-32B |
| Max Context Length | Max 128K with YaRN (natively 32K like base QwQ; see the sketch below) |
| Parameters | 32B |
| Reasoning Model | Yes |
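The card does not show how to enable the extended context. A sketch following the YaRN `rope_scaling` convention documented for QwQ/Qwen2.5-family models (the factor of 4.0 over the native 32K is the commonly cited value and an assumption here, not taken from this README):

```python
# Sketch: extending context beyond the native 32K with YaRN.
# Verify the exact values against the base QwQ-32B documentation.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "ArliAI/QwQ-32B-ArliAI-RpR-v4"

config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {
    "type": "yarn",
    "factor": 4.0,                              # 32K * 4 = 128K
    "original_max_position_embeddings": 32768,  # native context of the base model
}
config.max_position_embeddings = 131072

model = AutoModelForCausalLM.from_pretrained(
    model_id, config=config, torch_dtype="auto", device_map="auto"
)
```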
Training Details
| Property | Details |
|----------|---------|
| Sequence Length | 16384 |
| Epochs | 1 epoch training (inherited from RPMax methods) |
| Fine-tuning Method | RS-QLORA+ (Rank-Stabilized LoRA + LoRA Plus 8x; see the sketch below) |
| Rank/Alpha | 128-rank, 128-alpha |
| Learning Rate | 0.00001 |
| Scheduler | Rex |
| Gradient Accumulation | 32 |
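The actual run used axolotl; purely to show how the listed hyperparameters map onto common flags, a peft `LoraConfig` sketch might look like the following (dropout and target modules are assumptions not stated in the card, and the "LoRA Plus 8x" ratio is applied through a separate LoRA+ optimizer rather than this config):

```python
# Illustrative mapping of the listed hyperparameters onto peft flags;
# not the actual axolotl configuration used for RpR.
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,                        # Rank
    lora_alpha=128,               # Alpha
    use_rslora=True,              # Rank-Stabilized LoRA
    lora_dropout=0.0,             # assumption; not stated in the card
    target_modules="all-linear",  # assumption; not stated in the card
    task_type="CAUSAL_LM",
)
# Training schedule per the table: 1 epoch, learning rate 1e-5,
# Rex scheduler, gradient accumulation 32.
```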
Very Nice Training graphs :)
Quantization
| Property | Details |
|----------|---------|
| BF16 | https://huggingface.co/ArliAI/QwQ-32B-ArliAI-RpR-v4 |
| GGUF | https://huggingface.co/ArliAI/QwQ-32B-ArliAI-RpR-v4-GGUF |
How to use reasoning models correctly in ST (SillyTavern)
For reasoning models:
- Set the prefix to ONLY `<think>` and the suffix to ONLY `</think>`, without spaces or newlines.
- Ensure the reply starts with `<think>`.
- Uncheck "Always add character names".
- Set "Include names" to never.
- The chat template should conform to the model being used.
Note: Reasoning models work properly only when "include names" is set to never. If enabled, it appends the character name at the end, confusing the model.
The rest of the sampler parameters can be set as desired.
If the reasoning is not wrapped inside the thinking block, the settings may be incorrect or the ST version may be too old. If the whole response is in the reasoning block, there may be an extra space or newline in the `<think>` and `</think>` tokens.
If everything is set up correctly, the reasoning appears inside ST's collapsible thinking block and the visible reply follows below it.
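Outside of ST, a client calling the model directly has to separate the reasoning block from the visible reply itself. A minimal sketch, assuming the standard `<think>...</think>` delimiters:

```python
# Sketch: split a raw completion into (reasoning, visible reply).
import re

def split_reasoning(text: str) -> tuple[str, str]:
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()          # no (or unterminated) thinking block
    reasoning = match.group(1).strip()
    reply = text[match.end():].strip()   # everything after </think>
    return reasoning, reply
```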
The RPMax Foundation (Dataset & Training Philosophy)
The following sections detail the core philosophy behind the dataset and training methodology originally developed for RPMax, which serves as the foundation for the RpR series.
The Goal: Reduced Repetition and Higher Creativity
The goal of the dataset curation for RPMax and RpR is to reduce repetitions and increase the models' ability to write creatively in different situations. This means the models will output responses differently across various scenarios.
What is repetition and creativity?
Creativity refers to the variety in the model's output, not just pleasing prose. Repetition and creativity are intertwined. There are two types of repetition:
- In-context repetition: the model repeats the same phrases within a single conversation. While it can make the writing seem boring, in some cases it can be intentional. RPMax and RpR do not yet focus on eliminating this type of repetition.
- Cross-context repetition: the model repeats the same phrases or tropes across different situations. It is always bad, as it indicates overfitting. The primary goal of the dataset curation is to reduce cross-context repetition by ensuring the dataset has no repetitions of the same situations or characters.
Dataset Curation
The success of models trained on this dataset is due to the training method and the unique dataset. It includes many open-source creative writing and RP datasets from Hugging Face, with synthetic generations removed. Llama 3.1 8B is used to de-dupe the datasets.
The Golden Rule of Fine - Tuning
For fine-tuning, quality is more important than quantity. The dataset used is smaller but results in a unique model.
Training Parameters and Unconventional Approach
The RPMax and RpR methodology uses a single epoch, low gradient accumulation, and a higher-than-normal learning rate. The loss curve is unstable during training but decreases over time. This approach allows the models to learn from each example without reinforcing single tropes.
Technical Details
The technical details are covered in the above sections, including the model's architecture, training methods, and dataset curation.
License
The model is licensed under the Apache-2.0 license.