Phi-3-medium-4k-instruct-abliterated-v3
This is a text-generation model based on the orthogonalized bfloat16 safetensor weights of microsoft/Phi-3-medium-4k-instruct, with the goal of inhibiting the model's ability to express refusal.
Quick Start
You can use the model through the widget on the Hugging Face page. Here is an example of input in the widget:
    {
      "messages": [
        {
          "role": "user",
          "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"
        }
      ]
    }
Features
- Orthogonalized Weights: This model uses orthogonalized bfloat16 safetensor weights, which are generated based on a refined methodology described in the paper 'Refusal in LLMs is mediated by a single direction'.
- Inhibited Refusal: By manipulating certain weights, the model's ability to express refusal is "inhibited". However, it is not guaranteed that the model will not refuse or lecture about ethics/safety.
- Less Data Requirement: Compared with fine-tuning, the ablation method used in this model requires much less data and keeps most of the original model's knowledge and training intact.
Documentation
Summary
This is microsoft/Phi-3-medium-4k-instruct with orthogonalized bfloat16 safetensor weights, generated with a refined methodology based on the preview paper/blog post 'Refusal in LLMs is mediated by a single direction'. Reading the paper is recommended for a fuller understanding of the technique.
Hang on, "abliterated"? Orthogonalization? Ablation? What is this?
- Explanation of "Abliterated": It is a play on words combining "ablate" and "obliterated", used to differentiate this model from "uncensored" fine-tunes.
- Orthogonalization/Ablation: Here, these two terms refer to the same thing: the refusal feature is "ablated" from the model via orthogonalization. Certain weights have been manipulated to "inhibit" the model's ability to express refusal. In every other respect it behaves the same as the original instruct model, just with the strongest refusal directions orthogonalized out.
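The orthogonalization step itself can be sketched in a few lines. Below is a minimal, illustrative NumPy example; the random weight matrix and single-layer setup are assumptions for demonstration, whereas the real method ablates a refusal direction estimated from model activations across many layers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in weight matrix (d_out x d_in) and a unit "refusal direction"
# in the output space. In the real method this direction is estimated
# from model activations, not drawn at random.
W = rng.standard_normal((8, 16))
r = rng.standard_normal(8)
r /= np.linalg.norm(r)

# Orthogonalize: subtract the component of each column of W that lies
# along r, so the layer can no longer write into that direction.
W_ablated = W - np.outer(r, r @ W)

# Any output W_ablated @ x is now orthogonal to r.
x = rng.standard_normal(16)
print(abs(r @ (W_ablated @ x)))  # ~0
```

Because only a rank-one component is removed, the rest of the layer's behaviour is left intact, which is why ablation preserves most of the original model's knowledge.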
A little more on the methodology, and why this is interesting
- Advantages of Ablation: Ablation is good for inducing/removing very specific features. You can apply your system prompt in the ablation script against a blank system prompt on the same dataset and orthogonalize for the desired behaviour in the final model weights. It requires much less data than fine-tuning and keeps most of the original model's knowledge and training intact.
- Comparison with Fine-Tuning: Fine-tuning is still useful for broad behaviour changes. However, you may be able to get close to your desired behaviour with very few samples using the ablation/augmentation techniques. You can also combine orthogonalization and fine-tuning, e.g. orthogonalize -> fine-tune or vice versa.
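One common way such a behaviour direction is found (a sketch based on the cited paper's difference-of-means idea, not the author's exact script, with fabricated activations) is to collect activations on prompts that elicit the behaviour and on matched prompts that don't, then normalize the difference of their means:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # hypothetical hidden dimension

# Fabricated activation samples at one layer: we plant a known offset
# along a hidden direction to stand in for the "refusal" behaviour.
true_dir = np.zeros(d)
true_dir[0] = 1.0
refusal_acts = rng.standard_normal((100, d)) + 5.0 * true_dir
harmless_acts = rng.standard_normal((100, d))

# Difference-of-means direction, normalized to unit length.
r = refusal_acts.mean(axis=0) - harmless_acts.mean(axis=0)
r /= np.linalg.norm(r)

# The recovered direction should align closely with the planted one.
print(float(abs(r @ true_dir)))  # close to 1
```

With enough contrasting samples the noise averages out, which is why this approach needs far fewer examples than fine-tuning.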
Okay, fine, but why V3? There's no V2?
The author previously released a V2 of an abliterated Meta-Llama-3-8B model. For larger models, the author wanted to refine the methodology before spending further compute cycles. The latest methodology appears to induce fewer hallucinations, so the author jumped to V3 to mark the advancement.
Quirkiness awareness notice
This model may have some quirks due to the new methodology. You are encouraged to play with the model and post any quirks you notice in the Community tab. If you develop further improvements, please share them. You can also reach the author on the Cognitive Computations Discord or through the Community tab.
License
This project is licensed under the MIT License.