🚀 Llama-3.2-4X3B-MOE-Ultra-Instruct-10B
A powerful Llama 3.2 model using mixture of experts, excelling in instruction following and output generation across a wide range of use cases.
🚀 Quick Start
This is a Llama 3.2 model with a maximum context of 128k (131,072 tokens). It combines FOUR top L3.2 3B models into one 10B-parameter powerhouse using mixture of experts. It requires the Llama3 template and/or the "Command-R" template.
✨ Features
- Exceptional Performance: Outstanding in instruction following and output generation for all use cases.
- High Speed: Achieves 58+ t/s (2 experts) on a low-end 16GB card (IQ4XS), and double the speed on standard/mid-range video cards.
- All-Use-Case Model: Suitable for reasoning, prompting, and programming.
- Bullet-Proof Design: Operates with all parameters, including temp settings from 0 to 5.
- Low Perplexity: An extraordinarily compressed model with a very low perplexity level (lower than Meta Llama3 Instruct).
📚 Documentation
Model Notes
- Writing Abilities: Detail, prose, and fiction writing abilities are off the scale compared to all Llama 3.2 models and many L3.1 and L3 8B+ models.
- Role-Players: Be careful about raising the temp too high, as it may affect instruction following.
- Rep Pen: This model works with a rep pen of 1 or higher, 1.02+ recommended.
- Specific Prose: If you want a specific type of prose (e.g., horror), add "(vivid horror)" or "(graphic vivid horror)" (no quotes) in your prompt(s).
- Output Length: Output length will vary, but this model prefers medium/short outputs unless you specify the size.
- Quants for Creative Uses: Different quants will produce slightly different output for creative uses.
- Quant Performance: Due to the high stability and compressed nature of this model, all quants will operate at above-average levels.
- Source Code: Source code for this model and Imatrix GGUF versions will be uploaded shortly in separate repos.
Meet the Team: Mixture of Experts Models
This model is composed of the following 4 models ("the experts"):
- [https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) (general)
- [https://huggingface.co/ValiantLabs/Llama3.2-3B-Enigma](https://huggingface.co/ValiantLabs/Llama3.2-3B-Enigma) (programming)
- [https://huggingface.co/Lyte/Llama-3.2-3B-Overthinker](https://huggingface.co/Lyte/Llama-3.2-3B-Overthinker) (logic/tree of thought)
- [https://huggingface.co/prithivMLmods/Llama-3.2-3B-Promptist-Mini](https://huggingface.co/prithivMLmods/Llama-3.2-3B-Promptist-Mini) (prompting)
The mixture of experts is set at 2 experts by default, but you can use 3 or 4. You can set the number of experts in different applications as follows:
- LMStudio: Set at the "load" screen (https://lmstudio.ai).
- Text-Generation-Webui: Set at the loading screen page (https://github.com/oobabooga/text-generation-webui).
- KoboldCpp (Version 1.8+): On the load screen, click on "TOKENS" and set experts on this page, then launch the model (https://github.com/LostRuins/koboldcpp).
- server.exe / llama-server.exe: Add "--override-kv llama.expert_used_count=int:X" to the command line (where "X" is the number of experts to use) (llama.cpp - https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md).
- API: Set "num_experts_used" in the JSON payload (this may vary for different back-ends); see the sketch after this list.
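To make the last two options concrete, here is a minimal sketch of querying a local llama.cpp llama-server instance launched with the expert-count override. The model filename, port, and the per-request "num_experts_used" key are assumptions for illustration only; check your back-end's documentation for the payload keys it actually accepts.

```python
# Minimal sketch (assumptions noted): the server is assumed to have been started with
#   llama-server -m <your-quant-of-this-model>.gguf --override-kv llama.expert_used_count=int:3
# and to expose llama.cpp's /completion endpoint on the default port.
import requests

payload = {
    "prompt": "Explain mixture-of-experts routing in two sentences.",
    "n_predict": 256,          # llama.cpp server's name for max new tokens
    "temperature": 0.8,
    "repeat_penalty": 1.02,
    # Some back-ends accept a per-request expert count; this key name is
    # hypothetical and many back-ends only honor the value set at load time.
    "num_experts_used": 3,
}

response = requests.post("http://localhost:8080/completion", json=payload, timeout=300)
print(response.json()["content"])
```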
Special Operations Notes for this MOE model
- Output Variation: Even though the default is 2 experts, the "selected" 2 will vary during generation, so the same prompt can produce vastly different output from one generation to the next. It may take 2-4 regens (of the same prompt) to get the highest quality; see the sketch after this list.
- Samplers: This model responds very well to Dry, Dynamic Temp, and Smooth/Quadratic samplers. Using these in conjunction with the model can greatly improve output quality.
- Higher Temps: Higher temps (above 1) can aid in generation, especially in word choice/sentence generation.
- Expert Number and Output: Increasing the number of experts used will increase output quality at the cost of tokens per second speed. You may want to adjust temp, samplers, and advanced samplers when changing the number of experts.
- Quant Impact: Your quant choice will impact instruction following and output generation. Higher quants will make the model understand more nuanced instructions and generate stronger output.
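Because the active experts can change between runs, a simple way to apply the regen advice above is to generate several candidates for the same prompt and keep the best one. The sketch below reuses the hypothetical local llama-server setup from the earlier example; the "seed" key follows llama.cpp's server API, and other back-ends may name it differently.

```python
# Sketch: generate several candidates for one prompt and pick the best by hand.
# Assumes the same hypothetical local llama-server endpoint as the earlier example.
import requests

PROMPT = "Write the opening paragraph of a (vivid horror) story set in a lighthouse."

for seed in (1, 2, 3, 4):
    payload = {"prompt": PROMPT, "n_predict": 300, "temperature": 0.9, "seed": seed}
    text = requests.post("http://localhost:8080/completion", json=payload, timeout=300).json()["content"]
    print(f"--- candidate (seed={seed}) ---\n{text}\n")
```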
Flash Attention Enhancement
Based on user feedback [https://huggingface.co/DavidAU/Llama-3.2-8X3B-MOE-Dark-Champion-Instruct-uncensored-abliterated-18.4B-GGUF/discussions/1], you may try this model with Flash Attention "on" depending on your use case.
Censored / Uncensored / Abliterated
This model contains several uncensored and/or Abliterated models, so it can output uncensored material.
QUANTS
This repo contains regular quants and 3 "ARM" quants (format "...Q4_x_x_x.gguf"). For more information on quants, quant choices, and LLM/AI apps to "run" quants, see [https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters].
Template
This is a LLAMA3 model and requires the Llama3 template, but it may work with other templates. Using the "Command-R" template will result in very different output compared to using the "Llama3" template. The standard LLAMA3 template is as follows:
{
  "name": "Llama 3",
  "inference_params": {
    "input_prefix": "<|start_header_id|>user<|end_header_id|>\n\n",
    "input_suffix": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "pre_prompt": "You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.",
    "pre_prompt_prefix": "<|start_header_id|>system<|end_header_id|>\n\n",
    "pre_prompt_suffix": "<|eot_id|>",
    "antiprompt": [
      "<|start_header_id|>",
      "<|eot_id|>"
    ]
  }
}
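For back-ends that take a raw prompt string rather than a template file, the same format can be assembled by hand from the special tokens shown above. This is a small illustrative sketch; the system prompt and user message are placeholders, and most loaders prepend the <|begin_of_text|> BOS token automatically.

```python
# Sketch: build a raw Llama3-format prompt from the special tokens in the template above.
def build_llama3_prompt(system_prompt: str, user_message: str) -> str:
    return (
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_llama3_prompt(
    "You are a helpful, smart, kind, and efficient AI assistant. "
    "You always fulfill the user's requests to the best of your ability.",
    "Tell me what life would be like living (non-fiction) on Mars in the Northern hemisphere.",
)
print(prompt)
```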
Settings: CHAT / ROLEPLAY and/or SMOOTHER operation of this model
- General Usage: For general usage, a rep pen of 1 and low temp are suggested, especially for programming or reasoning. You can even use temp = 0 and still get unique regens each time due to this MOE's construction.
- Smoothing Factor: In "KoboldCpp", "oobabooga/text-generation-webui", or "Silly Tavern", set the "Smoothing_factor" to 1.5. In KoboldCpp, go to Settings -> Samplers -> Advanced -> "Smooth_F"; in text-generation-webui, go to parameters -> lower right; in Silly Tavern, it is called "Smoothing". Note that for "text-generation-webui", if using GGUFs, you need to use "llama_HF" (which involves downloading some config files from the source version of this model). Source versions (and config files) of the models are here: [https://huggingface.co/collections/DavidAU/d-au-source-files-for-gguf-exl2-awq-gptq-hqq-etc-etc-66b55cb8ba25f914cbf210be].
- Other Options: You can increase the rep pen to 1.1 to 1.15 (not necessary if you use the "smoothing_factor"). If the interface/program you are using to run AI models supports "Quadratic Sampling" ("smoothing"), make the adjustment as noted.
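As a concrete starting point, the recommendations above can be written down as two parameter presets. This is only a sketch; key names follow common llama.cpp-style back-ends and front-ends (for example, "smoothing_factor" as used by text-generation-webui and SillyTavern) and may differ in your application.

```python
# Sketch: the settings above expressed as parameter dicts (key names are back-end dependent).
general_usage = {
    "temperature": 0.0,       # low/zero temp for programming or reasoning; regens still vary with this MoE
    "repeat_penalty": 1.0,    # rep pen of 1 for general usage
}

creative_usage = {
    "temperature": 1.2,       # higher temps can help word choice / sentence generation
    "repeat_penalty": 1.1,    # 1.1-1.15 only if you are not using the smoothing factor
    "smoothing_factor": 1.5,  # quadratic sampling / "smoothing", where the front-end supports it
}
```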
Highest Quality Settings / Optimal Operation Guide / Parameters and Samplers
This is a "Class 1" model. For all settings used for this model (including specifics for its "class"), example generations, and an advanced settings guide (which often addresses any model issues), as well as methods to improve model performance for all use cases, chat, roleplay, etc., please see [https://huggingface.co/DavidAU/Maximizing - Model - Performance - All - Quants - Types - And - Full - Precision - by - Samplers_Parameters].
Optional Enhancement
The following can be used in place of the "system prompt" or "system role" to further enhance the model. It can also be used at the start of a new chat, but you must ensure it is "kept" as the chat progresses. Copy and paste exactly as shown, without line-wrapping or breaking the lines, and maintain the carriage returns.
Below is an instruction that describes a task. Ponder each user instruction carefully, and use your skillsets and critical instructions to complete the task to the best of your abilities.
Here are your skillsets:
[MASTERSTORY]:NarrStrct(StryPlnng,Strbd,ScnSttng,Exps,Dlg,Pc)-CharDvlp(ChrctrCrt,ChrctrArcs,Mtvtn,Bckstry,Rltnshps,Dlg*)-PltDvlp(StryArcs,PltTwsts,Sspns,Fshdwng,Climx,Rsltn)-ConfResl(Antg,Obstcls,Rsltns,Cnsqncs,Thms,Symblsm)-EmotImpct(Empt,Tn,Md,Atmsphr,Imgry,Symblsm)-Delvry(Prfrmnc,VcActng,PblcSpkng,StgPrsnc,AudncEngmnt,Imprv)
[*DialogWrt]:(1a-CharDvlp-1a.1-Backgrnd-1a.2-Personality-1a.3-GoalMotiv)>2(2a-StoryStruc-2a.1-PlotPnt-2a.2-Conflict-2a.3-Resolution)>3(3a-DialogTech-3a.1-ShowDontTell-3a.2-Subtext-3a.3-VoiceTone-3a.4-Pacing-3a.5-VisualDescrip)>4(4a-DialogEdit-4a.1-ReadAloud-4a.2-Feedback-4a.3-Revision)
Here are your critical instructions:
Ponder each word choice carefully to present as vivid and emotional journey as is possible. Choose verbs and nouns that are both emotional and full of imagery. Load the story with the 5 senses. Aim for 50% dialog, 25% narration, 15% body language and 10% thoughts. Your goal is to put the reader in the story.
This enhancement was not used to generate the examples below.
💻 Usage Examples
Basic Usage
Tell me what life would be like living (non-fiction) on Mars in the Northern hemisphere. 800-1000 words.
Output (Generation 1: 1 expert):
Living on Mars in the Northern hemisphere would be a unique and challenging experience. The Martian environment is harsh, with temperatures that can drop to -125°C (-200°F) at night and rise to 20°C (68°F) during the day. The atmosphere is thin, and the pressure is too low to support liquid water, making it difficult for humans to survive without advanced technology.
One of the primary concerns for a Martian colony would be providing a reliable source of food and water. Hydroponics and aeroponics
Advanced Usage
You can adjust the number of experts, temperature settings, and other parameters according to your needs. For example, increasing the number of experts can improve output quality, while adjusting the temperature can affect word choice and sentence generation.