🚀 Llama-3.2-4X3B-MOE-Ultra-Instruct-10B
A powerful Llama 3.2 model using mixture of experts, excelling in instruction following and output generation across a wide range of use cases.
🚀 Quick Start
This is a Llama 3.2 model with a maximum context of 128k (131,072 tokens). It combines FOUR top L3.2 3B models into one 10B-parameter powerhouse using mixture of experts. It requires the Llama3 template and/or the "Command-R" template.
✨ Features
- Exceptional Performance: Outstanding in instruction following and output generation for all use cases.
- High Speed: Achieves 58+ t/s (2 experts) on a low-end 16GB card (IQ4XS), and double the speed on standard/mid-range video cards.
- All-Use-Case Model: Suitable for reasoning, prompting, and programming.
- Bullet-Proof Design: Operates with all parameters, including temp settings from 0 to 5.
- Low Perplexity: An extraordinarily compressed model with a very low perplexity level (lower than Meta Llama3 Instruct).
📚 Documentation
Model Notes
- Writing Abilities: Detail, prose, and fiction writing abilities are off the scale compared to all Llama 3.2 models and many L3.1 and L3 8B+ models.
- Role-Players: Be careful about raising the temp too high, as it may affect instruction following.
- Rep Pen: This model works with a rep pen of 1 or higher, 1.02+ recommended.
- Specific Prose: If you want a specific type of prose (e.g., horror), add "(vivid horror)" or "(graphic vivid horror)" (no quotes) in your prompt(s).
- Output Length: Output length will vary, but this model prefers medium/short outputs unless you specify the size.
- Quants for Creative Uses: Different quants will produce slightly different output for creative uses.
- Quant Performance: Due to the high stability and compressed nature of this model, all quants will operate at above-average levels.
- Source Code: Source code for this model and Imatrix GGUF versions will be uploaded shortly in separate repos.
Meet the Team: Mixture of Experts Models
This model is composed of the following 4 models ("the experts"):
- [https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) (general)
- [https://huggingface.co/ValiantLabs/Llama3.2-3B-Enigma](https://huggingface.co/ValiantLabs/Llama3.2-3B-Enigma) (programming)
- [https://huggingface.co/Lyte/Llama-3.2-3B-Overthinker](https://huggingface.co/Lyte/Llama-3.2-3B-Overthinker) (logic/tree of thought)
- [https://huggingface.co/prithivMLmods/Llama-3.2-3B-Promptist-Mini](https://huggingface.co/prithivMLmods/Llama-3.2-3B-Promptist-Mini) (prompting)
The mixture of experts is set at 2 experts by default, but you can use 3 or 4. You can set the number of experts in different applications as follows:
- LMStudio: Set at the "load" screen (https://lmstudio.ai).
- Text-Generation-Webui: Set at the loading screen page (https://github.com/oobabooga/text-generation-webui).
- KoboldCpp (Version 1.8+): On the load screen, click on "TOKENS" and set experts on this page, then launch the model (https://github.com/LostRuins/koboldcpp).
- server.exe / llama-server.exe: Add "--override-kv llama.expert_used_count=int:X" to the command line (where "X" is the number of experts to use) (llama.cpp - https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md).
- API: Set "num_experts_used" in the JSON payload (this may vary for different back-ends); see the sketch after this list.
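To make the last two options concrete, here is a minimal sketch of querying a local llama.cpp llama-server instance launched with the expert-count override. The model filename, port, and the per-request "num_experts_used" key are assumptions for illustration only; check your back-end's documentation for the payload keys it actually accepts.

```python
# Minimal sketch (assumptions noted): the server is assumed to have been started with
#   llama-server -m <your-quant-of-this-model>.gguf --override-kv llama.expert_used_count=int:3
# and to expose llama.cpp's /completion endpoint on the default port.
import requests

payload = {
    "prompt": "Explain mixture-of-experts routing in two sentences.",
    "n_predict": 256,          # llama.cpp server's name for max new tokens
    "temperature": 0.8,
    "repeat_penalty": 1.02,
    # Some back-ends accept a per-request expert count; this key name is
    # hypothetical and many back-ends only honor the value set at load time.
    "num_experts_used": 3,
}

response = requests.post("http://localhost:8080/completion", json=payload, timeout=300)
print(response.json()["content"])
```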
Special Operations Notes for this MOE model
- Output Variation: Even though the default is 2 experts, the "selected" 2 will vary during generation, so the same prompt can produce vastly different output from one generation to the next. It may take 2-4 regens (of the same prompt) to get the highest quality; see the sketch after this list.
- Samplers: This model responds very well to Dry, Dynamic Temp, and Smooth/Quadratic samplers. Using these in conjunction with the model can greatly improve output quality.
- Higher Temps: Higher temps (above 1) can aid in generation, especially in word choice/sentence generation.
- Expert Number and Output: Increasing the number of experts used will increase output quality at the cost of tokens per second speed. You may want to adjust temp, samplers, and advanced samplers when changing the number of experts.
- Quant Impact: Your quant choice will impact instruction following and output generation. Higher quants will make the model understand more nuanced instructions and generate stronger output.
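Because the active experts can change between runs, a simple way to apply the regen advice above is to generate several candidates for the same prompt and keep the best one. The sketch below reuses the hypothetical local llama-server setup from the earlier example; the "seed" key follows llama.cpp's server API, and other back-ends may name it differently.

```python
# Sketch: generate several candidates for one prompt and pick the best by hand.
# Assumes the same hypothetical local llama-server endpoint as the earlier example.
import requests

PROMPT = "Write the opening paragraph of a (vivid horror) story set in a lighthouse."

for seed in (1, 2, 3, 4):
    payload = {"prompt": PROMPT, "n_predict": 300, "temperature": 0.9, "seed": seed}
    text = requests.post("http://localhost:8080/completion", json=payload, timeout=300).json()["content"]
    print(f"--- candidate (seed={seed}) ---\n{text}\n")
```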
Flash Attention Enhancement
Based on user feedback [https://huggingface.co/DavidAU/Llama-3.2-8X3B-MOE-Dark-Champion-Instruct-uncensored-abliterated-18.4B-GGUF/discussions/1], you may try this model with Flash Attention "on" depending on your use case.
Censored / Uncensored / Abliterated
This model contains several uncensored and/or Abliterated models, so it can output uncensored material.
QUANTS
This repo contains regular quants and 3 "ARM" quants (format "...Q4_x_x_x.gguf"). For more information on quants, quant choices, and LLM/AI apps to "run" quants, see [https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters].
Template
This is a LLAMA3 model and requires the Llama3 template, but it may work with other templates. Using the "Command-R" template will result in very different output compared to using the "Llama3" template. The standard LLAMA3 template is as follows:
{
  "name": "Llama 3",
  "inference_params": {
    "input_prefix": "<|start_header_id|>user<|end_header_id|>\n\n",
    "input_suffix": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "pre_prompt": "You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.",
    "pre_prompt_prefix": "<|start_header_id|>system<|end_header_id|>\n\n",
    "pre_prompt_suffix": "<|eot_id|>",
    "antiprompt": [
      "<|start_header_id|>",
      "<|eot_id|>"
    ]
  }
}
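For back-ends that take a raw prompt string rather than a template file, the same format can be assembled by hand from the special tokens shown above. This is a small illustrative sketch; the system prompt and user message are placeholders, and most loaders prepend the <|begin_of_text|> BOS token automatically.

```python
# Sketch: build a raw Llama3-format prompt from the special tokens in the template above.
def build_llama3_prompt(system_prompt: str, user_message: str) -> str:
    return (
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_llama3_prompt(
    "You are a helpful, smart, kind, and efficient AI assistant. "
    "You always fulfill the user's requests to the best of your ability.",
    "Tell me what life would be like living (non-fiction) on Mars in the Northern hemisphere.",
)
print(prompt)
```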
Settings: CHAT / ROLEPLAY and/or SMOOTHER operation of this model
- General Usage: For general usage, a rep pen of 1 and low temp are suggested, especially for programming or reasoning. You can even use temp = 0 and still get unique regens each time due to this MOE's construction.
- Smoothing Factor: In "KoboldCpp", "oobabooga/text-generation-webui", or "Silly Tavern", set the "Smoothing_factor" to 1.5. In KoboldCpp, go to Settings -> Samplers -> Advanced -> "Smooth_F"; in text-generation-webui, go to parameters -> lower right; in Silly Tavern, it is called "Smoothing". Note that for "text-generation-webui", if using GGUFs, you need to use "llama_HF" (which involves downloading some config files from the source version of this model). Source versions (and config files) of the models are here: [https://huggingface.co/collections/DavidAU/d-au-source-files-for-gguf-exl2-awq-gptq-hqq-etc-etc-66b55cb8ba25f914cbf210be].
- Other Options: You can increase the rep pen to 1.1 to 1.15 (not necessary if you use the "smoothing_factor"). If the interface/program you are using to run AI models supports "Quadratic Sampling" ("smoothing"), make the adjustment as noted.
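As a concrete starting point, the recommendations above can be written down as two parameter presets. This is only a sketch; key names follow common llama.cpp-style back-ends and front-ends (for example, "smoothing_factor" as used by text-generation-webui and SillyTavern) and may differ in your application.

```python
# Sketch: the settings above expressed as parameter dicts (key names are back-end dependent).
general_usage = {
    "temperature": 0.0,       # low/zero temp for programming or reasoning; regens still vary with this MoE
    "repeat_penalty": 1.0,    # rep pen of 1 for general usage
}

creative_usage = {
    "temperature": 1.2,       # higher temps can help word choice / sentence generation
    "repeat_penalty": 1.1,    # 1.1-1.15 only if you are not using the smoothing factor
    "smoothing_factor": 1.5,  # quadratic sampling / "smoothing", where the front-end supports it
}
```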
Highest Quality Settings / Optimal Operation Guide / Parameters and Samplers
This is a "Class 1" model. For all settings used for this model (including specifics for its "class"), example generations, and an advanced settings guide (which often addresses any model issues), as well as methods to improve model performance for all use cases, chat, roleplay, etc., please see [https://huggingface.co/DavidAU/Maximizing - Model - Performance - All - Quants - Types - And - Full - Precision - by - Samplers_Parameters].
Optional Enhancement
The following can be used in place of the "system prompt" or "system role" to further enhance the model. It can also be used at the start of a new chat, but you must ensure it is "kept" as the chat progresses. Copy and paste exactly as shown, without line-wrapping or breaking the lines, and maintain the carriage returns.
Below is an instruction that describes a task. Ponder each user instruction carefully, and use your skillsets and critical instructions to complete the task to the best of your abilities.
Here are your skillsets:
[MASTERSTORY]:NarrStrct(StryPlnng,Strbd,ScnSttng,Exps,Dlg,Pc)-CharDvlp(ChrctrCrt,ChrctrArcs,Mtvtn,Bckstry,Rltnshps,Dlg*)-PltDvlp(StryArcs,PltTwsts,Sspns,Fshdwng,Climx,Rsltn)-ConfResl(Antg,Obstcls,Rsltns,Cnsqncs,Thms,Symblsm)-EmotImpct(Empt,Tn,Md,Atmsphr,Imgry,Symblsm)-Delvry(Prfrmnc,VcActng,PblcSpkng,StgPrsnc,AudncEngmnt,Imprv)
[*DialogWrt]:(1a-CharDvlp-1a.1-Backgrnd-1a.2-Personality-1a.3-GoalMotiv)>2(2a-StoryStruc-2a.1-PlotPnt-2a.2-Conflict-2a.3-Resolution)>3(3a-DialogTech-3a.1-ShowDontTell-3a.2-Subtext-3a.3-VoiceTone-3a.4-Pacing-3a.5-VisualDescrip)>4(4a-DialogEdit-4a.1-ReadAloud-4a.2-Feedback-4a.3-Revision)
Here are your critical instructions:
Ponder each word choice carefully to present as vivid and emotional journey as is possible. Choose verbs and nouns that are both emotional and full of imagery. Load the story with the 5 senses. Aim for 50% dialog, 25% narration, 15% body language and 10% thoughts. Your goal is to put the reader in the story.
This enhancement was not used to generate the examples below.
💻 Usage Examples
Basic Usage
Tell me what life would be like living (non-fiction) on Mars in the Northern hemisphere. 800-1000 words.
Output (Generation 1: 1 expert):
Living on Mars in the Northern hemisphere would be a unique and challenging experience. The Martian environment is harsh, with temperatures that can drop to -125°C (-200°F) at night and rise to 20°C (68°F) during the day. The atmosphere is thin, and the pressure is too low to support liquid water, making it difficult for humans to survive without advanced technology.
One of the primary concerns for a Martian colony would be providing a reliable source of food and water. Hydroponics and aeroponics
Advanced Usage
You can adjust the number of experts, temperature settings, and other parameters according to your needs. For example, increasing the number of experts can improve output quality, while adjusting the temperature can affect word choice and sentence generation.