# Mythalion 13B - GGUF

This repository provides GGUF format model files for Mythalion 13B, a text-generation model. It offers a range of quantised files suitable for different use cases and hardware setups.
## Quick Start

### Downloading the Model
- Using Clients/Libraries: Tools such as LM Studio, LoLLMS Web UI, and Faraday.dev can download models automatically, presenting a list of available models for you to choose from.
- In text-generation-webui: Under "Download Model", enter the model repo `TheBloke/Mythalion-13B-GGUF` and a filename (e.g. `mythalion-13b.q4_K_M.gguf`), then click "Download".
- On the Command Line: First, install the `huggingface-hub` Python library:

```shell
pip3 install huggingface-hub>=0.17.1
```

Then download an individual model file (a Python alternative is sketched just below):

```shell
huggingface-cli download TheBloke/Mythalion-13B-GGUF mythalion-13b.q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
```
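If you prefer to stay in Python, the same download can be done with the `huggingface_hub` library. The sketch below is illustrative rather than prescriptive; it reuses the repo and filename from the command above, and the `local_dir` choice is arbitrary.

```python
# Minimal sketch: download one GGUF file with huggingface_hub
# (same repo and filename as the CLI example above; local_dir is illustrative).
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Mythalion-13B-GGUF",
    filename="mythalion-13b.q4_K_M.gguf",
    local_dir=".",                 # where to place the downloaded file
    local_dir_use_symlinks=False,  # store a real copy rather than a symlink
)
print(model_path)  # path to the downloaded .gguf file
```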
### Running the Model

Example `llama.cpp` command. Ensure you are using `llama.cpp` from commit d0cee0d36d5be95a0d9088b674dbb27354107221 or later.

```shell
./main -ngl 32 -m mythalion-13b.q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{prompt}\n\n### Response:"
```
- Adjust `-ngl 32` to the number of layers you want to offload to the GPU. Remove it if you don't have GPU acceleration.
- Modify `-c 4096` to set the desired sequence length. For extended-sequence models, llama.cpp automatically reads the necessary RoPE scaling parameters from the GGUF file and sets them.
- To have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`.
### Running in text-generation-webui

Refer to [text-generation-webui/docs/llama.cpp.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp.md) for further instructions.

### Running from Python code

You can use GGUF models from Python with the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) or ctransformers libraries.
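As a rough illustration of the llama-cpp-python route, the sketch below loads the file downloaded earlier and runs a single completion; the path, context size, and GPU layer count mirror the llama.cpp example above and should be adjusted to your hardware.

```python
# Sketch of running this GGUF model with llama-cpp-python; the path and
# parameters mirror the llama.cpp CLI example above and are not prescriptive.
from llama_cpp import Llama

llm = Llama(
    model_path="./mythalion-13b.q4_K_M.gguf",
    n_ctx=4096,       # sequence length, as with -c 4096
    n_gpu_layers=32,  # layers offloaded to the GPU, as with -ngl 32; use 0 for CPU only
)

prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nWrite a short poem about llamas.\n\n### Response:"
)
output = llm(prompt, max_tokens=256, temperature=0.7, repeat_penalty=1.1)
print(output["choices"][0]["text"])
```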
## Features

- Multiple Quantisation Options: Offers a range of quantised files (e.g. Q2_K, Q3_K, Q4_K) to balance model size against quality.
- Broad Compatibility: Works with many clients and libraries, such as llama.cpp, text-generation-webui, and KoboldCpp.
- Flexible Usage: Can be used for both instruction-following tasks and chat-style conversations.
## Installation

### Installing Dependencies for Download

To download models on the command line, install the `huggingface-hub` Python library:

```shell
pip3 install huggingface-hub>=0.17.1
```

To accelerate downloads on fast connections, install `hf_transfer`:

```shell
pip3 install hf_transfer
```
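`hf_transfer` is switched on through an environment variable rather than a code change. A minimal sketch, assuming you download from Python with `huggingface_hub` as shown earlier:

```python
# Sketch: enable hf_transfer for faster downloads by setting the environment
# variable before importing huggingface_hub (requires `pip3 install hf_transfer`).
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="TheBloke/Mythalion-13B-GGUF",
    filename="mythalion-13b.q4_K_M.gguf",
    local_dir=".",
)
```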
### Installing Dependencies for Python Usage

If you want to use the model from Python:

- For `ctransformers` without GPU acceleration:

```shell
pip install ctransformers>=0.2.24
```

- With CUDA GPU acceleration:

```shell
pip install ctransformers[cuda]>=0.2.24
```

- With ROCm GPU acceleration:

```shell
CT_HIPBLAS=1 pip install ctransformers
```
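With any of the builds above installed, loading the GGUF file through ctransformers looks roughly like the sketch below; the chosen model file and `gpu_layers` value are illustrative.

```python
# Rough sketch of text generation with ctransformers and this repo's GGUF files;
# model_file and gpu_layers are illustrative choices, not requirements.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mythalion-13B-GGUF",
    model_file="mythalion-13b.q4_K_M.gguf",
    model_type="llama",
    gpu_layers=50,  # set to 0 if you installed the CPU-only build
)

print(llm("### Instruction:\nWrite a haiku about llamas.\n\n### Response:"))
```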
## Usage Examples

### Basic Usage in llama.cpp

```shell
./main -ngl 32 -m mythalion-13b.q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{prompt}\n\n### Response:"
```

### Advanced Usage - Chat-style Conversation in llama.cpp

```shell
./main -ngl 32 -m mythalion-13b.q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins
```
## Documentation

### About GGUF
GGUF is a new format introduced by the llama.cpp team on August 21st, 2023. It replaces GGML, which is no longer supported by llama.cpp. GGUF has several advantages over GGML, including better tokenization, support for special tokens, metadata support, and extensibility.
Here is a list of clients and libraries known to support GGUF:
- llama.cpp: The source project for GGUF, offering a CLI and a server option.
- [text-generation-webui](https://github.com/oobabooga/text-generation-webui): A widely used web UI with many features and powerful extensions, supporting GPU acceleration.
- KoboldCpp: A fully-featured web UI with GPU acceleration across all platforms and GPU architectures, great for storytelling.
- LM Studio: An easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration.
- [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui): A web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev: An attractive and easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- ctransformers: A Python library with GPU acceleration, LangChain support, and an OpenAI-compatible AI server.
- [llama-cpp-python](https://github.com/abetlen/llama-cpp-python): A Python library with GPU acceleration, LangChain support, and an OpenAI-compatible API server.
- candle: A Rust ML framework focusing on performance, including GPU support and ease of use.
### Repositories available

- [AWQ model(s) for GPU inference](https://huggingface.co/TheBloke/Mythalion-13B-AWQ)
- [GPTQ models for GPU inference, with multiple quantisation parameter options](https://huggingface.co/TheBloke/Mythalion-13B-GPTQ)
- [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/Mythalion-13B-GGUF)
- [PygmalionAI's original unquantised fp16 model in PyTorch format, for GPU inference and for further conversions](https://huggingface.co/PygmalionAI/mythalion-13b)
### Prompt template

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{prompt}

### Response:
```
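Filling the template is plain string substitution. A small sketch with an illustrative instruction:

```python
# Sketch: build the full prompt string from the template above.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{prompt}\n\n### Response:"
)

full_prompt = PROMPT_TEMPLATE.format(prompt="Summarise the GGUF format in one paragraph.")
print(full_prompt)
```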
### Compatibility

These quantised GGUFv2 files are compatible with llama.cpp from August 27th 2023 onwards, as of commit d0cee0d36d5be95a0d9088b674dbb27354107221. They are also compatible with many third-party UIs and libraries; refer to the list at the top of this README.
### Explanation of quantisation methods
The new methods available are:

- GGML_TYPE_Q2_K: "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K: "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K: "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K: "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K: "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
Refer to the Provided Files table below to see what files use which methods, and how.
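As a worked example of where these bits-per-weight figures come from, the arithmetic below reproduces the 4.5 bpw figure for GGML_TYPE_Q4_K, assuming (for illustration) that each block stores a 6-bit scale and a 6-bit min and that the super-block adds two fp16 factors:

```python
# Illustrative bpw arithmetic for GGML_TYPE_Q4_K, assuming 6-bit per-block
# scales and mins plus two fp16 super-block factors.
blocks_per_superblock = 8
weights_per_block = 32
weights = blocks_per_superblock * weights_per_block   # 256 weights per super-block

quant_bits = 4 * weights                              # 4-bit quants: 1024 bits
scale_min_bits = blocks_per_superblock * (6 + 6)      # per-block scale + min: 96 bits
superblock_bits = 2 * 16                              # fp16 super-block scale and min: 32 bits

bpw = (quant_bits + scale_min_bits + superblock_bits) / weights
print(bpw)  # 4.5, matching the Q4_K figure above
```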
### Provided files

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ---- |
| [mythalion-13b.Q2_K.gguf](https://huggingface.co/TheBloke/Mythalion-13B-GGUF/blob/main/mythalion-13b.Q2_K.gguf) | Q2_K | 2 | 5.43 GB | 7.93 GB | smallest, significant quality loss - not recommended for most purposes |
| [mythalion-13b.Q3_K_S.gguf](https://huggingface.co/TheBloke/Mythalion-13B-GGUF/blob/main/mythalion-13b.Q3_K_S.gguf) | Q3_K_S | 3 | 5.66 GB | 8.16 GB | very small, high quality loss |
| [mythalion-13b.Q3_K_M.gguf](https://huggingface.co/TheBloke/Mythalion-13B-GGUF/blob/main/mythalion-13b.Q3_K_M.gguf) | Q3_K_M | 3 | 6.34 GB | 8.84 GB | very small, high quality loss |
| [mythalion-13b.Q3_K_L.gguf](https://huggingface.co/TheBloke/Mythalion-13B-GGUF/blob/main/mythalion-13b.Q3_K_L.gguf) | Q3_K_L | 3 | 6.93 GB | 9.43 GB | small, substantial quality loss |
| [mythalion-13b.Q4_0.gguf](https://huggingface.co/TheBloke/Mythalion-13B-GGUF/blob/main/mythalion-13b.Q4_0.gguf) | Q4_0 | 4 | 7.37 GB | 9.87 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| [mythalion-13b.Q4_K_S.gguf](https://huggingface.co/TheBloke/Mythalion-13B-GGUF/blob/main/mythalion-13b.Q4_K_S.gguf) | Q4_K_S | 4 | 7.41 GB | 9.91 GB | small, greater quality loss |
| [mythalion-13b.Q4_K_M.gguf](https://huggingface.co/TheBloke/Mythalion-13B-GGUF/blob/main/mythalion-13b.Q4_K_M.gguf) | Q4_K_M | 4 | 7.87 GB | 10.37 GB | medium, balanced quality - recommended |
| [mythalion-13b.Q5_0.gguf](https://huggingface.co/TheBloke/Mythalion-13B-GGUF/blob/main/mythalion-13b.Q5_0.gguf) | Q5_0 | 5 | 8.97 GB | 11.47 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| [mythalion-13b.Q5_K_S.gguf](https://huggingface.co/TheBloke/Mythalion-13B-GGUF/blob/main/mythalion-13b.Q5_K_S.gguf) | Q5_K_S | 5 | 8.97 GB | 11.47 GB | large, low quality loss - recommended |
| [mythalion-13b.Q5_K_M.gguf](https://huggingface.co/TheBloke/Mythalion-13B-GGUF/blob/main/mythalion-13b.Q5_K_M.gguf) | Q5_K_M | 5 | 9.23 GB | 11.73 GB | large, very low quality loss - recommended |
| [mythalion-13b.Q6_K.gguf](https://huggingface.co/TheBloke/Mythalion-13B-GGUF/blob/main/mythalion-13b.Q6_K.gguf) | Q6_K | 6 | 10.68 GB | 13.18 GB | very large, extremely low quality loss |
| [mythalion-13b.Q8_0.gguf](https://huggingface.co/TheBloke/Mythalion-13B-GGUF/blob/main/mythalion-13b.Q8_0.gguf) | Q8_0 | 8 | 13.83 GB | 16.33 GB | very large, extremely low quality loss - not recommended |
Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
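In the table above, each "Max RAM required" figure is the file size plus 2.50 GB of overhead. A trivial sketch of that rule of thumb, which only applies to these figures and assumes no GPU offload:

```python
# Rule of thumb read off the table above: max RAM = file size + 2.50 GB
# (no GPU offload). Purely illustrative and specific to the figures listed here.
def estimate_max_ram_gb(file_size_gb: float, overhead_gb: float = 2.50) -> float:
    return file_size_gb + overhead_gb

print(estimate_max_ram_gb(7.87))  # 10.37 GB, matching the Q4_K_M row
```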
## Technical Details

The GGUF format is designed to be more efficient and feature-rich than the deprecated GGML format. It allows for better tokenization, support for special tokens, and metadata storage. Different quantisation methods are used to balance model size against quality: lower-bit methods such as Q2_K produce smaller files but may incur significant quality loss, while higher-bit methods such as Q6_K and Q8_0 have extremely low quality loss at the cost of larger files.
## License

The model is licensed under the Llama 2 license.

