🚀 Orca 2 13B - GGUF
This repository provides GGUF format model files for Microsoft's Orca 2 13B, facilitating various inference scenarios.
🚀 Quick Start
These model files were quantized using hardware kindly provided by Massed Compute.
✨ Features
- Multiple Inference Options: AWQ and GPTQ models for GPU inference, plus 2-, 3-, 4-, 5-, 6- and 8-bit GGUF models for CPU+GPU inference.
- Wide Compatibility: Works with llama.cpp from August 27th, 2023 onwards, and with many third-party UIs and libraries.
- Diverse Quantization Methods: Several quantization methods are provided to balance model size against quality.
📦 Installation
Downloading GGUF Files
- Manual Downloaders: Avoid cloning the entire repo. Most users only need to pick a single file.
- Automated Download: Clients like LM Studio, LoLLMS Web UI, and Faraday.dev can automatically download models.
In text-generation-webui
Under Download Model, enter the model repo TheBloke/Orca-2-13B-GGUF and a specific filename (e.g. orca-2-13b.Q4_K_M.gguf), then click Download.
On the Command Line
pip3 install huggingface-hub
Download an individual model file:
huggingface-cli download TheBloke/Orca-2-13B-GGUF orca-2-13b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
Download multiple files with a pattern:
huggingface-cli download TheBloke/Orca-2-13B-GGUF --local-dir . --local-dir-use-symlinks False --include='*Q4_K*gguf'
To accelerate downloads on fast connections, install hf_transfer:
pip3 install hf_transfer
Set the environment variable:
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download TheBloke/Orca-2-13B-GGUF orca-2-13b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
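The same download can also be scripted from Python using the huggingface_hub library installed above; this is a minimal sketch (the repo and filename match the CLI example):

from huggingface_hub import hf_hub_download

# Download a single GGUF file into the current directory
model_path = hf_hub_download(
    repo_id="TheBloke/Orca-2-13B-GGUF",
    filename="orca-2-13b.Q4_K_M.gguf",
    local_dir=".",
)
print(model_path)  # local path to the downloaded .gguf file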
💻 Usage Examples
Example llama.cpp command
./main -ngl 32 -m orca-2-13b.Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant"
- Change -ngl 32 to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
- Change -c 4096 to the desired sequence length.
- For a chat-style conversation, replace the -p <PROMPT> argument with -i -ins.
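The same invocation can be reproduced from Python with llama-cpp-python (listed among compatible libraries below). This is a sketch, assuming the package is installed with pip install llama-cpp-python; the system message and user prompt are placeholders:

from llama_cpp import Llama

# Mirror the llama.cpp flags: -ngl 32 -> n_gpu_layers, -c 4096 -> n_ctx
llm = Llama(
    model_path="orca-2-13b.Q4_K_M.gguf",
    n_gpu_layers=32,  # set to 0 if you have no GPU acceleration
    n_ctx=4096,
)

prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWrite a haiku about llamas.<|im_end|>\n"
    "<|im_start|>assistant"
)

output = llm(prompt, max_tokens=256, temperature=0.7, repeat_penalty=1.1, stop=["<|im_end|>"])
print(output["choices"][0]["text"])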
How to run in text-generation-webui
Refer to [text-generation-webui/docs/04 ‐ Model Tab.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/04%20%E2%80%90%20Model%20Tab.md#llamacpp) for further instructions.
How to run from Python code
Using ctransformers
Install the package:
# Base ctransformers with no GPU acceleration
pip install ctransformers
# Or with CUDA GPU acceleration
pip install ctransformers[cuda]
# Or with AMD ROCm GPU acceleration (Linux only)
CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers
# Or with Metal GPU acceleration for macOS systems only
CT_METAL=1 pip install ctransformers --no-binary ctransformers
Simple example code:
from ctransformers import AutoModelForCausalLM
# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Orca-2-13B-GGUF", model_file="orca-2-13b.Q4_K_M.gguf", model_type="llama", gpu_layers=50)
print(llm("AI is going to"))
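ctransformers can also stream tokens and accepts common generation parameters. A sketch continuing from the llm object above (parameter names as documented by ctransformers; adjust values to taste):

# Stream tokens as they are generated instead of waiting for the full string
for token in llm("AI is going to", max_new_tokens=128, temperature=0.7, repetition_penalty=1.1, stream=True):
    print(token, end="", flush=True)
print()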
📚 Documentation
About GGUF
GGUF is a new format introduced by the llama.cpp team on August 21st, 2023. It replaces GGML, which is no longer supported by llama.cpp.
Clients and libraries known to support GGUF include:
- llama.cpp
- [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
- KoboldCpp
- LM Studio
- [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui)
- Faraday.dev
- ctransformers
- [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
- candle
Repositories available
- [AWQ model(s) for GPU inference](https://huggingface.co/TheBloke/Orca-2-13B-AWQ)
- [GPTQ models for GPU inference, with multiple quantisation parameter options](https://huggingface.co/TheBloke/Orca-2-13B-GPTQ)
- [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/Orca-2-13B-GGUF)
- [Microsoft's original unquantised fp16 model in PyTorch format, for GPU inference and for further conversions](https://huggingface.co/microsoft/Orca-2-13b)
Prompt template: ChatML
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
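For programmatic use, the template can be filled with a small helper. This is a hypothetical sketch (format_chatml is not part of any library referenced here):

def format_chatml(system_message: str, prompt: str) -> str:
    """Fill the ChatML prompt template shown above."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        f"<|im_start|>assistant"
    )

print(format_chatml("You are a helpful assistant.", "Why is the sky blue?"))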
Compatibility
These quantized GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third-party UIs and libraries.
Explanation of quantisation methods
The new methods available are:
- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
Refer to the Provided Files table below to see what files use which methods, and how.
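As a quick check of the Q4_K figure above, 4.5 bpw follows from the layout described in the list, assuming the super-block also stores one fp16 scale and one fp16 min (the layout commonly documented for llama.cpp k-quants):

# Q4_K: super-blocks of 8 blocks x 32 weights = 256 weights
weights = 8 * 32
quant_bits = weights * 4            # 4-bit quantized weights
block_meta_bits = 8 * 6 + 8 * 6     # 6-bit scale and 6-bit min per block
super_block_bits = 2 * 16           # fp16 scale and fp16 min per super-block (assumed)
total_bits = quant_bits + block_meta_bits + super_block_bits
print(total_bits / weights)         # -> 4.5 bits per weight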
Provided files
Name | Quant method | Bits | Size | Max RAM required | Use case |
---|---|---|---|---|---|
[orca-2-13b.Q2_K.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q2_K.gguf) | Q2_K | 2 | 5.43 GB | 7.93 GB | smallest, significant quality loss - not recommended for most purposes |
[orca-2-13b.Q3_K_S.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q3_K_S.gguf) | Q3_K_S | 3 | 5.66 GB | 8.16 GB | very small, high quality loss |
[orca-2-13b.Q3_K_M.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q3_K_M.gguf) | Q3_K_M | 3 | 6.34 GB | 8.84 GB | very small, high quality loss |
[orca-2-13b.Q3_K_L.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q3_K_L.gguf) | Q3_K_L | 3 | 6.93 GB | 9.43 GB | small, substantial quality loss |
[orca-2-13b.Q4_0.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q4_0.gguf) | Q4_0 | 4 | 7.37 GB | 9.87 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
[orca-2-13b.Q4_K_S.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q4_K_S.gguf) | Q4_K_S | 4 | 7.41 GB | 9.91 GB | small, greater quality loss |
[orca-2-13b.Q4_K_M.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q4_K_M.gguf) | Q4_K_M | 4 | 7.87 GB | 10.37 GB | medium, balanced quality - recommended |
[orca-2-13b.Q5_0.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q5_0.gguf) | Q5_0 | 5 | 8.97 GB | 11.47 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
[orca-2-13b.Q5_K_S.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q5_K_S.gguf) | Q5_K_S | 5 | 8.97 GB | 11.47 GB | large, low quality loss - recommended |
[orca-2-13b.Q5_K_M.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q5_K_M.gguf) | Q5_K_M | 5 | 9.23 GB | 11.73 GB | large, very low quality loss - recommended |
[orca-2-13b.Q6_K.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q6_K.gguf) | Q6_K | 6 | 10.68 GB | 13.18 GB | very large, extremely low quality loss |
[orca-2-13b.Q8_0.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q8_0.gguf) | Q8_0 | 8 | 13.83 GB | 16.33 GB | very large, extremely low quality loss - not recommended |
Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
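The Max RAM column is simply the file size plus about 2.5 GB of working overhead; a sketch reproducing the Q4_K_M row (the 2.5 GB constant is read off the table above, not a llama.cpp guarantee):

# "Max RAM required" = file size + ~2.5 GB overhead (no GPU offload)
file_size_gb = 7.87                 # orca-2-13b.Q4_K_M.gguf
overhead_gb = 2.50                  # constant difference observed in the table
print(file_size_gb + overhead_gb)   # -> 10.37 GB, matching the Q4_K_M row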
🔧 Technical Details
Orca 2 is a finetuned version of LLaMA-2. Its training data is a synthetic dataset created to enhance the small model’s reasoning abilities. All synthetic training data was moderated using the Microsoft Azure content filters. More details about the model can be found in the Orca 2 paper.
📄 License
Orca 2 is licensed under the Microsoft Research License. Llama 2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.
Bias, Risks, and Limitations
Orca 2, built upon the LLaMA 2 model family, retains many of its limitations, as well as the common limitations of other large language models and limitations caused by its own training process, including:
⚠️ Important Note
- Data Biases: Large language models, trained on extensive data, can inadvertently carry biases present in the source data. Consequently, the models may generate outputs that could be potentially biased or unfair.
- Lack of Contextual Understanding: Despite their impressive capabilities in language understanding and generation, these models exhibit limited real-world understanding, resulting in potential inaccuracies or nonsensical responses.
- Lack of Transparency: Due to their complexity and size, large language models can act as “black boxes”, making it difficult to comprehend the rationale behind specific outputs or decisions. We recommend reviewing the transparency notes from Azure for more information.
- Content Harms: Large language models can cause various types of content harms. It is important to be aware of them when using these models and to take action to prevent them; leveraging the content moderation services provided by different companies and institutions is recommended. We hope for better regulations and standards from governments and technology leaders around content harms for AI technologies in the future, and we value and acknowledge the important role that the research and open-source communities can play in this direction.
- Hallucination: Do not rely entirely on a given language model for critical decisions or information with deep impact, as it is not obvious how to prevent these models from fabricating content. Moreover, it is unclear whether small models are more susceptible to hallucination in ungrounded generation use cases because of their smaller size and hence reduced memorization capacity. This is an active research area, and we hope for more rigorous measurement, understanding, and mitigation of this issue.
Discord
For further support, and discussions on these models and AI in general, join us at: TheBloke AI's Discord server
Thanks, and how to contribute
Thanks to the chirper.ai team and Clay from [gpus.llm-utils.org](https://gpus.llm-utils.org)!
If you're able and willing to contribute, it will be most gratefully received and will help to keep providing more models and start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.
- Patreon: https://patreon.com/TheBlokeAI
- Ko-Fi: https://ko-fi.com/TheBlokeAI
Special thanks to: Aemon Algiz.
Patreon special mentions: Brandon Frisco, LangChain4j, Spiking Neurons AB, transmissions 11, Joseph William Delisle, Nitin Borwankar, Willem Michiel, Michael Dempsey, vamX, Jeffrey Morgan, zynix, jjj, Omer Bin Jawed, Sean Connelly, jinyuan sun, Jeromy Smith, Shadi, Pawan Osman, Chadd, Elijah Stavena, Illia Dulskyi, Sebastain Graf, Stephen Murray, terasurfer, Edmond Seymore, Celu Ramasamy, Mandus, Alex, biorpg, Ajan Kanaga, Clay Pascal, Raven Klaugh, 阿明, K, ya boyyy, usrbinkat, Alicia Loh, John Villwock, ReadyPlayerEmma, Chris Smitley, Cap'n Zoog, fincy, GodLy, S_X, sidney chen, Cory Kujawski, OG, Mano Prime, AzureBlack, Pieter, Kalila, Spencer Kim, Tom X Nguyen, Stanislav Ovsiannikov, Michael Levine, Andrey, Trailburnt, Vadim, Enrico Ros, Talal Aujan, Brandon Phillips, Jack West, Eugene Pentland, Michael Davis, Will Dee, webtim, Jonathan Leane, Alps Aficionado, Rooh Singh, Tiffany J. Kim, theTransient, Luke @flexchar, Elle, Caitlyn Gatomon, Ari Malik, subjectnull, Johann-Peter Hartmann, Trenton Dambrowitz, Imad Khwaja, Asp the Wyvern, Emad Mostaque, Rainer Wilmers, Alexandros Triantafyllidis, Nicholas, Pedro Madruga, SuperWojo, Harry Royden McLaughlin, James Bentley, Olakabola, David Ziegler, Ai Maven, Jeff Scroggin, Nikolai Manek, Deo Leter, Matthew Berman, Fen Risland, Ken Nordquist, Manuel Alberto Morcote, Luke Pendergrass, TL, Fred von Graf, Randy H, Dan Guido, NimbleBox.ai, Vitor Caleffi, Gabriel Tamborski, knownsqashed, Lone Striker, Erik Bjäreholt, John Detwiler, Leonard Tan, Iucharbius

