🚀 Orca 2 13B - GGUF
This repository provides GGUF format model files for Microsoft's Orca 2 13B, facilitating various inference scenarios.
🚀 Quick Start
These model files were quantized using hardware kindly provided by Massed Compute.
✨ Features
- Multiple Inference Options: AWQ and GPTQ models for GPU inference, plus 2-, 3-, 4-, 5-, 6- and 8-bit GGUF models for CPU+GPU inference.
- Wide Compatibility: Works with llama.cpp from August 27th, 2023 onwards, and with many third-party UIs and libraries.
- Diverse Quantization Methods: Several quantization methods are provided to balance model size against quality.
📦 Installation
Downloading GGUF Files
- Manual Downloaders: Avoid cloning the entire repo. Most users only need to pick a single file.
- Automated Download: Clients like LM Studio, LoLLMS Web UI, and Faraday.dev can automatically download models.
In text-generation-webui
Under Download Model, enter the model repo TheBloke/Orca-2-13B-GGUF and a specific filename (e.g. orca-2-13b.Q4_K_M.gguf), then click Download.
On the Command Line
pip3 install huggingface-hub
Download an individual model file:
huggingface-cli download TheBloke/Orca-2-13B-GGUF orca-2-13b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
Download multiple files with a pattern:
huggingface-cli download TheBloke/Orca-2-13B-GGUF --local-dir . --local-dir-use-symlinks False --include='*Q4_K*gguf'
To accelerate downloads on fast connections, install hf_transfer:
pip3 install hf_transfer
Set the environment variable:
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download TheBloke/Orca-2-13B-GGUF orca-2-13b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
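The same download can also be scripted from Python using the huggingface_hub library installed above; this is a minimal sketch (the repo and filename match the CLI example):

from huggingface_hub import hf_hub_download

# Download a single GGUF file into the current directory
model_path = hf_hub_download(
    repo_id="TheBloke/Orca-2-13B-GGUF",
    filename="orca-2-13b.Q4_K_M.gguf",
    local_dir=".",
)
print(model_path)  # local path to the downloaded .gguf file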
💻 Usage Examples
Example llama.cpp command
./main -ngl 32 -m orca-2-13b.Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant"
- Change -ngl 32 to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
- Change -c 4096 to the desired sequence length.
- For a chat-style conversation, replace the -p <PROMPT> argument with -i -ins.
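The same invocation can be reproduced from Python with llama-cpp-python (listed among compatible libraries below). This is a sketch, assuming the package is installed with pip install llama-cpp-python; the system message and user prompt are placeholders:

from llama_cpp import Llama

# Mirror the llama.cpp flags: -ngl 32 -> n_gpu_layers, -c 4096 -> n_ctx
llm = Llama(
    model_path="orca-2-13b.Q4_K_M.gguf",
    n_gpu_layers=32,  # set to 0 if you have no GPU acceleration
    n_ctx=4096,
)

prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWrite a haiku about llamas.<|im_end|>\n"
    "<|im_start|>assistant"
)

output = llm(prompt, max_tokens=256, temperature=0.7, repeat_penalty=1.1, stop=["<|im_end|>"])
print(output["choices"][0]["text"])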
How to run in text-generation-webui
Refer to [text-generation-webui/docs/04 ‐ Model Tab.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/04%20%E2%80%90%20Model%20Tab.md#llamacpp) for further instructions.
How to run from Python code
Using ctransformers
Install the package:
# Base ctransformers with no GPU acceleration
pip install ctransformers
# Or with CUDA GPU acceleration
pip install ctransformers[cuda]
# Or with AMD ROCm GPU acceleration (Linux only)
CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers
# Or with Metal GPU acceleration for macOS systems only
CT_METAL=1 pip install ctransformers --no-binary ctransformers
Simple example code:
from ctransformers import AutoModelForCausalLM
# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Orca-2-13B-GGUF", model_file="orca-2-13b.Q4_K_M.gguf", model_type="llama", gpu_layers=50)
print(llm("AI is going to"))
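ctransformers can also stream tokens and accepts common generation parameters. A sketch continuing from the llm object above (parameter names as documented by ctransformers; adjust values to taste):

# Stream tokens as they are generated instead of waiting for the full string
for token in llm("AI is going to", max_new_tokens=128, temperature=0.7, repetition_penalty=1.1, stream=True):
    print(token, end="", flush=True)
print()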
📚 Documentation
About GGUF
GGUF is a new format introduced by the llama.cpp team on August 21st, 2023. It replaces GGML, which is no longer supported by llama.cpp.
Clients and libraries known to support GGUF include:
- llama.cpp
- [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
- KoboldCpp
- LM Studio
- [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui)
- Faraday.dev
- ctransformers
- [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
- candle
Repositories available
- [AWQ model(s) for GPU inference](https://huggingface.co/TheBloke/Orca-2-13B-AWQ)
- [GPTQ models for GPU inference, with multiple quantisation parameter options](https://huggingface.co/TheBloke/Orca-2-13B-GPTQ)
- [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/Orca-2-13B-GGUF)
- [Microsoft's original unquantised fp16 model in PyTorch format, for GPU inference and for further conversions](https://huggingface.co/microsoft/Orca-2-13b)
Prompt template: ChatML
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
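For programmatic use, the template can be filled with a small helper. This is a hypothetical sketch (format_chatml is not part of any library referenced here):

def format_chatml(system_message: str, prompt: str) -> str:
    """Fill the ChatML prompt template shown above."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        f"<|im_start|>assistant"
    )

print(format_chatml("You are a helpful assistant.", "Why is the sky blue?"))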
Compatibility
These quantized GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third-party UIs and libraries.
Explanation of quantisation methods
The new methods available are:
- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
Refer to the Provided Files table below to see what files use which methods, and how.
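As a quick check of the Q4_K figure above, 4.5 bpw follows from the layout described in the list, assuming the super-block also stores one fp16 scale and one fp16 min (the layout commonly documented for llama.cpp k-quants):

# Q4_K: super-blocks of 8 blocks x 32 weights = 256 weights
weights = 8 * 32
quant_bits = weights * 4            # 4-bit quantized weights
block_meta_bits = 8 * 6 + 8 * 6     # 6-bit scale and 6-bit min per block
super_block_bits = 2 * 16           # fp16 scale and fp16 min per super-block (assumed)
total_bits = quant_bits + block_meta_bits + super_block_bits
print(total_bits / weights)         # -> 4.5 bits per weight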
Provided files
Name | Quant method | Bits | Size | Max RAM required | Use case |
---|---|---|---|---|---|
[orca-2-13b.Q2_K.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q2_K.gguf) | Q2_K | 2 | 5.43 GB | 7.93 GB | smallest, significant quality loss - not recommended for most purposes |
[orca-2-13b.Q3_K_S.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q3_K_S.gguf) | Q3_K_S | 3 | 5.66 GB | 8.16 GB | very small, high quality loss |
[orca-2-13b.Q3_K_M.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q3_K_M.gguf) | Q3_K_M | 3 | 6.34 GB | 8.84 GB | very small, high quality loss |
[orca-2-13b.Q3_K_L.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q3_K_L.gguf) | Q3_K_L | 3 | 6.93 GB | 9.43 GB | small, substantial quality loss |
[orca-2-13b.Q4_0.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q4_0.gguf) | Q4_0 | 4 | 7.37 GB | 9.87 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
[orca-2-13b.Q4_K_S.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q4_K_S.gguf) | Q4_K_S | 4 | 7.41 GB | 9.91 GB | small, greater quality loss |
[orca-2-13b.Q4_K_M.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q4_K_M.gguf) | Q4_K_M | 4 | 7.87 GB | 10.37 GB | medium, balanced quality - recommended |
[orca-2-13b.Q5_0.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q5_0.gguf) | Q5_0 | 5 | 8.97 GB | 11.47 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
[orca-2-13b.Q5_K_S.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q5_K_S.gguf) | Q5_K_S | 5 | 8.97 GB | 11.47 GB | large, low quality loss - recommended |
[orca-2-13b.Q5_K_M.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q5_K_M.gguf) | Q5_K_M | 5 | 9.23 GB | 11.73 GB | large, very low quality loss - recommended |
[orca-2-13b.Q6_K.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q6_K.gguf) | Q6_K | 6 | 10.68 GB | 13.18 GB | very large, extremely low quality loss |
[orca-2-13b.Q8_0.gguf](https://huggingface.co/TheBloke/Orca-2-13B-GGUF/blob/main/orca-2-13b.Q8_0.gguf) | Q8_0 | 8 | 13.83 GB | 16.33 GB | very large, extremely low quality loss - not recommended |
Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
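The Max RAM column is simply the file size plus about 2.5 GB of working overhead; a sketch reproducing the Q4_K_M row (the 2.5 GB constant is read off the table above, not a llama.cpp guarantee):

# "Max RAM required" = file size + ~2.5 GB overhead (no GPU offload)
file_size_gb = 7.87                 # orca-2-13b.Q4_K_M.gguf
overhead_gb = 2.50                  # constant difference observed in the table
print(file_size_gb + overhead_gb)   # -> 10.37 GB, matching the Q4_K_M row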
🔧 Technical Details
Orca 2 is a finetuned version of LLaMA-2. Its training data is a synthetic dataset created to enhance the small model’s reasoning abilities. All synthetic training data was moderated using the Microsoft Azure content filters. More details about the model can be found in the Orca 2 paper.
📄 License
Orca 2 is licensed under the Microsoft Research License. Llama 2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.
Bias, Risks, and Limitations
Orca 2, built upon the LLaMA 2 model family, retains many of its limitations, as well as the common limitations of other large language models and limitations caused by its own training process, including:
⚠️ Important Note
- Data Biases: Large language models, trained on extensive data, can inadvertently carry biases present in the source data. Consequently, the models may generate outputs that could be potentially biased or unfair.
- Lack of Contextual Understanding: Despite their impressive capabilities in language understanding and generation, these models exhibit limited real-world understanding, resulting in potential inaccuracies or nonsensical responses.
- Lack of Transparency: Due to their complexity and size, large language models can act as “black boxes”, making it difficult to comprehend the rationale behind specific outputs or decisions. We recommend reviewing the transparency notes from Azure for more information.
- Content Harms: Large language models can cause various types of content harms. It is important to be aware of them when using these models and to take action to prevent them; leveraging the content moderation services provided by different companies and institutions is recommended. We hope for better regulations and standards from governments and technology leaders around content harms for AI technologies in the future, and we value and acknowledge the important role that the research and open-source communities can play in this direction.
- Hallucination: Do not rely entirely on a given language model for critical decisions or information with deep impact, as it is not obvious how to prevent these models from fabricating content. Moreover, it is unclear whether small models are more susceptible to hallucination in ungrounded generation use cases because of their smaller size and hence reduced memorization capacity. This is an active research area, and we hope for more rigorous measurement, understanding, and mitigation of this issue.
Discord
For further support, and discussions on these models and AI in general, join us at: TheBloke AI's Discord server
Thanks, and how to contribute
Thanks to the chirper.ai team and Clay from [gpus.llm-utils.org](https://gpus.llm-utils.org)!
If you're able and willing to contribute, it will be most gratefully received and will help to keep providing more models and start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.
- Patreon: https://patreon.com/TheBlokeAI
- Ko-Fi: https://ko-fi.com/TheBlokeAI
Special thanks to: Aemon Algiz.
Patreon special mentions: Brandon Frisco, LangChain4j, Spiking Neurons AB, transmissions 11, Joseph William Delisle, Nitin Borwankar, Willem Michiel, Michael Dempsey, vamX, Jeffrey Morgan, zynix, jjj, Omer Bin Jawed, Sean Connelly, jinyuan sun, Jeromy Smith, Shadi, Pawan Osman, Chadd, Elijah Stavena, Illia Dulskyi, Sebastain Graf, Stephen Murray, terasurfer, Edmond Seymore, Celu Ramasamy, Mandus, Alex, biorpg, Ajan Kanaga, Clay Pascal, Raven Klaugh, 阿明, K, ya boyyy, usrbinkat, Alicia Loh, John Villwock, ReadyPlayerEmma, Chris Smitley, Cap'n Zoog, fincy, GodLy, S_X, sidney chen, Cory Kujawski, OG, Mano Prime, AzureBlack, Pieter, Kalila, Spencer Kim, Tom X Nguyen, Stanislav Ovsiannikov, Michael Levine, Andrey, Trailburnt, Vadim, Enrico Ros, Talal Aujan, Brandon Phillips, Jack West, Eugene Pentland, Michael Davis, Will Dee, webtim, Jonathan Leane, Alps Aficionado, Rooh Singh, Tiffany J. Kim, theTransient, Luke @flexchar, Elle, Caitlyn Gatomon, Ari Malik, subjectnull, Johann-Peter Hartmann, Trenton Dambrowitz, Imad Khwaja, Asp the Wyvern, Emad Mostaque, Rainer Wilmers, Alexandros Triantafyllidis, Nicholas, Pedro Madruga, SuperWojo, Harry Royden McLaughlin, James Bentley, Olakabola, David Ziegler, Ai Maven, Jeff Scroggin, Nikolai Manek, Deo Leter, Matthew Berman, Fen Risland, Ken Nordquist, Manuel Alberto Morcote, Luke Pendergrass, TL, Fred von Graf, Randy H, Dan Guido, NimbleBox.ai, Vitor Caleffi, Gabriel Tamborski, knownsqashed, Lone Striker, Erik Bjäreholt, John Detwiler, Leonard Tan, Iucharbius

