# 🐐 Open Cabrita 3B - GGUF
Open Cabrita 3B - GGUF is a quantized version of the Open Cabrita 3B model, offering different quantization methods to balance accuracy and resource usage.
## 🚀 Quick Start

### Model Information

### Included Files
| Name | Quant Method | Bits | Size | Description |
|------|--------------|------|------|-------------|
| opencabrita3b-q4_0.gguf | q4_0 | 4 | 1.94 GB | 4-bit quantization. Smallest file and lowest accuracy of the listed options. |
| opencabrita3b-q4_1.gguf | q4_1 | 4 | 2.14 GB | 4-bit quantization. Higher accuracy than q4_0 but lower than q5_0; faster inference than the q5 models. |
| opencabrita3b-q5_0.gguf | q5_0 | 5 | 2.34 GB | 5-bit quantization. Higher accuracy than q4_1; higher resource usage and slower inference. |
| opencabrita3b-q5_1.gguf | q5_1 | 5 | 2.53 GB | 5-bit quantization. Even higher accuracy; higher resource usage and slower inference. |
| opencabrita3b-q8_0.gguf | q8_0 | 8 | 3.52 GB | 8-bit quantization. Almost indistinguishable from float16, but resource-heavy and slower. |
## ⚠️ Important Note

The RAM figures implied by the sizes above assume no GPU offloading. If layers are offloaded to the GPU, RAM usage is reduced and VRAM is used instead.
## 📦 Installation

### Running with llama.cpp
I used the following command. Adjust it to your needs:
```sh
./main -m ./models/open-cabrita3b/opencabrita3b-q5_1.gguf --color --temp 0.5 -n 256 -p "### Instruction: {command} ### Response: "
```

To understand the parameters, see the llama.cpp documentation.
You can try it for free on Google Colab: Open_Cabrita_llamacpp_5_1.ipynb
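The same call can be scripted from Python with llama-cpp-python (listed under Documentation below). This is a minimal sketch, not an official recipe: the model path mirrors the command above, while `n_ctx` and `n_gpu_layers` are assumptions to tune for your hardware. Raising `n_gpu_layers` offloads layers to the GPU, as described in the note above.

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
# The model path mirrors the CLI command above; n_ctx and n_gpu_layers
# are assumptions to adjust for your hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/open-cabrita3b/opencabrita3b-q5_1.gguf",
    n_ctx=2048,       # context window size
    n_gpu_layers=0,   # raise to offload layers to the GPU (uses VRAM, saves RAM)
)

output = llm(
    "### Instruction: Summarize what quantization does. ### Response: ",
    max_tokens=256,
    temperature=0.5,
)
print(output["choices"][0]["text"])
```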
## 📚 Documentation

### About the GGUF Format
GGUF is a new format introduced by the llama.cpp team on August 21, 2023. It is a replacement for GGML, which is no longer supported by llama.cpp.
The main benefit of GGUF is that it is an extensible, future-proof format that stores more information about the model as metadata. It also includes significantly improved tokenization code, with full support for special tokens for the first time. This should improve performance, especially for models that use new special tokens and custom prompt templates.
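As an illustration of that metadata, the sketch below lists the key-value fields a GGUF file carries. It assumes the `gguf` Python package maintained in the llama.cpp repository; the file path is a placeholder.

```python
# Minimal sketch, assuming the gguf package from the llama.cpp repo
# (pip install gguf); the file path is a placeholder.
from gguf import GGUFReader

reader = GGUFReader("./models/open-cabrita3b/opencabrita3b-q5_1.gguf")

# GGUF stores model information as named key-value metadata fields,
# e.g. architecture details and tokenizer/special-token settings.
for name in reader.fields:
    print(name)

print(f"{len(reader.tensors)} tensors stored in the file")
```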
Here is a list of clients and libraries known to support GGUF:
- llama.cpp.
- text-generation-webui, the most widely used web interface. Supports GGUF with GPU acceleration via the ctransformers backend - the llama-cpp-python backend should work soon too.
- KoboldCpp, now supports GGUF starting from version 1.41! A powerful GGML web interface, with full GPU acceleration. Especially good for storytelling.
- LM Studio, versions 0.2.2 and later support GGUF. A fully equipped local GUI with GPU acceleration on both Windows (NVIDIA and AMD) and macOS.
- LoLLMS Web UI, should work now, choose the c_transformers backend. A great web interface with many interesting features. Supports CUDA GPU acceleration.
- ctransformers, now supports GGUF starting from version 0.2.24! A Python library with GPU acceleration, LangChain support, and an OpenAI-compatible API server (see the usage sketch after this list).
- llama-cpp-python, supports GGUF starting from version 0.1.79. A Python library with GPU acceleration, LangChain support, and an OpenAI-compatible API server.
- candle, added GGUF support on August 22. Candle is a Rust ML framework focused on performance, including GPU support and ease of use.
- LocalAI, added GGUF support on August 23. LocalAI provides a REST API for LLM and image generation models.
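As a concrete example from the list above, here is a minimal ctransformers sketch; the local file path and `gpu_layers` value are assumptions to adapt to your setup.

```python
# Minimal sketch using ctransformers (pip install ctransformers).
# The local model path and gpu_layers value are assumptions.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "./models/open-cabrita3b/opencabrita3b-q5_1.gguf",
    model_type="llama",  # Open Cabrita 3B derives from the LLaMA family
    gpu_layers=0,        # raise to offload layers to the GPU
)

print(llm("### Instruction: What is GGUF? ### Response: ", max_new_tokens=128))
```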
### Template

```
### Instruction:
{prompt}
### Response:
```
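A small helper, shown as a sketch, for filling this template before passing the result to any of the clients above:

```python
def build_prompt(instruction: str) -> str:
    """Fill the instruction template shown above."""
    return f"### Instruction:\n{instruction}\n### Response:\n"

# Example: llm(build_prompt("Summarize GGUF in one sentence."))
```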
## 📄 License
This project is licensed under the Apache 2.0 License.