🚀 PLLuM-8x7B-chat GGUF (Unofficial)
This repository offers quantized versions of the PLLuM-8x7B-chat model in GGUF format. These versions are optimized for local execution using llama.cpp and related tools. Quantization significantly reduces the model size while maintaining good text generation quality, enabling the model to run on standard hardware.
This is the only repository that provides both full-precision reference versions (F16 and BF16) of the PLLuM-8x7B-chat model, as well as the IQ3_S quantization.
The GGUF version allows you to run the model in LM Studio or Ollama, among other platforms.
🚀 Quick Start
To get started quickly, first download the model using the huggingface-cli tool, and then run it with your preferred method, for example as in the sketch below.
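A minimal end-to-end sketch using the Q4_K_M quantization (the same commands are explained in detail in the sections below; it assumes llama.cpp's llama-cli is already built and available):

```bash
# Download the Q4_K_M quantization (about 28 GB) into ./models/
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF \
  --include "PLLuM-8x7B-chat-gguf-q4_k_m.gguf" --local-dir ./models/

# Run a single prompt with llama.cpp
./llama-cli -m models/PLLuM-8x7B-chat-gguf-q4_k_m.gguf \
  --prompt "Pytanie: Jakie są największe miasta w Polsce? Odpowiedź:"
```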
✨ Features
- Multiple Quantization Options: Offers a variety of quantization types, such as Q2_K, IQ3_S, Q3_K_M, Q4_K_M, Q5_K_M, Q8_0, F16, and BF16, to meet different hardware and quality requirements.
- Local Execution: Optimized for local execution using llama.cpp and related tools, enabling you to run the model on your own hardware.
- Compatibility: The GGUF version is compatible with popular platforms like LM Studio and Ollama.
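As an illustration of the Ollama compatibility mentioned above, here is a minimal, hypothetical Modelfile sketch; the model name pllum and the local file path are assumptions, so adjust them to your setup:

```bash
# Point a Modelfile at a previously downloaded GGUF file (path is an example)
cat > Modelfile <<'EOF'
FROM ./PLLuM-8x7B-chat-gguf-q4_k_m.gguf
EOF

# Build a local Ollama model named "pllum" (name chosen for this example) and chat with it
ollama create pllum -f Modelfile
ollama run pllum "Pytanie: Jakie są największe miasta w Polsce? Odpowiedź:"
```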
📦 Installation
Downloading the model using huggingface-cli
First, make sure you have the huggingface-cli tool installed:
pip install -U "huggingface_hub[cli]"
Downloading smaller models
To download a specific model smaller than 50GB (e.g., q4_k_m):
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-q4_k_m.gguf" --local-dir ./
You can also download other quantizations by changing the filename:
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-q3_k_m.gguf" --local-dir ./
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-iq3_s.gguf" --local-dir ./
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-q5_k_m.gguf" --local-dir ./
Downloading larger models (split into parts)
For large models, such as F16 or BF16, the files are split into smaller parts. To download all parts into a local folder:
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-F16/*" --local-dir ./F16/
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-bf16/*" --local-dir ./bf16/
Faster downloads with hf_transfer
To significantly speed up downloading (up to 1GB/s), you can use the hf_transfer library:
pip install hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-q4_k_m.gguf" --local-dir ./
Joining split files after downloading
If you downloaded a split model, you can join it using:
Linux/macOS:
cat PLLuM-8x7B-chat-gguf-F16.part-* > PLLuM-8x7B-chat-gguf-F16.gguf
Windows:
copy /b PLLuM-8x7B-chat-gguf-F16.part-* PLLuM-8x7B-chat-gguf-F16.gguf
💻 Usage Examples
Using llama.cpp
In these examples, we will use the PLLuM model from our unofficial repository. You can download your preferred quantization from the available models table in the Documentation section below.
Once downloaded, place your model in the models directory.
Unix-based systems (Linux, macOS, etc.):
Input prompt (One-and-done)
./llama-cli -m models/PLLuM-8x7B-chat-gguf-q4_k_m.gguf --prompt "Pytanie: Jakie są największe miasta w Polsce? Odpowiedź:"
Windows:
Input prompt (One-and-done)
./llama-cli.exe -m models\PLLuM-8x7B-chat-gguf-q4_k_m.gguf --prompt "Pytanie: Jakie są największe miasta w Polsce? Odpowiedź:"
For detailed and up-to-date information, please refer to the official llama.cpp documentation.
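If you prefer an HTTP API instead of the CLI, llama.cpp also ships a server binary. A hedged sketch (the context size and port are arbitrary example values, and llama-server must already be built):

```bash
# Start the llama.cpp HTTP server with the Q4_K_M model
./llama-server -m models/PLLuM-8x7B-chat-gguf-q4_k_m.gguf -c 4096 --port 8080

# Query the OpenAI-compatible chat endpoint from another terminal
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Jakie są największe miasta w Polsce?"}], "max_tokens": 256}'
```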
Using text-generation-webui
git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui
pip install -r requirements.txt
python server.py --model path/to/PLLuM-8x7B-chat-gguf-q4_k_m.gguf
Using Python and llama-cpp-python
from llama_cpp import Llama

# Load the quantized model; adjust n_threads to match your CPU core count
llm = Llama(
    model_path="path/to/PLLuM-8x7B-chat-gguf-q4_k_m.gguf",
    n_ctx=4096,     # context window size
    n_threads=8,    # CPU threads used for inference
    n_batch=512     # prompt-processing batch size
)

prompt = "Pytanie: Jakie są najciekawsze zabytki w Krakowie? Odpowiedź:"

# Generate a completion for the prompt
output = llm(
    prompt,
    max_tokens=512,
    temperature=0.7,
    top_p=0.95
)

print(output["choices"][0]["text"])
📚 Documentation
Available models
| Filename | Size | Quantization type | Recommended hardware | Usage |
|---|---|---|---|---|
| PLLuM-8x7B-chat-gguf-q2_k.gguf | 17 GB | Q2_K | CPU, min. 20 GB RAM | Very weak computers, lowest quality |
| PLLuM-8x7B-chat-gguf-iq3_s.gguf | 20.4 GB | IQ3_S | CPU, min. 24 GB RAM | Weaker computers, acceptable quality |
| PLLuM-8x7B-chat-gguf-q3_k_m.gguf | 22.5 GB | Q3_K_M | CPU, min. 26 GB RAM | Good compromise between size and quality |
| PLLuM-8x7B-chat-gguf-q4_k_m.gguf | 28.4 GB | Q4_K_M | CPU/GPU, min. 32 GB RAM | Recommended for most applications |
| PLLuM-8x7B-chat-gguf-q5_k_m.gguf | 33.2 GB | Q5_K_M | CPU/GPU, min. 40 GB RAM | High quality with reasonable size |
| PLLuM-8x7B-chat-gguf-q8_0.gguf | 49.6 GB | Q8_0 | GPU, min. 52 GB RAM | Highest quality, close to the original |
| PLLuM-8x7B-chat-gguf-F16 | ~85 GB | F16 | GPU, min. 85 GB VRAM | Reference model without quantization |
| PLLuM-8x7B-chat-gguf-bf16 | ~85 GB | BF16 | GPU, min. 85 GB VRAM | Alternative full-precision format |
What is quantization?
Quantization is the process of reducing the precision of model weights, which decreases memory requirements while maintaining acceptable quality of the generated text. GGUF (GPT-Generated Unified Format) is the successor to the GGML format and enables large language models to run efficiently on consumer hardware.
Which model to choose?
- Q2_K, IQ3_S and Q3_K_M: The smallest versions of the model, ideal when memory savings are a priority
- Q4_K_M: Recommended for most applications - good balance between quality and size
- Q5_K_M: Choose when you care about better quality and have the appropriate amount of memory
- Q8_0: Highest quality on GPU, smallest quality decrease compared to the original
- F16/BF16: Full precision, reference versions without quantization
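Before choosing a quantization, you can check how much memory your machine actually has with standard system tools, for example:

```bash
# Linux: total and available RAM
free -h

# macOS: total RAM in bytes
sysctl hw.memsize

# NVIDIA GPU (if you plan to offload layers): total and free VRAM
nvidia-smi --query-gpu=memory.total,memory.free --format=csv
```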
🔧 Technical Details
About the PLLuM model
PLLuM (Polish Large Language Model) is a family of Polish language models developed by a consortium of Polish research institutions under the auspices of the Polish Ministry of Digital Affairs. This version of the model (8x7B-chat) has been optimized for conversational use (chat).
Model capabilities:
- Generating text in Polish
- Answering questions
- Summarizing texts
- Creating content
- Translation
- Explaining concepts
- Conducting conversations
📄 License
The base PLLuM 8x7B-chat model is distributed under the Apache License 2.0. Quantized versions are subject to the same license.
Author
The author of this repository and the quantizations is Piotr Bednarski