🚀 PLLuM-8x7B-chat GGUF (Unofficial)
This repository offers quantized versions of the PLLuM-8x7B-chat model in GGUF format. These versions are optimized for local execution using llama.cpp and related tools. Quantization significantly reduces the model size while maintaining good text generation quality, enabling the model to run on standard hardware.
This is the only repository that provides both full-precision reference versions (F16 and BF16) of the PLLuM-8x7B-chat model, as well as the IQ3_S quantization.
The GGUF version allows you to run the model in LM Studio or Ollama, among other platforms.
🚀 Quick Start
To get started quickly, first download the model using the huggingface-cli tool, and then run it with your preferred method, for example as in the sketch below.
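A minimal end-to-end sketch using the Q4_K_M quantization (the same commands are explained in detail in the sections below; it assumes llama.cpp's llama-cli is already built and available):

```bash
# Download the Q4_K_M quantization (about 28 GB) into ./models/
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF \
  --include "PLLuM-8x7B-chat-gguf-q4_k_m.gguf" --local-dir ./models/

# Run a single prompt with llama.cpp
./llama-cli -m models/PLLuM-8x7B-chat-gguf-q4_k_m.gguf \
  --prompt "Pytanie: Jakie są największe miasta w Polsce? Odpowiedź:"
```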
✨ Features
- Multiple Quantization Options: Offers a variety of quantization types, such as Q2_K, IQ3_S, Q3_K_M, Q4_K_M, Q5_K_M, Q8_0, F16, and BF16, to meet different hardware and quality requirements.
- Local Execution: Optimized for local execution using llama.cpp and related tools, enabling you to run the model on your own hardware.
- Compatibility: The GGUF version is compatible with popular platforms like LM Studio and Ollama.
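As an illustration of the Ollama compatibility mentioned above, here is a minimal, hypothetical Modelfile sketch; the model name pllum and the local file path are assumptions, so adjust them to your setup:

```bash
# Point a Modelfile at a previously downloaded GGUF file (path is an example)
cat > Modelfile <<'EOF'
FROM ./PLLuM-8x7B-chat-gguf-q4_k_m.gguf
EOF

# Build a local Ollama model named "pllum" (name chosen for this example) and chat with it
ollama create pllum -f Modelfile
ollama run pllum "Pytanie: Jakie są największe miasta w Polsce? Odpowiedź:"
```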
📦 Installation
Downloading the model using huggingface-cli
First, make sure you have the huggingface-cli tool installed:
pip install -U "huggingface_hub[cli]"
Downloading smaller models
To download a specific model smaller than 50GB (e.g., q4_k_m):
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-q4_k_m.gguf" --local-dir ./
You can also download other quantizations by changing the filename:
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-q3_k_m.gguf" --local-dir ./
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-iq3_s.gguf" --local-dir ./
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-q5_k_m.gguf" --local-dir ./
Downloading larger models (split into parts)
For large models, such as F16 or BF16, the files are split into smaller parts. To download all parts into a local folder:
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-F16/*" --local-dir ./F16/
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-bf16/*" --local-dir ./bf16/
Faster downloads with hf_transfer
To significantly speed up downloading (up to 1GB/s), you can use the hf_transfer library:
pip install hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-q4_k_m.gguf" --local-dir ./
Joining split files after downloading
If you downloaded a split model, you can join it using:
Linux/macOS:
cat PLLuM-8x7B-chat-gguf-F16.part-* > PLLuM-8x7B-chat-gguf-F16.gguf
Windows:
copy /b PLLuM-8x7B-chat-gguf-F16.part-* PLLuM-8x7B-chat-gguf-F16.gguf
💻 Usage Examples
Using llama.cpp
In these examples, we will use the PLLuM model from our unofficial repository. You can download your preferred quantization from the available models table in the Documentation section below.
Once downloaded, place your model in the models directory.
Unix-based systems (Linux, macOS, etc.):
Input prompt (One-and-done)
./llama-cli -m models/PLLuM-8x7B-chat-gguf-q4_k_m.gguf --prompt "Pytanie: Jakie są największe miasta w Polsce? Odpowiedź:"
Windows:
Input prompt (One-and-done)
./llama-cli.exe -m models\PLLuM-8x7B-chat-gguf-q4_k_m.gguf --prompt "Pytanie: Jakie są największe miasta w Polsce? Odpowiedź:"
For detailed and up-to-date information, please refer to the official llama.cpp documentation.
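If you prefer an HTTP API instead of the CLI, llama.cpp also ships a server binary. A hedged sketch (the context size and port are arbitrary example values, and llama-server must already be built):

```bash
# Start the llama.cpp HTTP server with the Q4_K_M model
./llama-server -m models/PLLuM-8x7B-chat-gguf-q4_k_m.gguf -c 4096 --port 8080

# Query the OpenAI-compatible chat endpoint from another terminal
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Jakie są największe miasta w Polsce?"}], "max_tokens": 256}'
```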
Using text-generation-webui
git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui
pip install -r requirements.txt
python server.py --model path/to/PLLuM-8x7B-chat-gguf-q4_k_m.gguf
Using Python and llama-cpp-python
from llama_cpp import Llama

# Load the quantized model; adjust n_threads to match your CPU core count
llm = Llama(
    model_path="path/to/PLLuM-8x7B-chat-gguf-q4_k_m.gguf",
    n_ctx=4096,     # context window size
    n_threads=8,    # CPU threads used for inference
    n_batch=512     # prompt-processing batch size
)

prompt = "Pytanie: Jakie są najciekawsze zabytki w Krakowie? Odpowiedź:"

# Generate a completion for the prompt
output = llm(
    prompt,
    max_tokens=512,
    temperature=0.7,
    top_p=0.95
)

print(output["choices"][0]["text"])
📚 Documentation
Available models
| Filename | Size | Quantization type | Recommended hardware | Usage |
|---|---|---|---|---|
| PLLuM-8x7B-chat-gguf-q2_k.gguf | 17 GB | Q2_K | CPU, min. 20 GB RAM | Very weak computers, lowest quality |
| PLLuM-8x7B-chat-gguf-iq3_s.gguf | 20.4 GB | IQ3_S | CPU, min. 24 GB RAM | Weaker computers, acceptable quality |
| PLLuM-8x7B-chat-gguf-q3_k_m.gguf | 22.5 GB | Q3_K_M | CPU, min. 26 GB RAM | Good compromise between size and quality |
| PLLuM-8x7B-chat-gguf-q4_k_m.gguf | 28.4 GB | Q4_K_M | CPU/GPU, min. 32 GB RAM | Recommended for most applications |
| PLLuM-8x7B-chat-gguf-q5_k_m.gguf | 33.2 GB | Q5_K_M | CPU/GPU, min. 40 GB RAM | High quality with reasonable size |
| PLLuM-8x7B-chat-gguf-q8_0.gguf | 49.6 GB | Q8_0 | GPU, min. 52 GB RAM | Highest quality, close to the original |
| PLLuM-8x7B-chat-gguf-F16 | ~85 GB | F16 | GPU, min. 85 GB VRAM | Reference model without quantization |
| PLLuM-8x7B-chat-gguf-bf16 | ~85 GB | BF16 | GPU, min. 85 GB VRAM | Alternative full-precision format |
What is quantization?
Quantization is the process of reducing the precision of model weights, which decreases memory requirements while maintaining acceptable quality of the generated text. GGUF (GPT-Generated Unified Format) is the successor to the GGML format and enables large language models to run efficiently on consumer hardware.
Which model to choose?
- Q2_K, IQ3_S and Q3_K_M: The smallest versions of the model, ideal when memory savings are a priority
- Q4_K_M: Recommended for most applications - good balance between quality and size
- Q5_K_M: Choose when you care about better quality and have the appropriate amount of memory
- Q8_0: Highest quality on GPU, smallest quality decrease compared to the original
- F16/BF16: Full precision, reference versions without quantization
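Before choosing a quantization, you can check how much memory your machine actually has with standard system tools, for example:

```bash
# Linux: total and available RAM
free -h

# macOS: total RAM in bytes
sysctl hw.memsize

# NVIDIA GPU (if you plan to offload layers): total and free VRAM
nvidia-smi --query-gpu=memory.total,memory.free --format=csv
```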
🔧 Technical Details
About the PLLuM model
PLLuM (Polish Large Language Model) is a family of Polish language models developed by a consortium of Polish research institutions under the auspices of the Polish Ministry of Digital Affairs. This version of the model (8x7B-chat) has been optimized for conversational use (chat).
Model capabilities:
- Generating text in Polish
- Answering questions
- Summarizing texts
- Creating content
- Translation
- Explaining concepts
- Conducting conversations
📄 License
The base PLLuM 8x7B-chat model is distributed under the Apache License 2.0. Quantized versions are subject to the same license.
Author
The author of this repository and the quantizations is Piotr Bednarski