🚀 Bielik-4.5B-v3
Bielik-4.5B-v3 is a generative text model with 4.6 billion parameters. It is the result of a collaboration between the open-science/open-source project SpeakLeash and the High Performance Computing (HPC) center ACK Cyfronet AGH. The model was trained on Polish text corpora that were carefully selected and processed by the SpeakLeash team, using Polish large-scale computing infrastructure in the PLGrid environment at ACK Cyfronet AGH. Training was supported by computational grants PLG/2024/017214 and PLG/2025/018338 on the Athena and Helios supercomputers. The model is designed to understand and process Polish with high accuracy across linguistic tasks.
⚠️ This is a base model intended for further fine-tuning in most use cases. If you need a model that is ready for chatting or following instructions out of the box, use Bielik-4.5B-v3-Instruct.
📄 Technical report: https://arxiv.org/abs/2505.02550
🚀 Quick Start
The model can be loaded with the AutoModelForCausalLM class.
Basic Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "speakleash/Bielik-4.5B-v3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
Advanced Usage
To reduce memory usage, you can load the model in reduced precision (bfloat16):
import torch
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
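As a rough illustration of why reduced precision helps, the sketch below estimates the memory needed for the weights alone, using the 4.6 billion parameter count stated above. It is a back-of-the-envelope calculation, not a measurement: it ignores activations, the KV cache, and framework overhead, and the helper name `weight_memory_gib` is made up for this example.

```python
# Weight-only memory estimate (assumption: every parameter is stored in the given dtype).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}
N_PARAMS = 4.6e9  # parameter count stated in this model card


def weight_memory_gib(n_params: float, dtype: str) -> float:
    """Return the approximate memory needed for the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gib(N_PARAMS, dtype):.1f} GiB")
```

Under these assumptions the weights take about 17 GiB in float32 and about 8.6 GiB in bfloat16, so bfloat16 roughly halves the footprint.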
You can then use Hugging Face pipelines to generate text:
import transformers
text = "Najważniejszym celem człowieka na ziemi jest"
pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)
sequences = pipeline(max_new_tokens=100, do_sample=True, top_k=50, eos_token_id=tokenizer.eos_token_id, text_inputs=text)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
Generated output:
Najważniejszym celem człowieka na ziemi jest życie w pokoju, harmonii i miłości. Dla każdego z nas bardzo ważne jest, aby otaczać się kochanymi osobami.
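The call above enables sampling with `do_sample=True` and `top_k=50`. The following is a minimal, library-free sketch of what top-k sampling does at a single generation step. It is not the Transformers implementation. The function `top_k_sample` and the toy logits are hypothetical and exist only for illustration.

```python
import math
import random


def top_k_sample(logits: list[float], k: int, rng: random.Random) -> int:
    """Sample one token id from the k highest-scoring logits."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (subtract the max for numerical stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    # Draw one index in proportion to its renormalized probability.
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical scores for a 5-token vocabulary
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only token ids 0 and 1 can appear when k=2
```

Because generation is sampled, repeated runs of the pipeline can return different continuations for the same prompt.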
✨ Features
- Unique Collaboration: A product of the partnership between SpeakLeash and ACK Cyfronet AGH.
- Polish-Focused Training: Trained on high-quality Polish text corpora, enabling excellent Polish language processing.
- Advanced Infrastructure Utilization: Trained on Polish large-scale computing infrastructure and supercomputers.
📚 Documentation
Model
- Training Environment: The Bielik-4.5B-v3 model was trained on the Helios supercomputer at ACK Cyfronet AGH, using 256 NVIDIA GH200 cards.
- Training Dataset: Composed of Polish texts from the SpeakLeash project and a subset of CommonCrawl data. 292 billion tokens were used for 1.2 epochs of training.
- Training Framework: Trained with ALLaMo, an original open-source framework implemented by Krzysztof Ociepa that allows fast and efficient training of language models with architectures similar to LLaMA and Mistral.
Model description:
Quality evaluation
An XGBoost classification model was created to evaluate the quality of native Polish texts. It is based on 93 features, such as the ratio of out-of-vocabulary (OOV) words to all words, the number of nouns and verbs, and the average sentence length. For a given document it outputs a category (HIGH, MEDIUM, or LOW) along with a probability. This approach helps select high-quality texts for training.
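To make the idea of document-level features concrete, here is a small sketch that computes two of the features named above: the OOV ratio and the average sentence length. It is an illustration under simplifying assumptions, not the SpeakLeash pipeline. The tokenization, the tiny vocabulary, and the function name `quality_features` are hypothetical. The remaining features (for example noun and verb counts, which need a part-of-speech tagger) and the trained XGBoost model itself are not shown.

```python
import re


def quality_features(text: str, vocabulary: set[str]) -> dict[str, float]:
    """Compute two toy document-level features for a quality classifier."""
    # Naive sentence split on terminal punctuation; empty fragments are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive word tokenization; \w matches Polish letters in Python 3.
    words = re.findall(r"\w+", text.lower())
    oov = sum(1 for w in words if w not in vocabulary)
    return {
        "oov_ratio": oov / len(words) if words else 0.0,
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }


vocab = {"ala", "ma", "kota", "kot", "psa"}  # hypothetical tiny vocabulary
feats = quality_features("Ala ma kota. Kot ma xyzzy.", vocab)
print(feats)
```

In a full setup, a feature vector like this (extended to all 93 features) would be passed to the classifier, which returns the category and its probability.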
🔧 Technical Details
Training was conducted on the Helios supercomputer at ACK Cyfronet AGH with 256 NVIDIA GH200 cards. The training dataset consisted of carefully selected and processed Polish texts from the SpeakLeash project and CommonCrawl. Training used the ALLaMo framework, which is designed for efficient training of language models with architectures similar to LLaMA and Mistral.
📄 License
The model is licensed under Apache 2.0 and the Terms of Use.
Limitations and Biases
⚠️ Important Note
Bielik-4.5B-v3 is not intended for deployment without fine-tuning. It should not be used for human-facing interactions without further guardrails and user consent. It can produce factually incorrect output and may generate lewd, false, biased, or offensive outputs due to the nature of the training data.
Citation
Please cite this model using the following format:
@misc{ociepa2025bielikv3smalltechnical,
title={Bielik v3 Small: Technical Report},
author={Krzysztof Ociepa and Łukasz Flis and Remigiusz Kinas and Krzysztof Wróbel and Adrian Gwoździej},
year={2025},
eprint={2505.02550},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2505.02550},
}
@misc{Bielik11Bv2b,
title = {Bielik-4.5B-v3 model card},
author = {Ociepa, Krzysztof and Flis, Łukasz and Wróbel, Krzysztof and Gwoździej, Adrian and {SpeakLeash Team} and {Cyfronet Team}},
year = {2025},
url = {https://huggingface.co/speakleash/Bielik-4.5B-v3},
note = {Accessed: 2025-05-06},
urldate = {2025-05-06}
}
Responsible for training the model
- Krzysztof Ociepa (SpeakLeash): Team leadership, conceptualizing, data preparation, process optimization, and oversight of training.
- Łukasz Flis (Cyfronet AGH): Coordinating and supervising the training.
- Remigiusz Kinas (SpeakLeash): Conceptualizing, coordinating RL training, data preparation, benchmarking, and quantization.
- Adrian Gwoździej (SpeakLeash): Data preparation and ensuring data quality.
- Krzysztof Wróbel (SpeakLeash): Benchmarks.
The model couldn't have been created without the entire SpeakLeash team. Other contributors include Sebastian Kondracki, Igor Ciuciura, Szymon Baczyński, Jacek Chwiła, Dominika Basaj, Kuba Sołtys, Karol Jezierski, Anna Przybył, Agnieszka Ratajska, Witold Wydmański, Izabela Babis, Nina Babis.
Members of the ACK Cyfronet AGH team who provided support and expertise are Szymon Mazurek, Marek Magryś, and Mieszko Cholewa.
We thank the Polish high-performance computing infrastructure PLGrid (HPC Center: ACK Cyfronet AGH) for providing computer facilities and support through computational grants PLG/2024/017214 and PLG/2025/018338.
Contact Us
💡 Usage Tip
If you have any questions or suggestions, please use the discussion tab. If you want to contact us directly, join the SpeakLeash Discord.