Bielik-4.5B-v3 Open-Source Polish Text Generation Model - Generate High-Quality Polish Text for Free

Bielik 4.5B V3

Developed by speakleash

Bielik-4.5B-v3 is a Polish generative text model with 4.6 billion parameters, developed by SpeakLeash in collaboration with ACK Cyfronet AGH, trained on a curated Polish corpus.

Large Language Model

Transformers

Open Source License: Apache-2.0 #Polish language generation #Supercomputer training #4.6 billion parameters

Downloads 40

Release Time: 5/5/2025

Model Overview

This model is built for Polish language understanding and generation and can perform a range of language tasks. It is a base model, so most scenarios require fine-tuning before use.

Model Features

High-quality Polish language processing

Uses an XGBoost classification model to filter the training data, keeping only Polish texts with a quality rating of HIGH and a confidence above 90%

Supercomputer training

Trained on the Helios supercomputer at ACK Cyfronet AGH using 256 NVIDIA GH200 GPUs

Large-scale corpus

Trained on 292 billion tokens from the SpeakLeash curated corpus and a subset of CommonCrawl

Model Capabilities

Polish text generation

Language understanding

Contextual reasoning

Use Cases

Language processing

Content creation

Generate high-quality Polish text content

Sample output demonstrates the ability to generate coherent text fitting the context

Educational assistance

Used for Polish language learning or teaching material generation

🚀 Bielik-4.5B-v3

Bielik-4.5B-v3 is a generative text model with 4.6 billion parameters. It is the result of a collaboration between the open-science/open-source project SpeakLeash and the High Performance Computing (HPC) center ACK Cyfronet AGH. The model was trained on Polish text corpora selected and processed by the SpeakLeash team, using Polish large-scale computing infrastructure in the PLGrid environment at ACK Cyfronet AGH. Training ran on the Athena and Helios supercomputers under computational grants PLG/2024/017214 and PLG/2025/018338. According to its developers, the model understands and processes Polish exceptionally well, giving accurate responses and performing linguistic tasks with high precision.

⚠️ This is a base model intended for further fine-tuning in most use cases. If you need a model that can chat or follow instructions out of the box, use Bielik-4.5B-v3-Instruct.

📄 Technical report: https://arxiv.org/abs/2505.02550

🚀 Quick Start

The model can be loaded with the AutoModelForCausalLM class from the transformers library.

Basic Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "speakleash/Bielik-4.5B-v3"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

Advanced Usage

To reduce memory usage, you can load the model in lower precision (bfloat16).

import torch

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
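The saving from bfloat16 can be estimated with simple arithmetic. The sketch below is a rough estimate only: it assumes the default load stores weights as 32-bit floats, counts weights alone (no activations, KV cache, or framework overhead), and uses the 4.6 billion parameter count stated above. The helper name weight_memory_gb is illustrative, not part of any library.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 4.6e9  # parameter count stated in the model card

for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: ~{weight_memory_gb(NUM_PARAMS, dtype):.1f} GB")
```

Under these assumptions, bfloat16 halves the memory needed for the weights compared with float32.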

You can then use Hugging Face pipelines to generate text:

import transformers

text = "Najważniejszym celem człowieka na ziemi jest"

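# Build a text-generation pipeline from the model and tokenizer loaded above.
pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)
# Sample up to 100 new tokens, choosing each one from the 50 most likely candidates.
sequences = pipeline(text, max_new_tokens=100, do_sample=True, top_k=50, eos_token_id=tokenizer.eos_token_id)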
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Generated output:

Najważniejszym celem człowieka na ziemi jest życie w pokoju, harmonii i miłości. Dla każdego z nas bardzo ważne jest, aby otaczać się kochanymi osobami.
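The generation call above enables sampling with top_k=50, so the output differs between runs. The following is a minimal sketch of the general top-k sampling idea on a toy score vector. It is an illustration under simplifying assumptions, not the transformers implementation, and the function name top_k_sample and the toy scores are made up for this example.

```python
import math
import random


def top_k_sample(logits: list[float], k: int, rng: random.Random) -> int:
    """Sample one token index from the k highest-scoring logits."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits only (subtract the max for stability).
    highest = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - highest) for i in top]
    # Draw one index in proportion to its weight.
    return rng.choices(top, weights=weights, k=1)[0]


# Toy "vocabulary" of 5 tokens with made-up scores.
toy_logits = [2.0, 0.5, 1.0, -1.0, 3.0]
rng = random.Random(0)
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

With k=2, only the two highest-scoring indices can ever be drawn. A larger k admits more candidates and makes the generated text more varied.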

✨ Features

Unique Collaboration: A product of the partnership between SpeakLeash and ACK Cyfronet AGH.
Polish-Focused Training: Trained on high-quality Polish text corpora, enabling excellent Polish language processing.
Advanced Infrastructure Utilization: Leveraged Polish large-scale computing infrastructure and supercomputers for training.

📦 Installation

The original model card gives no installation steps. The Quick Start examples import the transformers and torch Python packages, so both need to be installed.

📚 Documentation

Model

Training Environment: The Bielik-4.5B-v3 model was trained on the Helios supercomputer at ACK Cyfronet AGH, using 256 NVIDIA GH200 cards.
Training Dataset: Composed of Polish texts from the SpeakLeash project and a subset of CommonCrawl data. 292 billion tokens were used for 1.2 epochs of training.
Training Framework: Trained with ALLaMo, an original open-source framework implemented by Krzysztof Ociepa, which allows fast and efficient training of language models similar to LLaMA and Mistral.

Model description:

Property	Details
Developed by	SpeakLeash & ACK Cyfronet AGH
Language	Polish
Model Type	causal decoder-only
Initialized from	Qwen2.5 3B
License	Apache 2.0 and Terms of Use

Quality evaluation

An XGBoost classification model was created to evaluate the quality of native Polish texts. It is based on 93 features, such as the ratio of out-of-vocabulary words to all words (OOVs), the number of nouns and verbs, and the average sentence length. For a given document it outputs a category (HIGH, MEDIUM, or LOW) along with a probability. This approach helps select high-quality texts for training.
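The selection step can be sketched as follows. This is a minimal illustration, not the SpeakLeash pipeline: it assumes the classifier outputs are already available as (category, probability) pairs, and it applies the HIGH category and 90% confidence threshold stated under Model Features. The QualityPrediction class, the function names, and the sample data are all hypothetical, and oov_ratio shows only one simplified way to compute one of the 93 features.

```python
from dataclasses import dataclass


@dataclass
class QualityPrediction:
    """Hypothetical container for one classifier output."""
    doc_id: str
    category: str       # "HIGH", "MEDIUM", or "LOW"
    probability: float  # classifier probability for that category


def oov_ratio(text: str, vocabulary: set[str]) -> float:
    """Ratio of out-of-vocabulary words to all words in the text."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w not in vocabulary for w in words) / len(words)


def keep_for_training(pred: QualityPrediction, min_probability: float = 0.9) -> bool:
    """Keep only documents rated HIGH with probability above the threshold."""
    return pred.category == "HIGH" and pred.probability > min_probability


# Made-up predictions for illustration only.
predictions = [
    QualityPrediction("doc-1", "HIGH", 0.97),
    QualityPrediction("doc-2", "HIGH", 0.72),
    QualityPrediction("doc-3", "MEDIUM", 0.95),
]
kept = [p.doc_id for p in predictions if keep_for_training(p)]
print(kept)
print(oov_ratio("Ala ma kota xyzzy.", {"ala", "ma", "kota"}))
```

Only the first document passes both conditions here. A HIGH label with low confidence and a confident MEDIUM label are both rejected.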

🔧 Technical Details

Training was conducted on the Helios supercomputer at ACK Cyfronet AGH with 256 NVIDIA GH200 cards. The training dataset consisted of carefully selected and processed Polish texts from the SpeakLeash project and a subset of CommonCrawl. Training used the ALLaMo framework, which is designed for efficient training of language models with architectures similar to LLaMA and Mistral.

📄 License

The model is licensed under Apache 2.0 and Terms of Use.

Limitations and Biases

⚠️ Important Note

Bielik-4.5B-v3 is not intended for deployment without fine-tuning. It should not be used for human-facing interactions without further guardrails and user consent. It can produce factually incorrect output and may generate lewd, false, biased, or offensive outputs due to the nature of the training data.

Citation

Please cite this model using the following format:

@misc{ociepa2025bielikv3smalltechnical,
      title={Bielik v3 Small: Technical Report}, 
      author={Krzysztof Ociepa and Łukasz Flis and Remigiusz Kinas and Krzysztof Wróbel and Adrian Gwoździej},
      year={2025},
      eprint={2505.02550},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.02550}, 
}

@misc{Bielik11Bv2b,
    title     = {Bielik-4.5B-v3 model card},
    author    = {Ociepa, Krzysztof and Flis, Łukasz and Wróbel, Krzysztof and Gwoździej, Adrian and {SpeakLeash Team} and {Cyfronet Team}},
    year      = {2025},
    url       = {https://huggingface.co/speakleash/Bielik-4.5B-v3},
    note      = {Accessed: 2025-05-06},
    urldate   = {2025-05-06}
}

Responsible for training the model

Krzysztof Ociepa (SpeakLeash): Team leadership, conceptualizing, data preparation, process optimization, and oversight of training.
Łukasz Flis (Cyfronet AGH): Coordinating and supervising the training.
Remigiusz Kinas (SpeakLeash): Conceptualizing, coordinating RL trainings, data preparation, benchmarking, and quantizations.
Adrian Gwoździej (SpeakLeash): Data preparation and ensuring data quality.
Krzysztof Wróbel (SpeakLeash): Benchmarks.

The model couldn't have been created without the entire SpeakLeash team. Other contributors include Sebastian Kondracki, Igor Ciuciura, Szymon Baczyński, Jacek Chwiła, Dominika Basaj, Kuba Sołtys, Karol Jezierski, Anna Przybył, Agnieszka Ratajska, Witold Wydmański, Izabela Babis, Nina Babis.

Members of the ACK Cyfronet AGH team who provided support and expertise are Szymon Mazurek, Marek Magryś, and Mieszko Cholewa.

We thank the Polish high-performance computing infrastructure PLGrid (HPC Center: ACK Cyfronet AGH) for providing computer facilities and support through computational grants PLG/2024/017214 and PLG/2025/018338.

Contact Us

💡 Usage Tip

If you have any questions or suggestions, please use the discussion tab. If you want to contact us directly, join our Discord SpeakLeash.
