Bielik 11B V2

Developed by SpeakLeash
Bielik-11B-v2 is a generative text model with 11 billion parameters, developed and trained specifically for the Polish language. It is initialized from Mistral-7B-v0.2 and trained on 400 billion tokens.
Downloads: 690
Release date: 8/26/2024

Model Overview

This model is the result of a collaboration between the open-source scientific project SpeakLeash and the high-performance computing center ACK Cyfronet AGH. It demonstrates exceptional Polish language understanding and processing capabilities, accurately responding to and efficiently completing various language tasks.

Model Features

Large-scale training
Initialized from its predecessor Mistral-7B-v0.2 and trained on 400 billion tokens. The training data includes Polish texts collected by the SpeakLeash project and subsets of CommonCrawl.
High-quality data
Polish text quality was evaluated with an XGBoost classifier; only texts labeled HIGH quality with a predicted probability above 90% were selected, ensuring a refined, high-quality training corpus.
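The quality-filtering step described above can be sketched as a simple probability threshold over classifier scores. This is a minimal illustrative sketch: `StubClassifier` and `select_high_quality` are assumed names standing in for SpeakLeash's actual (non-public) XGBoost pipeline.

```python
# Minimal sketch of probability-threshold quality filtering.
# StubClassifier stands in for the real XGBoost model; its heuristic
# (longer text = higher quality score) is purely illustrative.

class StubClassifier:
    classes_ = ("HIGH", "LOW")

    def predict_proba(self, texts):
        # Return [P(HIGH), P(LOW)] per text; placeholder scoring only.
        probs = []
        for text in texts:
            p_high = min(len(text) / 100.0, 1.0)
            probs.append((p_high, 1.0 - p_high))
        return probs


def select_high_quality(texts, classifier, threshold=0.90):
    """Keep texts the classifier labels HIGH with probability above `threshold`."""
    high_idx = classifier.classes_.index("HIGH")
    kept = []
    for text, probs in zip(texts, classifier.predict_proba(texts)):
        if probs[high_idx] > threshold:
            kept.append(text)
    return kept


corpus = ["a" * 95, "short snippet", "b" * 120]
clean = select_high_quality(corpus, StubClassifier())
# "short snippet" falls below the 90% bar and is dropped
```

With a real trained classifier in place of the stub, the same threshold logic yields the HIGH-probability subset used for training.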
High-performance computing
Training was completed on the Helios supercomputer at ACK Cyfronet AGH, using 256 NVIDIA GH200 GPUs and leveraging the large-scale computing infrastructure of the Polish PLGrid environment.

Model Capabilities

Polish text generation
Polish language understanding and processing
Language task response

Use Cases

Language processing
Text generation
Generates Polish texts such as articles and stories, and can accurately handle a wide range of language tasks.
Sentiment analysis
Analyzes the sentiment of Polish texts; the model performs strongly on the Open PL LLM Leaderboard.