🚀 polka-1.1b
polka-1.1b enhances the TinyLlama-1.1B model by continuing pretraining on an additional 5.7 billion Polish tokens, sourced mainly from the MADLAD-400 dataset. Tokens were sampled at a 10:1 ratio between Polish and English shards using DSIR. Moreover, Polka extends the TinyLlama tokenizer's vocabulary to 43,882 tokens, improving its efficiency for generating Polish text.
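To illustrate what the extended vocabulary buys in practice, the minimal sketch below compares how many tokens the base TinyLlama tokenizer and the polka tokenizer need for the same Polish sentence. The sample sentence and the choice of base TinyLlama checkpoint are assumptions made for this example.

```python
from transformers import AutoTokenizer

# Sample Polish sentence: "Language models are getting better and better at Polish."
text = "Modele językowe coraz lepiej radzą sobie z językiem polskim."

# One public TinyLlama-1.1B checkpoint, assumed here as the comparison baseline.
base = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
polka = AutoTokenizer.from_pretrained("eryk-mazus/polka-1.1b")

print(len(base.tokenize(text)))   # token count with the original 32,000-token vocabulary
print(len(polka.tokenize(text)))  # token count with the extended 43,882-token vocabulary
```

Fewer tokens per sentence means more Polish text fits into the context window and generation needs fewer decoding steps.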

🚀 Training
The training of polka-1.1b took 680 GPU hours on a single 8 x RTX 4090 machine with DeepSpeed ZeRO-2.
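The exact training configuration is not reproduced here. As a rough orientation only, the snippet below is a minimal sketch of what a ZeRO-2 setup might look like when wired into the Hugging Face Trainer; all values (precision, batch sizes, output path) are illustrative assumptions, not the settings actually used for polka-1.1b.

```python
from transformers import TrainingArguments

# Illustrative DeepSpeed ZeRO-2 configuration (NOT the actual polka-1.1b setup).
ds_config = {
    "zero_optimization": {
        "stage": 2,                  # shard optimizer states and gradients across GPUs
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": True},       # assumed precision; RTX 4090 supports bfloat16
    "train_micro_batch_size_per_gpu": "auto",   # filled in from TrainingArguments
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="./polka-1.1b-pretraining",   # hypothetical path
    per_device_train_batch_size=4,           # assumed value
    gradient_accumulation_steps=8,           # assumed value
    bf16=True,
    deepspeed=ds_config,                     # accepts a dict or a path to a JSON config
)
```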
✨ Features
- Enhanced with Polish Tokens: Continued pretraining on 5.7 billion Polish tokens improves performance in Polish text generation.
- Extended Vocabulary: The tokenizer's vocabulary is extended to 43,882 tokens, enhancing efficiency for Polish text.
💡 Notes
This base model was initially developed as the foundation for instruction tuning, which led to polka-1.1b-chat. However, I'm sharing it with the community because it combines relatively good performance with an efficient bilingual tokenizer.
⚠️ Important Note
The model can produce coherent Polish text, but due to its small size, it is prone to hallucinations.
📊 Evaluation
PolEval-2018

| Model | Perplexity |
|---|---|
| meta-llama/Llama-2-7b-hf | 24.3 |
| meta-llama/Llama-2-13b-hf | 21.4 |
| mistralai/Mistral-7B-v0.1 | 21.4 |
| TinyLlama/TinyLlama-1.1B | 40.4 |
| sdadas/polish-gpt2-small | 134.4 |
| sdadas/polish-gpt2-medium | 100.8 |
| sdadas/polish-gpt2-large | 93.2 |
| sdadas/polish-gpt2-xl | 94.1 |
| Azurro/APT3-275M-Base | 129.8 |
| Azurro/APT3-500M-Base | 153.1 |
| Azurro/APT3-1B-Base | 106.8 |
| eryk-mazus/polka-1.1b | 18.1 |
| szymonrucinski/Curie-7B-v1 | 13.5 |
| OPI-PG/Qra-1b | 14.7 |
Long documents (2024)
Current LLMs support contexts of thousands of tokens, and their practical applications often involve processing long documents, so evaluating perplexity on a sentence-based dataset like PolEval-2018 may not be very meaningful. Moreover, the PolEval corpus has been publicly available on the internet for several years, which may have contaminated the training sets of some models. We therefore prepared a new collection of long papers published in 2024 to more reliably test the models' perplexities on knowledge unavailable during training. The corpus consists of 5,000 documents with token counts ranging from several hundred to about 20,000. Half are press texts from Polish news portals published in February 2024, and the other half are scientific articles published since January 2024. Most documents exceed the context size of the evaluated models. To calculate perplexity, we divided them into chunks of the model's context length with a stride of 512 tokens, following this example.
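The chunking scheme above matches the common sliding-window recipe for perplexity evaluation (as in the Hugging Face perplexity guide). Below is a minimal sketch of that procedure for a single document; the 2048-token context length and the masking details are assumptions made for illustration and may differ from the exact evaluation code.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of stride-based perplexity over a single long document.
model_name = "eryk-mazus/polka-1.1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
model.eval()

document = "..."  # one long Polish document goes here
encodings = tokenizer(document, return_tensors="pt")

max_length = 2048  # assumed context length of polka-1.1b (see the table below)
stride = 512
seq_len = encodings.input_ids.size(1)

nll_sum = 0.0
n_tokens = 0
prev_end = 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    input_ids = encodings.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    trg_len = end - prev_end          # tokens not scored by the previous window
    target_ids[:, :-trg_len] = -100   # mask the overlapping context

    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss  # mean NLL over unmasked targets

    n_scored = int((target_ids[:, 1:] != -100).sum())    # targets actually scored after the label shift
    nll_sum += loss.item() * n_scored
    n_tokens += n_scored

    prev_end = end
    if end == seq_len:
        break

print(math.exp(nll_sum / n_tokens))  # perplexity of the document
```

Per-window losses are weighted by the number of newly scored tokens so that the overlapping context is not counted twice.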
| Model | Context (tokens) | Perplexity |
|---|---|---|
| meta-llama/Llama-2-7b-hf | 4096 | 5.9 |
| meta-llama/Llama-2-13b-hf | 4096 | 5.3 |
| mistralai/Mistral-7B-v0.1 | 4096 | 4.9 |
| TinyLlama/TinyLlama-1.1B | 2048 | 9.6 |
| sdadas/polish-gpt2-small | 2048 | 27.3 |
| sdadas/polish-gpt2-medium | 2048 | 20.3 |
| sdadas/polish-gpt2-large | 1536 | 18.0 |
| sdadas/polish-gpt2-xl | 1536 | 16.6 |
| Azurro/APT3-275M-Base | 2048 | 77.0 |
| Azurro/APT3-500M-Base | 2048 | 50.5 |
| Azurro/APT3-1B-Base | 2048 | 19.1 |
| eryk-mazus/polka-1.1b | 2048 | 6.9 |
| szymonrucinski/Curie-7B-v1 | 4096 | 4.8 |
| OPI-PG/Qra-1b | 4096 | 6.1 |
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "eryk-mazus/polka-1.1b"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

# 8-bit loading requires the bitsandbytes package.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

prompt = """Przykładowe zapytanie do modelu"""  # "An example query to the model"

model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

with torch.no_grad():
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512,
        do_sample=True,
        penalty_alpha=0.6,
        top_k=5,
    )

output = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(output)
```