🚀 polka-1.1b
polka-1.1b enhances the TinyLlama-1.1B model by continuing pretraining on an additional 5.7 billion Polish tokens, sourced mainly from the MADLAD-400 dataset. Tokens were sampled at a 10:1 ratio between Polish and English shards using DSIR. Moreover, Polka extends the TinyLlama tokenizer's vocabulary to 43,882 tokens, improving its efficiency for generating Polish text.
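To illustrate what the extended vocabulary buys in practice, the minimal sketch below compares how many tokens the base TinyLlama tokenizer and the polka tokenizer need for the same Polish sentence. The sample sentence and the choice of base TinyLlama checkpoint are assumptions made for this example.

```python
from transformers import AutoTokenizer

# Sample Polish sentence: "Language models are getting better and better at Polish."
text = "Modele językowe coraz lepiej radzą sobie z językiem polskim."

# One public TinyLlama-1.1B checkpoint, assumed here as the comparison baseline.
base = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
polka = AutoTokenizer.from_pretrained("eryk-mazus/polka-1.1b")

print(len(base.tokenize(text)))   # token count with the original 32,000-token vocabulary
print(len(polka.tokenize(text)))  # token count with the extended 43,882-token vocabulary
```

Fewer tokens per sentence means more Polish text fits into the context window and generation needs fewer decoding steps.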

🚀 Training
The training of polka-1.1b took 680 GPU hours on a single 8 x RTX 4090 machine with DeepSpeed ZeRO-2.
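The exact training configuration is not reproduced here. As a rough orientation only, the snippet below is a minimal sketch of what a ZeRO-2 setup might look like when wired into the Hugging Face Trainer; all values (precision, batch sizes, output path) are illustrative assumptions, not the settings actually used for polka-1.1b.

```python
from transformers import TrainingArguments

# Illustrative DeepSpeed ZeRO-2 configuration (NOT the actual polka-1.1b setup).
ds_config = {
    "zero_optimization": {
        "stage": 2,                  # shard optimizer states and gradients across GPUs
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": True},       # assumed precision; RTX 4090 supports bfloat16
    "train_micro_batch_size_per_gpu": "auto",   # filled in from TrainingArguments
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="./polka-1.1b-pretraining",   # hypothetical path
    per_device_train_batch_size=4,           # assumed value
    gradient_accumulation_steps=8,           # assumed value
    bf16=True,
    deepspeed=ds_config,                     # accepts a dict or a path to a JSON config
)
```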
✨ Features
- Enhanced with Polish Tokens: Continued pretraining on 5.7 billion Polish tokens improves performance in Polish text generation.
- Extended Vocabulary: The tokenizer's vocabulary is extended to 43,882 tokens, enhancing efficiency for Polish text.
💡 Notes
This base model was initially developed as the foundation for instruction tuning, which led to polka-1.1b-chat. However, I'm sharing it with the community because it combines relatively good performance with an efficient bilingual tokenizer.
⚠️ Important Note
The model can produce coherent Polish text, but due to its small size, it is prone to hallucinations.
📊 Evaluation
PolEval-2018

| Model | Perplexity |
|---|---|
| meta-llama/Llama-2-7b-hf | 24.3 |
| meta-llama/Llama-2-13b-hf | 21.4 |
| mistralai/Mistral-7B-v0.1 | 21.4 |
| TinyLlama/TinyLlama-1.1B | 40.4 |
| sdadas/polish-gpt2-small | 134.4 |
| sdadas/polish-gpt2-medium | 100.8 |
| sdadas/polish-gpt2-large | 93.2 |
| sdadas/polish-gpt2-xl | 94.1 |
| Azurro/APT3-275M-Base | 129.8 |
| Azurro/APT3-500M-Base | 153.1 |
| Azurro/APT3-1B-Base | 106.8 |
| eryk-mazus/polka-1.1b | 18.1 |
| szymonrucinski/Curie-7B-v1 | 13.5 |
| OPI-PG/Qra-1b | 14.7 |
Long documents (2024)
Current LLMs support contexts of thousands of tokens, and their practical applications often involve processing long documents, so evaluating perplexity on a sentence-based dataset like PolEval-2018 may not be very meaningful. Moreover, the PolEval corpus has been publicly available on the internet for several years, which may have contaminated the training sets of some models. We therefore prepared a new collection of long papers published in 2024 to more reliably test the models' perplexities on knowledge unavailable during training. The corpus consists of 5,000 documents with token counts ranging from several hundred to about 20,000. Half are press texts from Polish news portals published in February 2024, and the other half are scientific articles published since January 2024. Most documents exceed the context size of the evaluated models. To calculate perplexity, we divided them into chunks of the model's context length with a stride of 512 tokens, following this example.
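The chunking scheme above matches the common sliding-window recipe for perplexity evaluation (as in the Hugging Face perplexity guide). Below is a minimal sketch of that procedure for a single document; the 2048-token context length and the masking details are assumptions made for illustration and may differ from the exact evaluation code.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of stride-based perplexity over a single long document.
model_name = "eryk-mazus/polka-1.1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
model.eval()

document = "..."  # one long Polish document goes here
encodings = tokenizer(document, return_tensors="pt")

max_length = 2048  # assumed context length of polka-1.1b (see the table below)
stride = 512
seq_len = encodings.input_ids.size(1)

nll_sum = 0.0
n_tokens = 0
prev_end = 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    input_ids = encodings.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    trg_len = end - prev_end          # tokens not scored by the previous window
    target_ids[:, :-trg_len] = -100   # mask the overlapping context

    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss  # mean NLL over unmasked targets

    n_scored = int((target_ids[:, 1:] != -100).sum())    # targets actually scored after the label shift
    nll_sum += loss.item() * n_scored
    n_tokens += n_scored

    prev_end = end
    if end == seq_len:
        break

print(math.exp(nll_sum / n_tokens))  # perplexity of the document
```

Per-window losses are weighted by the number of newly scored tokens so that the overlapping context is not counted twice.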
| Model | Context (tokens) | Perplexity |
|---|---|---|
| meta-llama/Llama-2-7b-hf | 4096 | 5.9 |
| meta-llama/Llama-2-13b-hf | 4096 | 5.3 |
| mistralai/Mistral-7B-v0.1 | 4096 | 4.9 |
| TinyLlama/TinyLlama-1.1B | 2048 | 9.6 |
| sdadas/polish-gpt2-small | 2048 | 27.3 |
| sdadas/polish-gpt2-medium | 2048 | 20.3 |
| sdadas/polish-gpt2-large | 1536 | 18.0 |
| sdadas/polish-gpt2-xl | 1536 | 16.6 |
| Azurro/APT3-275M-Base | 2048 | 77.0 |
| Azurro/APT3-500M-Base | 2048 | 50.5 |
| Azurro/APT3-1B-Base | 2048 | 19.1 |
| eryk-mazus/polka-1.1b | 2048 | 6.9 |
| szymonrucinski/Curie-7B-v1 | 4096 | 4.8 |
| OPI-PG/Qra-1b | 4096 | 6.1 |
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "eryk-mazus/polka-1.1b"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

# 8-bit loading requires the bitsandbytes package.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

prompt = """Przykładowe zapytanie do modelu"""  # "An example query to the model"

model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

with torch.no_grad():
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512,
        do_sample=True,
        penalty_alpha=0.6,
        top_k=5,
    )

output = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(output)
```