🚀 CSMPT7b: A Large Czech Language Model
CSMPT7b is a large Czech language model continuously pretrained from the English MPT7b model for 272 billion training tokens. Its training data comes from the approximately 67-billion-token Large Czech Collection, tokenized with a Czech tokenizer obtained through our vocabulary swap method (see below). Training was conducted on the Karolina cluster.
🚀 Quick Start
How to Setup Environment
pip install transformers==4.37.2 torch==2.1.2 einops==0.7.0
# be sure to install the right flash-attn wheel; we use torch compiled with CUDA 12.1, no ABI, Python 3.9, Linux x86_64 architecture
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.3/flash_attn-2.5.3+cu122torch2.1cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
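If you want to confirm that the wheel matches your environment before running the model, a quick import test is enough. This check is not part of the original instructions; it only verifies the installation.

# Quick sanity check (not from the original setup): if this import fails,
# the flash-attn wheel does not match your Python/torch/CUDA combination.
import torch
import flash_attn

print(torch.__version__, torch.version.cuda)  # expect torch 2.1.x built against CUDA 12.x
print(flash_attn.__version__)                 # expect 2.5.3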
Running the Code
import torch
import transformers
from transformers import pipeline

name = 'BUT-FIT/csmpt7b'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.init_device = 'cuda:0'  # For fast initialization directly on GPU!

model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.bfloat16,  # Load model weights in bfloat16
    trust_remote_code=True
)

tokenizer = transformers.AutoTokenizer.from_pretrained(name, trust_remote_code=True)

pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')

with torch.autocast('cuda', dtype=torch.bfloat16):
    print(
        pipe('Nejznámějším českým spisovatelem ',
             max_new_tokens=100,
             top_p=0.95,
             repetition_penalty=1.0,
             do_sample=True,
             use_cache=True))
✨ Features
- Continuously pretrained on a large Czech corpus from an English base model.
- Utilizes a vocabulary swap method to transfer knowledge from English to Czech.
- Evaluated on the CS-HellaSwag benchmark.
📦 Installation
The installation steps are included in the "Quick Start" section above.
💻 Usage Examples
Basic Usage
The basic usage example is provided in the "Quick Start" section's "Running the Code" part.
📚 Documentation
BUT LM Model Roster
Latest Updates
- 01/10/2024: We released BenCzechMark, the first Czech evaluation suite for fair open-weights model comparison.
- 06/05/2024: We released a small manually annotated dataset of adult content. We used a classifier trained on this dataset for filtering our corpus.
- 18/04/2024: We released all our training checkpoints (in MosaicML format & packed using ZPAQ) at czechllm.fit.vutbr.cz/csmpt7b/checkpoints/.
Evaluation
Dev-set evaluation on CS-HellaSwag (an automatically translated HellaSwag benchmark).
Model | CS-HellaSwag Accuracy |
---|---|
mistral7b | 0.4992 |
csmpt@130k steps [released] | 0.5004 |
csmpt@100k steps | 0.4959 |
csmpt@75k steps | 0.4895 |
csmpt@50k steps | 0.4755 |
csmpt@26.5k steps | 0.4524 |
However, we ran validation on CS-HellaSwag over the course of training, and after 100k steps the improvements, if any, were very noisy. The improvement over mistral7b is not significant.
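For context, a HellaSwag-style accuracy such as the numbers above is typically obtained by scoring each candidate ending with the model's log-likelihood and picking the best one. The sketch below illustrates this scoring; it is a simplified assumption, not the exact evaluation harness we used, and it glosses over tokenization-boundary details.

# Minimal sketch of HellaSwag-style multiple-choice scoring.
# NOTE: illustrative only; not the exact evaluation pipeline behind the table above.
import torch

@torch.no_grad()
def choice_logprob(model, tokenizer, context, ending, device='cuda:0'):
    """Sum of log-probabilities of the ending tokens, conditioned on the context."""
    ctx_ids = tokenizer(context, return_tensors='pt').input_ids.to(device)
    full_ids = tokenizer(context + ending, return_tensors='pt').input_ids.to(device)
    logits = model(full_ids).logits                       # (1, seq_len, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)  # predictions for tokens 1..L-1
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Assumes the ending starts exactly at the context length (ignores merge effects
    # at the context/ending boundary, which a real harness has to handle).
    n_ctx = ctx_ids.shape[1]
    return token_lp[:, n_ctx - 1:].sum().item()

def predict(model, tokenizer, context, endings):
    # Pick the ending with the highest total log-probability.
    scores = [choice_logprob(model, tokenizer, context, e) for e in endings]
    return max(range(len(endings)), key=lambda i: scores[i])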
Loss
We encountered loss spikes during training. As the model always recovered, and our budget for training a 7B model was very constrained, we kept on training. We had observed such loss spikes before in our ablations with GPT-2 small, where we found them to be:
- (a) influenced by the learning rate: the lower the learning rate, the less often they appear; as it gets higher, they start to appear, and with too high a learning rate the training might diverge on such a loss spike;
- (b) appearing only for continuously pretrained models (in preliminary ablations). While we do not know why they appear, we hypothesize this might be linked to the theory of Adam instability arising from time-domain correlation of update vectors. However, such instabilities were previously observed only for much larger models (larger than 65B parameters).
Corpora
The model was trained on 3 corpora, which were hot-swapped during the training. These were collected/filtered during the course of training.
- Corpus #1 was the same we used for our Czech GPT-2 training (15,621,685,248 tokens).
- Corpus #2 contained 67,981,934,592 tokens, coming mostly from the HPLT and CulturaX corpora.
- Corpus #3 (66,035,515,392 tokens) is Corpus #2 after removing a portion of inappropriate content (which had evaded our other checks) with a linear classifier; an illustrative sketch follows below.
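The Corpus #3 filtering step can be reproduced in spirit with any linear classifier over simple text features. The sketch below uses scikit-learn TF-IDF features and logistic regression as an illustrative assumption; it is not necessarily the exact feature set, classifier, or threshold we trained on the released adult-content dataset.

# Minimal sketch of filtering documents with a linear classifier.
# The feature extractor and threshold are illustrative assumptions;
# the released adult-content dataset can serve as training data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_filter(texts, labels):
    # labels: 1 = inappropriate, 0 = ok
    clf = make_pipeline(
        TfidfVectorizer(max_features=200_000, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(texts, labels)
    return clf

def keep_document(clf, text, threshold=0.5):
    # Keep the document only if the predicted probability of being
    # inappropriate stays below the threshold.
    return clf.predict_proba([text])[0, 1] < threshold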
Figure 1: Training loss.
Figure 2: Training loss closeup. We mark the two hot-swap points, where training corpus #1 was switched for corpus #2, and corpus #2 for corpus #3 (internally, corpus #2.1), respectively. The flat region between 112k and 119.5k steps is caused by missing data; due to an accident, we lost these logs.
In Figure 3 (but also marked in Figure 2), we perform two ablations:
- (a) After the first hot swap, we continued training on corpus #1 for a while. Result: the test loss stays slightly better when training continues on corpus #1, which indicates a slight difference between the distributions of corpus #1 and corpus #2.
- (b) At step 94,000 the training loss stopped decreasing, then increased, and only around step 120,000 (near hot swap #2) started decreasing again. To ablate whether this was an effect of the hot swap, we resumed training from step 93,000 using corpus #3, with the optimizer states reinitialized. Result: neither corpus #3 nor optimizer-state reinitialization seems to mitigate the local divergence at step 94,000.
Figure 3: Test loss closeup, testing performed on a split of internal-corpus #1. See Figure 2 description for ablation explanation.
Training Method
Vocabulary Swap
To transfer knowledge from the English model to Czech, we developed a simple method that (i) aligns several tokens between two vocabularies and (ii) copies the embeddings from the original language to the new language.
Figure 4: Test perplexity over the course of training for the vocabulary swap (swapping 1.7K tokens) method on TinyLLAMA. Our method (green curve) vs TinyLLAMA training from scratch (blue curve).
We also verified that fine-tuning from English to Czech is beneficial for the MPT-7B model compared to training a new model from scratch, at least over the first 10K steps. The training also seems to be more stable (notice the yellow spike around 10k steps in Figure 5).
Figure 5: Test cross-entropy over the course of training on CSMPT7B (yellow-red). Comparison with TinyLLAMA (blue-green). Our method (red&green curve) vs training from scratch (yellow&blue curve).
The vocabulary swap was done in the same way as for our Czech-GPT-2 model (see that model card for a comprehensive description). For CSMPT7b, we managed to align 4,177 English tokens with corresponding Czech tokens.
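In code, the swap boils down to finding tokens whose string form exists in both vocabularies and copying the corresponding embedding rows. The sketch below is a simplified illustration of this idea, not the actual procedure; refer to the Czech-GPT-2 write-up for the full details.

# Minimal sketch of the vocabulary-swap initialization:
# (i) align tokens that exist in both tokenizers, (ii) copy their embeddings.
# Simplified for illustration; see the Czech-GPT-2 write-up for the real procedure.
import torch

def swap_vocab(model, src_tokenizer, tgt_tokenizer):
    src_vocab = src_tokenizer.get_vocab()        # token string -> old id
    tgt_vocab = tgt_tokenizer.get_vocab()        # token string -> new id
    old_emb = model.get_input_embeddings().weight.data.clone()

    # Resize to the new (64k Czech) vocabulary; unmatched rows keep their
    # freshly initialized values.
    model.resize_token_embeddings(len(tgt_vocab))
    new_emb = model.get_input_embeddings().weight.data

    shared = 0
    for tok, new_id in tgt_vocab.items():
        old_id = src_vocab.get(tok)
        if old_id is not None:
            new_emb[new_id] = old_emb[old_id]    # copy embedding of the aligned token
            shared += 1
    return shared  # e.g. 4,177 aligned tokens in the CSMPT7b case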
Hyperparameters
Hyperparameters not mentioned here were kept the same as for MPT.
Name | Value | Note |
---|---|---|
training sw | llm-foundry | We've done some minor patching (e.g., to allow DDP sync over file) |
dataset_type | Concat | Sequences at the model's input were concatenated up to max_seq_len and separated by the EOS token (see the packing sketch below the table). |
tokenizer_size | 64k | Same as in Czech-GPT-2 |
max_seq_len | 2048 | |
batch_size | 1024 | |
learning_rate | 1.0e-4 | |
optimizer | LionW | |
optimizer_betas | 0.9/0.95 | |
optimizer_weight_decay | 0 | |
optimizer_eps | 1.0e-08 | |
gradient_clipping_max_norm | 1.0 | |
attn_impl | flash2 | We used the Triton flash-attn 1 implementation for the initial ~60k steps |
positional_encoding | alibi | |
fsdp | FULL_SHARD | (We had implementation issues with hybrid sharding in llm-foundry) |
precision | bf16 | |
scheduler | cosine | |
scheduler_warmup | 100 steps | |
scheduler_steps | 170,000 | |
scheduler_alpha | 0.1 | So the LR on the last step is 0.1*(vanilla LR) |
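As referenced in the dataset_type row above, the Concat dataset type packs tokenized documents into fixed-length training sequences separated by the EOS token. The sketch below illustrates this packing in plain Python; it is an assumption for clarity, not the llm-foundry implementation.

# Minimal sketch of "Concat"-style packing: tokenized documents are joined,
# separated by the EOS token, and cut into fixed-length training sequences.
# Illustrative only; llm-foundry has its own (streaming) implementation.
def pack_sequences(documents, tokenizer, max_seq_len=2048):
    buffer = []
    for doc in documents:
        buffer.extend(tokenizer(doc).input_ids)
        buffer.append(tokenizer.eos_token_id)   # EOS separates documents
        while len(buffer) >= max_seq_len:
            yield buffer[:max_seq_len]          # one training example
            buffer = buffer[max_seq_len:]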
Training Data
We release most (95.79%) of our training data corpus as the BUT-Large Czech Collection.
Our Release Plan
Stage | Description | Date |
---|---|---|
1 | 'Best' model + training data | 13.03.2024 |
2 | All checkpoints + training code | 10.04.2024. Checkpoints are released; the training code won't be released. We used llm-foundry with slight adjustments, but that version is now outdated.
3 | BenCzechMark, a collection of Czech datasets for few-shot LLM evaluation. Get in touch if you want to contribute! | 01.10.2024
4 | Preprint publication | 23.12.2024, preprint available
Getting in Touch
For further questions, email martin.fajcik@vut.cz.
Disclaimer
This is a probabilistic model; its outputs are stochastic. The authors are not responsible for the model outputs. Use at your own risk.
Acknowledgement
This work was supported by the NAKI III program of the Ministry of Culture of the Czech Republic, project semANT "Sémantický průzkumník textového kulturního dědictví" (Semantic Explorer of Textual Cultural Heritage), grant no. DH23P03OVV060, and by the Ministry of Education, Youth and Sports of the Czech Republic through e-INFRA CZ (ID: 90254).
Citation
@article{benczechmark,
  author        = {Martin Fajčík and Martin Dočekal and Jan Doležal and Karel Beneš and Michal Hradiš},
  title         = {BenCzechMark: Machine Language Understanding Benchmark for Czech Language},
  journal       = {arXiv preprint arXiv:insert-arxiv-number-here},
  year          = {2024},
  eprint        = {insert-arxiv-number-here},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
}
📄 License
This project is licensed under the Apache-2.0 license.

