
CSMPT7b

Developed by BUT-FIT
A large Czech language model continually pre-trained from the English MPT7b model for 272 billion training tokens, using a Czech tokenizer obtained via vocabulary swapping; the underlying large-scale Czech corpus contains approximately 67 billion tokens
Downloads: 234
Release Time: 3/11/2024

Model Overview

CSMPT7b is a large Czech language model created with a vocabulary-swap (lexical substitution) method and trained on the Karolina supercomputing cluster; it is intended primarily for Czech text-generation tasks.

Model Features

Lexical substitution (vocabulary swap)
Knowledge is transferred by aligning the English and Czech vocabularies and copying the embedding vectors of shared tokens, which significantly outperforms training from scratch (see the sketch after this list)
Large-scale Czech language training
Pre-trained on a large-scale Czech corpus of approximately 67 billion tokens
Dynamic corpus switching
Training switches dynamically among three different corpora, including raw and filtered variants; a minimal schedule sketch follows the vocabulary-swap example below
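
As a rough illustration of the vocabulary-swap idea, the sketch below builds a Czech embedding matrix from an English one: tokens present in both vocabularies keep their trained vector, and the remaining rows are randomly initialized. The function and toy vocabularies are hypothetical; the authors' actual alignment procedure may differ in detail.

```python
import torch

def swap_vocabulary(src_emb, src_vocab, tgt_vocab, std=0.02):
    """Build a target-vocabulary embedding matrix from a source one.

    src_emb:   (|src_vocab|, dim) tensor of source (English) embeddings
    src_vocab: dict mapping token string -> source row index
    tgt_vocab: dict mapping token string -> target row index
    Shared tokens keep their trained source vector; all other
    target rows are drawn from N(0, std^2).
    """
    dim = src_emb.shape[1]
    tgt_emb = torch.empty(len(tgt_vocab), dim).normal_(mean=0.0, std=std)
    copied = 0
    for token, tgt_id in tgt_vocab.items():
        src_id = src_vocab.get(token)
        if src_id is not None:
            tgt_emb[tgt_id] = src_emb[src_id]  # copy vector for shared token
            copied += 1
    return tgt_emb, copied

# Toy example: 2-dimensional embeddings with one shared token, "ahoj".
src_vocab = {"hello": 0, "ahoj": 1}
tgt_vocab = {"ahoj": 0, "svete": 1}
src_emb = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
tgt_emb, copied = swap_vocabulary(src_emb, src_vocab, tgt_vocab)
print(copied)  # 1 -- only "ahoj" is shared
```

In practice the source matrix would come from the English MPT7b checkpoint and the vocabularies from the respective tokenizers (e.g. tokenizer.get_vocab() in transformers).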
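The corpus-switching behaviour can be pictured as a step-indexed schedule that tells the trainer which corpus to sample from. The corpus names and switch points below are invented for illustration; the listing does not document the real schedule.

```python
# Hypothetical schedule: (first training step, corpus name).
SWITCH_POINTS = [
    (0, "czech_raw"),              # original, unfiltered corpus
    (100_000, "czech_filtered"),   # quality-filtered corpus
    (200_000, "czech_refiltered"), # more aggressively filtered corpus
]

def active_corpus(step: int) -> str:
    """Return the name of the corpus to sample from at a given step."""
    name = SWITCH_POINTS[0][1]
    for start, corpus in SWITCH_POINTS:
        if step >= start:
            name = corpus
    return name

assert active_corpus(50_000) == "czech_raw"
assert active_corpus(150_000) == "czech_filtered"
```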

Model Capabilities

Czech text generation
Language understanding

Use Cases

Text generation
Czech content creation
Generating Czech articles, stories, and other textual content; a loading and generation sketch follows
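
For Czech generation, the model can presumably be loaded through the Hugging Face transformers library as sketched below, assuming the checkpoint is published under BUT-FIT/csmpt7b. Because it derives from MPT7b, loading likely requires trust_remote_code=True; the prompt and sampling settings are arbitrary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BUT-FIT/csmpt7b"  # assumed Hugging Face repository id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # halve memory; use float32 on CPU if needed
    trust_remote_code=True,       # MPT-based models ship custom modeling code
)

prompt = "Nejznámější české pohádky jsou"  # "The best-known Czech fairy tales are"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```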