🗿 ruGPT-3.5 13B
A language model designed for the Russian language. As the name suggests, this model has 13 billion parameters. It's our largest model to date and was used in the training of GigaChat. For more details, refer to the article.
🚀 Quick Start
ruGPT-3.5 13B is a powerful language model for Russian that offers high-quality text generation.
✨ Features
- Large-scale Model: With 13 billion parameters, it can handle complex language tasks.
- Diverse Training Data: Trained on a wide range of data sources including various domains, code, and legal documents.
- Used in GigaChat: Served as the foundation for training GigaChat.
📦 Installation
The original model card does not list explicit installation steps; the examples below use the Hugging Face transformers library with PyTorch on a CUDA GPU, as in the setup sketch that follows.
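The usage examples assume the tokenizer and model are already loaded. A minimal setup sketch, assuming the model is published on the Hugging Face Hub under the ai-forever/ruGPT-3.5-13B repository id (verify the exact id on the Hub) and that a CUDA GPU with enough memory for a 13B-parameter model is available:

# Assumed prerequisites: pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai-forever/ruGPT-3.5-13B"  # assumed Hub repository id; check the Hub for the exact name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision roughly halves memory for the 13B weights
).to('cuda:0')
model.eval()  # inference only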
💻 Usage Examples
Basic Usage
request = "Стих про программиста может быть таким:"
encoded_input = tokenizer(request, return_tensors='pt',
                          add_special_tokens=False).to('cuda:0')
output = model.generate(
**encoded_input,
num_beams=2,
do_sample=True,
max_new_tokens=100
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
>>> Стих про программиста может быть таким:
Программист сидит в кресле,
Стих сочиняет он про любовь,
Он пишет, пишет, пишет, пишет...
И не выходит ни черта!
Advanced Usage
request = "Нейронная сеть — это"
encoded_input = tokenizer(request, return_tensors='pt',
                          add_special_tokens=False).to('cuda:0')
output = model.generate(
**encoded_input,
num_beams=4,
do_sample=True,
max_new_tokens=100
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
>>> Нейронная сеть — это математическая модель, состоящая из большого
количества нейронов, соединенных между собой электрическими связями.
Нейронная сеть может быть смоделирована на компьютере, и с ее помощью
можно решать задачи, которые не поддаются решению с помощью традиционных
математических методов.
request = "Гагарин полетел в космос в"
encoded_input = tokenizer(request, return_tensors='pt',
                          add_special_tokens=False).to('cuda:0')
output = model.generate(
**encoded_input,
num_beams=2,
do_sample=True,
max_new_tokens=100
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
>>> Гагарин полетел в космос в 1961 году. Это было первое в истории
человечества космическое путешествие. Юрий Гагарин совершил его
на космическом корабле Восток-1. Корабль был запущен с космодрома
Байконур.
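The examples above combine beam search (num_beams) with sampling (do_sample) and cap the output at 100 new tokens. As an illustration only, and not part of the original card, transformers' generate also accepts plain sampling controls such as temperature and top_p:

# Illustrative decoding settings; the specific values are assumptions, not recommendations from the card.
output = model.generate(
    **encoded_input,
    do_sample=True,
    temperature=0.8,   # lower values make the output more deterministic
    top_p=0.95,        # nucleus sampling: keep the smallest token set with cumulative probability 0.95
    max_new_tokens=100,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))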
📚 Documentation
Dataset
The model was pretrained on 300 GB of data from various domains and then additionally trained on 100 GB of code and legal documents.

The training data was deduplicated: each text in the corpus was hashed with a 64-bit hash, and only texts with unique hashes were kept. Documents were also filtered by their text compression rate with zlib; the most strongly and most weakly compressing deduplicated texts were discarded.
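A minimal sketch of this deduplication and filtering scheme, assuming the corpus is a list of Python strings; the hash function, the use of zlib, and the thresholds below are illustrative choices, not the authors' exact pipeline:

import hashlib
import zlib

def text_hash64(text: str) -> int:
    # 64-bit hash of the text (illustrative; the card only specifies "64-bit hashing")
    return int.from_bytes(hashlib.blake2b(text.encode('utf-8'), digest_size=8).digest(), 'big')

def compression_rate(text: str) -> float:
    # Ratio of compressed size to original size; extreme values tend to indicate
    # degenerate text (highly repetitive or near-random).
    raw = text.encode('utf-8')
    return len(zlib.compress(raw)) / max(len(raw), 1)

def dedup_and_filter(corpus, low=0.3, high=0.9):  # thresholds are illustrative assumptions
    seen, kept = set(), []
    for text in corpus:
        h = text_hash64(text)
        if h in seen:
            continue             # drop texts whose hash was already seen
        seen.add(h)
        rate = compression_rate(text)
        if low <= rate <= high:  # discard the most strongly and most weakly compressing texts
            kept.append(text)
    return kept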
Technical Details
The model was trained with the DeepSpeed and Megatron libraries on a 300B-token dataset for 3 epochs, which took around 45 days on 512 V100 GPUs. It was then finetuned for 1 epoch on the additional data described above with a sequence length of 2048, which took around 20 days on 200 A100 GPUs.
After the final training stage, the model's perplexity on Russian text was around 8.8.
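The perplexity figure above is the authors' reported number. A minimal sketch of how perplexity could be measured for a single text with the loaded model and tokenizer (an illustrative calculation, not the authors' evaluation code):

import torch

def perplexity(text: str) -> float:
    # Perplexity = exp(mean negative log-likelihood of the tokens under the model).
    enc = tokenizer(text, return_tensors='pt').to('cuda:0')
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss over the shifted sequence.
        loss = model(**enc, labels=enc['input_ids']).loss
    return torch.exp(loss).item()

print(perplexity("Гагарин полетел в космос в 1961 году."))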

📄 License
The model is released under the MIT license.