
DeBERTa V1 Base

Developed by deepvk
DeBERTa-base is a pre-trained bidirectional encoder for Russian, intended primarily for Russian text-processing tasks.
Release Date: 2/7/2023

Model Overview

This model is trained on a large text corpus containing open social data using the standard Masked Language Model (MLM) objective and supports Russian and a small number of other languages.
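To illustrate the MLM objective mentioned above, the sketch below applies the standard BERT-style corruption recipe (select ~15% of tokens as prediction targets; of those, 80% become `[MASK]`, 10% a random vocabulary token, 10% are left unchanged). The exact masking settings used for this particular model are an assumption, and the function and toy vocabulary are hypothetical:

```python
import random

MASK = "[MASK]"
VOCAB = ["привет", "мир", "текст", "модель"]  # toy vocabulary for illustration

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Standard BERT-style MLM corruption: pick ~15% of positions as
    prediction targets; of those, 80% become [MASK], 10% a random
    vocabulary token, and 10% are left unchanged."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)
            # else: keep the original token (it is still a prediction target)
    return corrupted, labels
```

During pre-training, the loss is computed only at the positions where `labels` is set, so the model learns to reconstruct the original tokens from bidirectional context.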

Model Features

Large-scale training data
Trained on 400GB of filtered and deduplicated text from multiple sources, including Wikipedia, books, and Twitter comments.
Efficient deduplication process
Data deduplication is performed with MinHash signatures and Jaccard similarity estimation, removing near-duplicate documents while preserving the diversity of the training data.
High-performance optimization
Trained with the AdamW optimizer and mixed-precision training on 8 A100 GPUs for 30 days.
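The MinHash deduplication step above can be sketched as follows. This is a minimal illustration of how per-hash minima estimate Jaccard similarity between documents; the shingle size, number of hash functions, and hashing scheme here are assumptions, not the actual pipeline parameters:

```python
import hashlib

def shingles(text, k=3):
    """Represent a document as its set of character k-shingles."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_hashes=128):
    """One minimum per (seeded) hash function. For two sets, the probability
    that signatures agree at position i equals their Jaccard similarity."""
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{i}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in shingle_set
        )
        for i in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions ~ Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(a, b):
    return len(a & b) / len(a | b)
```

Comparing fixed-length signatures instead of full shingle sets is what makes deduplication tractable at the 400GB scale: near-duplicate pairs can be flagged when their estimated similarity exceeds a threshold.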

Model Capabilities

Russian text processing
Masked Language Model
Text encoding

Use Cases

Natural Language Processing
Russian text classification
Can be used for Russian text classification tasks, such as sentiment analysis and topic classification.
Performs strongly on the Russian SuperGLUE development set.
Text embedding
Generates embedding representations of Russian text for downstream tasks such as similarity calculation and clustering.
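For the embedding use case, a common convention (though not one this card prescribes) is to mean-pool the encoder's token states over non-padding positions to obtain a fixed-size sentence vector. The sketch below uses NumPy arrays standing in for the model's hidden states; with the real model, the `transformers` encoder outputs have the same `(batch, seq_len, dim)` shape:

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token vectors over non-padding positions.
    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (b, s, 1)
    summed = (hidden_states * mask).sum(axis=1)                   # (b, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                # avoid div by 0
    return summed / counts

def cosine_sim(u, v):
    """Cosine similarity between two 1-D embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Masking before averaging matters: padding positions would otherwise pull every embedding toward the padding vector, corrupting similarity and clustering results downstream.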