DeBERTa-distill
A pretrained bidirectional encoder for the Russian language, trained with the standard masked language modeling (MLM) objective on large text corpora that include open social data.
Important Note
This model contains only the encoder part without any pretrained head.
- Developed by: deepvk
- Model type: DeBERTa
- Languages: Mostly Russian and a small fraction of other languages
- License: Apache 2.0
Quick Start
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("deepvk/deberta-v1-distill")
model = AutoModel.from_pretrained("deepvk/deberta-v1-distill")

text = "Привет, мир!"  # "Hello, world!"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # no pretrained head: returns raw hidden states
```
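Because there is no pretrained head, the forward pass returns raw hidden states. Continuing the snippet above, one common way to turn them into a single sentence embedding is mean pooling over non-padding tokens (an illustrative sketch, not an official recipe from this card):

```python
# Mean-pool the final hidden states over non-padding tokens to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1).float()   # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # sum over real tokens only
embedding = summed / mask.sum(dim=1).clamp(min=1e-9)    # (batch, 768)
print(embedding.shape)                                  # torch.Size([1, 768])
```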
Documentation
Training Data
A total of 400 GB of filtered and deduplicated texts were used. The data is a mix of the following sources: Wikipedia, Books, Twitter comments, Pikabu, Proza.ru, Film subtitles, News websites, and Social corpus.
Deduplication procedure
- Compute shingles of size 5.
- Compute MinHash with 100 seeds → every sample (text) gets a signature of length 100.
- Split every signature into 10 bands → each band, which contains 100 / 10 = 10 numbers, is hashed into a single value → every sample gets 10 bucket hashes.
- For each bucket, find duplicates: find samples that share the same hash → compute pairwise Jaccard similarity → if the similarity is > 0.7, the pair is a duplicate.
- Gather duplicates from all buckets and filter them out (see the sketch after this list).
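The steps above amount to standard MinHash-LSH deduplication. A self-contained sketch of the scheme (illustrative only; the card does not name the tooling, and whether shingles are word- or character-level is an assumption here):

```python
import hashlib
from itertools import combinations

NUM_PERM, NUM_BANDS = 100, 10        # 100 hash seeds, 10 bands of 10 values each
ROWS_PER_BAND = NUM_PERM // NUM_BANDS

def shingles(text, size=5):
    """Word-level shingles of size 5 (word-level is an assumption)."""
    tokens = text.split()
    return {" ".join(tokens[i:i + size]) for i in range(max(1, len(tokens) - size + 1))}

def minhash(shingle_set):
    """Signature of length NUM_PERM: for each seed, the minimum hash over all shingles."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingle_set)
        for seed in range(NUM_PERM)
    ]

def jaccard(a, b):
    return len(a & b) / len(a | b)

def find_duplicates(texts, threshold=0.7):
    sets = [shingles(t) for t in texts]
    sigs = [minhash(s) for s in sets]
    # Banding: the tuple of 10 values serves as the bucket key
    # (equivalent to hashing each band into a single value).
    buckets = {}
    for idx, sig in enumerate(sigs):
        for band in range(NUM_BANDS):
            key = (band, tuple(sig[band * ROWS_PER_BAND:(band + 1) * ROWS_PER_BAND]))
            buckets.setdefault(key, []).append(idx)
    # Candidates share a bucket; confirm with pairwise Jaccard similarity > 0.7.
    duplicates = set()
    for members in buckets.values():
        for i, j in combinations(members, 2):
            if jaccard(sets[i], sets[j]) > threshold:
                duplicates.add((i, j))
    return duplicates
```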
Training Hyperparameters

| Property           | Details              |
|--------------------|----------------------|
| Training regime    | fp16 mixed precision |
| Optimizer          | AdamW                |
| Adam betas         | 0.9, 0.98            |
| Adam eps           | 1e-6                 |
| Weight decay       | 1e-2                 |
| Batch size         | 3840                 |
| Num training steps | 100k                 |
| Num warm-up steps  | 5k                   |
| LR scheduler       | Cosine               |
| LR                 | 5e-4                 |
| Gradient norm      | 1.0                  |
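A hedged sketch of how the settings in the table above could be expressed with PyTorch and the `transformers` scheduler helpers (the original training code is not part of this card; `model` refers to the quick-start snippet):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Hyperparameters taken from the table above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=1e-2,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=5_000,
    num_training_steps=100_000,
)
scaler = torch.cuda.amp.GradScaler()  # fp16 mixed precision

# One illustrative training step (batching and loss computation omitted):
# scaler.scale(loss).backward()
# scaler.unscale_(optimizer)
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient norm 1.0
# scaler.step(optimizer)
# scaler.update()
# scheduler.step()
# optimizer.zero_grad()
```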
The model was trained on a machine with 8xA100 for approximately 15 days.
Architecture details

| Property                | Details        |
|-------------------------|----------------|
| Encoder layers          | 6              |
| Encoder attention heads | 12             |
| Encoder embed dim       | 768            |
| Encoder ffn embed dim   | 3,072          |
| Activation function     | GeLU           |
| Attention dropout       | 0.1            |
| Dropout                 | 0.1            |
| Max positions           | 512            |
| Vocab size              | 50266          |
| Tokenizer type          | Byte-level BPE |
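These values map onto a Hugging Face `DebertaConfig` roughly as follows (a sketch; the config shipped with the checkpoint may set additional DeBERTa-specific fields such as relative attention):

```python
from transformers import DebertaConfig, DebertaModel

# Approximate reconstruction of the architecture table above.
config = DebertaConfig(
    num_hidden_layers=6,
    num_attention_heads=12,
    hidden_size=768,
    intermediate_size=3072,
    hidden_act="gelu",
    attention_probs_dropout_prob=0.1,
    hidden_dropout_prob=0.1,
    max_position_embeddings=512,
    vocab_size=50266,
)
untrained_model = DebertaModel(config)  # randomly initialized; use from_pretrained for the released weights
```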
Distillation
In our distillation procedure, we follow Sanh et al. The student is initialized from the teacher by taking every second layer. We use the MLM loss and the CE loss, each with a coefficient of 0.5.
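A hedged sketch of these two ingredients in the DistilBERT-style recipe of Sanh et al. (the attribute paths assume `DebertaForMaskedLM`-style modules, the choice of odd teacher layers and the temperature are assumptions, and KL divergence is used since it differs from the soft-target CE only by a constant):

```python
import torch.nn.functional as F

def init_student_from_teacher(student, teacher):
    """Copy the embeddings and every second encoder layer from a 12-layer teacher into a 6-layer student."""
    student.deberta.embeddings.load_state_dict(teacher.deberta.embeddings.state_dict())
    for student_idx, teacher_idx in enumerate(range(1, 12, 2)):  # which layers are kept is an assumption
        student.deberta.encoder.layer[student_idx].load_state_dict(
            teacher.deberta.encoder.layer[teacher_idx].state_dict()
        )

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0):
    """0.5 * MLM loss on the true tokens + 0.5 * CE between teacher and student distributions."""
    vocab = student_logits.size(-1)
    mlm_loss = F.cross_entropy(
        student_logits.view(-1, vocab), labels.view(-1), ignore_index=-100
    )
    ce_loss = F.kl_div(
        F.log_softmax(student_logits.view(-1, vocab) / temperature, dim=-1),
        F.softmax(teacher_logits.view(-1, vocab) / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return 0.5 * mlm_loss + 0.5 * ce_loss
```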
Evaluation
We evaluated the model on the Russian SuperGLUE dev set. The best result in each task is marked in bold. All models have the same size except the distilled version of DeBERTa.
| Model | RCB | PARus | MuSeRC | TERRa | RUSSE | RWSD | DaNetQA | Score |
|---|---|---|---|---|---|---|---|---|
| [vk-deberta-distill](https://huggingface.co/deepvk/deberta-v1-distill) | 0.433 | 0.56 | 0.625 | 0.59 | 0.943 | 0.569 | 0.726 | 0.635 |
| [vk-roberta-base](https://huggingface.co/deepvk/roberta-base) | 0.46 | 0.56 | 0.679 | **0.769** | 0.960 | 0.569 | 0.658 | 0.665 |
| [vk-deberta-base](https://huggingface.co/deepvk/deberta-v1-base) | 0.450 | **0.61** | **0.722** | 0.704 | 0.948 | 0.578 | **0.76** | **0.682** |
| [vk-bert-base](https://huggingface.co/deepvk/bert-base-uncased) | 0.467 | 0.57 | 0.587 | 0.704 | 0.953 | **0.583** | 0.737 | 0.657 |
| [sber-bert-base](https://huggingface.co/ai-forever/ruBert-base) | **0.491** | **0.61** | 0.663 | **0.769** | **0.962** | 0.574 | 0.678 | 0.678 |
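Since the checkpoint ships without a task head, evaluating on downstream tasks like these requires fine-tuning with a freshly initialized classification head, e.g. (a minimal sketch, not the evaluation setup actually used):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Loads the pretrained encoder and adds a randomly initialized classification head
# (the warning about newly initialized weights is expected).
tokenizer = AutoTokenizer.from_pretrained("deepvk/deberta-v1-distill")
model = AutoModelForSequenceClassification.from_pretrained(
    "deepvk/deberta-v1-distill", num_labels=2
)
# Fine-tune on the target task (e.g. TERRa entailment pairs) with transformers.Trainer
# or a custom training loop before evaluating.
```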
License
This model is licensed under the Apache 2.0 license.