DeBERTa-distill
A pretrained bidirectional encoder for the Russian language, trained with the standard masked language modeling (MLM) objective on large text corpora that include open social data.
Important Note
This model contains only the encoder part without any pretrained head.
- Developed by: deepvk
- Model type: DeBERTa
- Languages: Mostly Russian and a small fraction of other languages
- License: Apache 2.0
Quick Start
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("deepvk/deberta-v1-distill")
model = AutoModel.from_pretrained("deepvk/deberta-v1-distill")

text = "Привет, мир!"  # "Hello, world!"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # no pretrained head: returns raw hidden states
```
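Because there is no pretrained head, the forward pass returns raw hidden states. Continuing the snippet above, one common way to turn them into a single sentence embedding is mean pooling over non-padding tokens (an illustrative sketch, not an official recipe from this card):

```python
# Mean-pool the final hidden states over non-padding tokens to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1).float()   # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # sum over real tokens only
embedding = summed / mask.sum(dim=1).clamp(min=1e-9)    # (batch, 768)
print(embedding.shape)                                  # torch.Size([1, 768])
```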
Documentation
Training Data
A total of 400 GB of filtered and deduplicated texts were used. The data is a mix of the following sources: Wikipedia, Books, Twitter comments, Pikabu, Proza.ru, Film subtitles, News websites, and Social corpus.
Deduplication procedure
- Compute shingles of size 5.
- Compute MinHash with 100 seeds → every sample (text) gets a signature of length 100.
- Split every signature into 10 bands → each band, which contains 100 / 10 = 10 numbers, is hashed into a single value → every sample gets 10 bucket hashes.
- For each bucket, find duplicates: find samples that share the same hash → compute pairwise Jaccard similarity → if the similarity is > 0.7, the pair is a duplicate.
- Gather duplicates from all buckets and filter them out (see the sketch after this list).
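The steps above amount to standard MinHash-LSH deduplication. A self-contained sketch of the scheme (illustrative only; the card does not name the tooling, and whether shingles are word- or character-level is an assumption here):

```python
import hashlib
from itertools import combinations

NUM_PERM, NUM_BANDS = 100, 10        # 100 hash seeds, 10 bands of 10 values each
ROWS_PER_BAND = NUM_PERM // NUM_BANDS

def shingles(text, size=5):
    """Word-level shingles of size 5 (word-level is an assumption)."""
    tokens = text.split()
    return {" ".join(tokens[i:i + size]) for i in range(max(1, len(tokens) - size + 1))}

def minhash(shingle_set):
    """Signature of length NUM_PERM: for each seed, the minimum hash over all shingles."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingle_set)
        for seed in range(NUM_PERM)
    ]

def jaccard(a, b):
    return len(a & b) / len(a | b)

def find_duplicates(texts, threshold=0.7):
    sets = [shingles(t) for t in texts]
    sigs = [minhash(s) for s in sets]
    # Banding: the tuple of 10 values serves as the bucket key
    # (equivalent to hashing each band into a single value).
    buckets = {}
    for idx, sig in enumerate(sigs):
        for band in range(NUM_BANDS):
            key = (band, tuple(sig[band * ROWS_PER_BAND:(band + 1) * ROWS_PER_BAND]))
            buckets.setdefault(key, []).append(idx)
    # Candidates share a bucket; confirm with pairwise Jaccard similarity > 0.7.
    duplicates = set()
    for members in buckets.values():
        for i, j in combinations(members, 2):
            if jaccard(sets[i], sets[j]) > threshold:
                duplicates.add((i, j))
    return duplicates
```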
Training Hyperparameters

| Property           | Details              |
|--------------------|----------------------|
| Training regime    | fp16 mixed precision |
| Optimizer          | AdamW                |
| Adam betas         | 0.9, 0.98            |
| Adam eps           | 1e-6                 |
| Weight decay       | 1e-2                 |
| Batch size         | 3840                 |
| Num training steps | 100k                 |
| Num warm-up steps  | 5k                   |
| LR scheduler       | Cosine               |
| LR                 | 5e-4                 |
| Gradient norm      | 1.0                  |
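A hedged sketch of how the settings in the table above could be expressed with PyTorch and the `transformers` scheduler helpers (the original training code is not part of this card; `model` refers to the quick-start snippet):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Hyperparameters taken from the table above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=1e-2,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=5_000,
    num_training_steps=100_000,
)
scaler = torch.cuda.amp.GradScaler()  # fp16 mixed precision

# One illustrative training step (batching and loss computation omitted):
# scaler.scale(loss).backward()
# scaler.unscale_(optimizer)
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient norm 1.0
# scaler.step(optimizer)
# scaler.update()
# scheduler.step()
# optimizer.zero_grad()
```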
The model was trained on a machine with 8xA100 for approximately 15 days.
Architecture details

| Property                | Details        |
|-------------------------|----------------|
| Encoder layers          | 6              |
| Encoder attention heads | 12             |
| Encoder embed dim       | 768            |
| Encoder ffn embed dim   | 3,072          |
| Activation function     | GeLU           |
| Attention dropout       | 0.1            |
| Dropout                 | 0.1            |
| Max positions           | 512            |
| Vocab size              | 50266          |
| Tokenizer type          | Byte-level BPE |
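These values map onto a Hugging Face `DebertaConfig` roughly as follows (a sketch; the config shipped with the checkpoint may set additional DeBERTa-specific fields such as relative attention):

```python
from transformers import DebertaConfig, DebertaModel

# Approximate reconstruction of the architecture table above.
config = DebertaConfig(
    num_hidden_layers=6,
    num_attention_heads=12,
    hidden_size=768,
    intermediate_size=3072,
    hidden_act="gelu",
    attention_probs_dropout_prob=0.1,
    hidden_dropout_prob=0.1,
    max_position_embeddings=512,
    vocab_size=50266,
)
untrained_model = DebertaModel(config)  # randomly initialized; use from_pretrained for the released weights
```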
Distillation
In our distillation procedure, we follow Sanh et al. The student is initialized from the teacher by taking every second layer. We use the MLM loss and the CE loss, each with a coefficient of 0.5.
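A hedged sketch of these two ingredients in the DistilBERT-style recipe of Sanh et al. (the attribute paths assume `DebertaForMaskedLM`-style modules, the choice of odd teacher layers and the temperature are assumptions, and KL divergence is used since it differs from the soft-target CE only by a constant):

```python
import torch.nn.functional as F

def init_student_from_teacher(student, teacher):
    """Copy the embeddings and every second encoder layer from a 12-layer teacher into a 6-layer student."""
    student.deberta.embeddings.load_state_dict(teacher.deberta.embeddings.state_dict())
    for student_idx, teacher_idx in enumerate(range(1, 12, 2)):  # which layers are kept is an assumption
        student.deberta.encoder.layer[student_idx].load_state_dict(
            teacher.deberta.encoder.layer[teacher_idx].state_dict()
        )

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0):
    """0.5 * MLM loss on the true tokens + 0.5 * CE between teacher and student distributions."""
    vocab = student_logits.size(-1)
    mlm_loss = F.cross_entropy(
        student_logits.view(-1, vocab), labels.view(-1), ignore_index=-100
    )
    ce_loss = F.kl_div(
        F.log_softmax(student_logits.view(-1, vocab) / temperature, dim=-1),
        F.softmax(teacher_logits.view(-1, vocab) / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return 0.5 * mlm_loss + 0.5 * ce_loss
```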
Evaluation
We evaluated the model on the Russian SuperGLUE dev set. The best result in each task is marked in bold. All models have the same size except the distilled version of DeBERTa.
| Model | RCB | PARus | MuSeRC | TERRa | RUSSE | RWSD | DaNetQA | Score |
|---|---|---|---|---|---|---|---|---|
| [vk-deberta-distill](https://huggingface.co/deepvk/deberta-v1-distill) | 0.433 | 0.56 | 0.625 | 0.59 | 0.943 | 0.569 | 0.726 | 0.635 |
| [vk-roberta-base](https://huggingface.co/deepvk/roberta-base) | 0.46 | 0.56 | 0.679 | **0.769** | 0.960 | 0.569 | 0.658 | 0.665 |
| [vk-deberta-base](https://huggingface.co/deepvk/deberta-v1-base) | 0.450 | **0.61** | **0.722** | 0.704 | 0.948 | 0.578 | **0.76** | **0.682** |
| [vk-bert-base](https://huggingface.co/deepvk/bert-base-uncased) | 0.467 | 0.57 | 0.587 | 0.704 | 0.953 | **0.583** | 0.737 | 0.657 |
| [sber-bert-base](https://huggingface.co/ai-forever/ruBert-base) | **0.491** | **0.61** | 0.663 | **0.769** | **0.962** | 0.574 | 0.678 | 0.678 |
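Since the checkpoint ships without a task head, evaluating on downstream tasks like these requires fine-tuning with a freshly initialized classification head, e.g. (a minimal sketch, not the evaluation setup actually used):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Loads the pretrained encoder and adds a randomly initialized classification head
# (the warning about newly initialized weights is expected).
tokenizer = AutoTokenizer.from_pretrained("deepvk/deberta-v1-distill")
model = AutoModelForSequenceClassification.from_pretrained(
    "deepvk/deberta-v1-distill", num_labels=2
)
# Fine-tune on the target task (e.g. TERRa entailment pairs) with transformers.Trainer
# or a custom training loop before evaluating.
```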
License
This model is licensed under the Apache 2.0 license.