# DistilProtBert
A distilled version of the ProtBert-UniRef100 model, designed for protein feature extraction and downstream task fine-tuning.
DistilProtBert is a distilled model of ProtBert-UniRef100. Besides cross-entropy and cosine teacher-student losses, it was pretrained on a masked language modeling (MLM) objective, and it only works with capital-letter amino acids. For more details, check out our paper DistilProtBert: A distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts. You can also find the Git repository here.
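To give a rough idea of how such a DistilBERT-style objective combines these terms, here is a minimal, illustrative sketch in PyTorch. The equal loss weighting, temperature value, and tensor shapes are assumptions for illustration only, not the exact training code used for DistilProtBert:

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
                      mlm_labels, temperature=2.0):
    """Sketch of a DistilBERT-style objective: MLM cross-entropy on masked tokens,
    soft-target (teacher-student) cross-entropy, and a cosine loss aligning
    the student's and teacher's hidden states."""
    vocab_size = student_logits.size(-1)
    hidden_dim = student_hidden.size(-1)

    # 1) MLM loss: non-masked positions carry the label -100 and are ignored.
    mlm_loss = F.cross_entropy(
        student_logits.view(-1, vocab_size),
        mlm_labels.view(-1),
        ignore_index=-100,
    )

    # 2) Teacher-student loss on temperature-softened output distributions.
    soft_student = F.log_softmax(student_logits.view(-1, vocab_size) / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits.view(-1, vocab_size) / temperature, dim=-1)
    ce_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (temperature ** 2)

    # 3) Cosine loss pulling student hidden states toward the teacher's.
    flat_student = student_hidden.view(-1, hidden_dim)
    flat_teacher = teacher_hidden.view(-1, hidden_dim)
    target = flat_student.new_ones(flat_student.size(0))
    cos_loss = F.cosine_embedding_loss(flat_student, flat_teacher, target)

    # Equal weighting of the three terms is an assumption, not the paper's exact setting.
    return mlm_loss + ce_loss + cos_loss
```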
## Quick Start
The model can be used in the same way as ProtBert and with ProtBert's tokenizer. It can be applied for protein feature extraction or fine-tuned on downstream tasks.
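For example, a minimal feature-extraction sketch with Hugging Face Transformers could look as follows; the hub ID yarongef/DistilProtBert and the example sequence are placeholders, so adjust them to the checkpoint you actually use:

```python
import re

import torch
from transformers import BertModel, BertTokenizer

# Hub ID is an assumption; point it at the actual DistilProtBert checkpoint.
model_name = "yarongef/DistilProtBert"

tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
model = BertModel.from_pretrained(model_name)
model.eval()

# ProtBert-style preprocessing: uppercase, space-separated residues,
# with rare amino acids (U, Z, O, B) mapped to X.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
sequence = " ".join(re.sub(r"[UZOB]", "X", sequence.upper()))

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

per_residue_features = outputs.last_hidden_state  # (1, seq_len + 2, hidden_dim)
```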
## Features
- A distilled version of the ProtBert-UniRef100 model.
- Pretrained on a masked language modeling (MLM) objective in addition to cross-entropy and cosine teacher-student losses.
- Works only with capital-letter amino acids.
## Documentation
### Model details
| Property | Details |
|---|---|
| Model type | DistilProtBert (a distilled version of ProtBert-UniRef100) |
| # of parameters | 230M |
| # of hidden layers | 15 |
| Pretraining dataset | UniRef50 |
| # of proteins | 43M |
| Pretraining hardware | 5 V100 32GB GPUs |
### Intended uses & limitations
The model can be used for protein feature extraction or fine-tuned on downstream tasks.
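As an illustration of the fine-tuning route, a sequence-classification head can be attached on top of the distilled encoder. The hub ID, label count, and toy sequences below are assumptions for the sketch:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Hub ID is an assumption; point it at the actual DistilProtBert checkpoint.
model_name = "yarongef/DistilProtBert"

tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
# A freshly initialized classification head is placed on top of the encoder,
# e.g. for a binary real-vs-shuffled protein task.
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Uppercase, space-separated amino acids, as expected by ProtBert's tokenizer.
batch = tokenizer(
    ["M K T A Y I A K Q R Q I S F V K", "M A D E E K L P P G W E K R M S"],
    padding=True,
    return_tensors="pt",
)
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)
outputs.loss.backward()  # plug into your own training loop or the Trainer API
```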
### Training data
DistilProtBert was pretrained on UniRef50, a dataset consisting of ~43 million protein sequences (only sequences between 20 and 512 amino acids long were used).
### Pretraining procedure
Preprocessing was done with ProtBert's tokenizer. The masking procedure for each sequence followed the original BERT (as in ProtBert). The model was pretrained on a single DGX cluster for 3 epochs in total, with a local batch size of 16, using the AdamW optimizer with a learning rate of 5e-5 and mixed precision.
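The sketch below illustrates this BERT-style masking step with Hugging Face's DataCollatorForLanguageModeling. The ProtBert tokenizer ID (Rostlab/prot_bert), the 15% masking probability, and the example sequence are assumptions based on the standard BERT/ProtBert setup rather than values stated above:

```python
from transformers import BertTokenizer, DataCollatorForLanguageModeling

# ProtBert's tokenizer (hub ID assumed); DistilProtBert shares its vocabulary.
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)

# BERT-style dynamic masking; 15% is the standard BERT masking probability.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

encoded = tokenizer(
    ["M K T A Y I A K Q R Q I S F V K"],
    truncation=True,
    max_length=512,
)
# The collator masks tokens at random and emits the MLM labels used for pretraining.
batch = collator([{"input_ids": ids} for ids in encoded["input_ids"]])
print(batch["input_ids"].shape, batch["labels"].shape)
```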
### Evaluation results
When fine-tuned on downstream tasks, this model achieves the following results:
| Task/Dataset | Secondary structure (3-state accuracy) | Membrane |
|---|---|---|
| CASP12 | 72 | |
| TS115 | 81 | |
| CB513 | 79 | |
| DeepLoc | | 86 |
Distinguishing between real proteins and their k-let shuffled versions (AUC):
Singlet:

| Model | AUC |
|---|---|
| LSTM | 0.71 |
| ProtBert | 0.93 |
| DistilProtBert | 0.92 |

Doublet:

| Model | AUC |
|---|---|
| LSTM | 0.68 |
| ProtBert | 0.92 |
| DistilProtBert | 0.91 |

Triplet:

| Model | AUC |
|---|---|
| LSTM | 0.61 |
| ProtBert | 0.92 |
| DistilProtBert | 0.87 |
### Citation
If you use this model, please cite our paper:
```bibtex
@article{geffen2022distilprotbert,
  author  = {Geffen, Yaron and Ofran, Yanay and Unger, Ron},
  title   = {DistilProtBert: A distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts},
  year    = {2022},
  doi     = {10.1093/bioinformatics/btac474},
  url     = {https://doi.org/10.1093/bioinformatics/btac474},
  journal = {Bioinformatics}
}
```
## License
This project is licensed under the MIT license.