# DistilProtBert
A distilled version of the ProtBert-UniRef100 model, designed for protein feature extraction and downstream task fine-tuning.
DistilProtBert is a distilled model of ProtBert-UniRef100. Besides cross-entropy and cosine teacher-student losses, it was pretrained on a masked language modeling (MLM) objective, and it only works with capital-letter amino acids. For more details, check out our paper DistilProtBert: A distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts. You can also find the Git repository here.
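To give a rough idea of how such a DistilBERT-style objective combines these terms, here is a minimal, illustrative sketch in PyTorch. The equal loss weighting, temperature value, and tensor shapes are assumptions for illustration only, not the exact training code used for DistilProtBert:

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
                      mlm_labels, temperature=2.0):
    """Sketch of a DistilBERT-style objective: MLM cross-entropy on masked tokens,
    soft-target (teacher-student) cross-entropy, and a cosine loss aligning
    the student's and teacher's hidden states."""
    vocab_size = student_logits.size(-1)
    hidden_dim = student_hidden.size(-1)

    # 1) MLM loss: non-masked positions carry the label -100 and are ignored.
    mlm_loss = F.cross_entropy(
        student_logits.view(-1, vocab_size),
        mlm_labels.view(-1),
        ignore_index=-100,
    )

    # 2) Teacher-student loss on temperature-softened output distributions.
    soft_student = F.log_softmax(student_logits.view(-1, vocab_size) / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits.view(-1, vocab_size) / temperature, dim=-1)
    ce_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (temperature ** 2)

    # 3) Cosine loss pulling student hidden states toward the teacher's.
    flat_student = student_hidden.view(-1, hidden_dim)
    flat_teacher = teacher_hidden.view(-1, hidden_dim)
    target = flat_student.new_ones(flat_student.size(0))
    cos_loss = F.cosine_embedding_loss(flat_student, flat_teacher, target)

    # Equal weighting of the three terms is an assumption, not the paper's exact setting.
    return mlm_loss + ce_loss + cos_loss
```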
## Quick Start
The model can be used in the same way as ProtBert and with ProtBert's tokenizer. It can be applied for protein feature extraction or fine-tuned on downstream tasks.
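For example, a minimal feature-extraction sketch with Hugging Face Transformers could look as follows; the hub ID yarongef/DistilProtBert and the example sequence are placeholders, so adjust them to the checkpoint you actually use:

```python
import re

import torch
from transformers import BertModel, BertTokenizer

# Hub ID is an assumption; point it at the actual DistilProtBert checkpoint.
model_name = "yarongef/DistilProtBert"

tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
model = BertModel.from_pretrained(model_name)
model.eval()

# ProtBert-style preprocessing: uppercase, space-separated residues,
# with rare amino acids (U, Z, O, B) mapped to X.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
sequence = " ".join(re.sub(r"[UZOB]", "X", sequence.upper()))

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

per_residue_features = outputs.last_hidden_state  # (1, seq_len + 2, hidden_dim)
```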
## Features
- A distilled version of the ProtBert-UniRef100 model.
- Pretrained on a masked language modeling (MLM) objective in addition to cross-entropy and cosine teacher-student losses.
- Works only with capital-letter amino acids.
## Documentation
### Model details
| Property | Details |
|---|---|
| Model type | DistilProtBert (a distilled version of ProtBert-UniRef100) |
| # of parameters | 230M |
| # of hidden layers | 15 |
| Pretraining dataset | UniRef50 |
| # of proteins | 43M |
| Pretraining hardware | 5 V100 32GB GPUs |
### Intended uses & limitations
The model can be used for protein feature extraction or fine-tuned on downstream tasks.
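As an illustration of the fine-tuning route, a sequence-classification head can be attached on top of the distilled encoder. The hub ID, label count, and toy sequences below are assumptions for the sketch:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Hub ID is an assumption; point it at the actual DistilProtBert checkpoint.
model_name = "yarongef/DistilProtBert"

tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
# A freshly initialized classification head is placed on top of the encoder,
# e.g. for a binary real-vs-shuffled protein task.
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Uppercase, space-separated amino acids, as expected by ProtBert's tokenizer.
batch = tokenizer(
    ["M K T A Y I A K Q R Q I S F V K", "M A D E E K L P P G W E K R M S"],
    padding=True,
    return_tensors="pt",
)
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)
outputs.loss.backward()  # plug into your own training loop or the Trainer API
```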
### Training data
DistilProtBert was pretrained on UniRef50, a dataset consisting of ~43 million protein sequences (only sequences between 20 and 512 amino acids long were used).
### Pretraining procedure
Preprocessing was done with ProtBert's tokenizer. The masking procedure for each sequence followed the original BERT (as in ProtBert). The model was pretrained on a single DGX cluster for 3 epochs in total, with a local batch size of 16, using the AdamW optimizer with a learning rate of 5e-5 and mixed precision.
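The sketch below illustrates this BERT-style masking step with Hugging Face's DataCollatorForLanguageModeling. The ProtBert tokenizer ID (Rostlab/prot_bert), the 15% masking probability, and the example sequence are assumptions based on the standard BERT/ProtBert setup rather than values stated above:

```python
from transformers import BertTokenizer, DataCollatorForLanguageModeling

# ProtBert's tokenizer (hub ID assumed); DistilProtBert shares its vocabulary.
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)

# BERT-style dynamic masking; 15% is the standard BERT masking probability.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

encoded = tokenizer(
    ["M K T A Y I A K Q R Q I S F V K"],
    truncation=True,
    max_length=512,
)
# The collator masks tokens at random and emits the MLM labels used for pretraining.
batch = collator([{"input_ids": ids} for ids in encoded["input_ids"]])
print(batch["input_ids"].shape, batch["labels"].shape)
```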
### Evaluation results
When fine-tuned on downstream tasks, this model achieves the following results:
| Task/Dataset | Secondary structure (3-state accuracy) | Membrane |
|---|---|---|
| CASP12 | 72 | |
| TS115 | 81 | |
| CB513 | 79 | |
| DeepLoc | | 86 |
Distinguishing between real proteins and their k-let shuffled versions (AUC):
Singlet:

| Model | AUC |
|---|---|
| LSTM | 0.71 |
| ProtBert | 0.93 |
| DistilProtBert | 0.92 |

Doublet:

| Model | AUC |
|---|---|
| LSTM | 0.68 |
| ProtBert | 0.92 |
| DistilProtBert | 0.91 |

Triplet:

| Model | AUC |
|---|---|
| LSTM | 0.61 |
| ProtBert | 0.92 |
| DistilProtBert | 0.87 |
### Citation
If you use this model, please cite our paper:
```bibtex
@article{geffen2022distilprotbert,
  author  = {Geffen, Yaron and Ofran, Yanay and Unger, Ron},
  title   = {DistilProtBert: A distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts},
  year    = {2022},
  doi     = {10.1093/bioinformatics/btac474},
  url     = {https://doi.org/10.1093/bioinformatics/btac474},
  journal = {Bioinformatics}
}
```
## License
This project is licensed under the MIT license.