# 🚀 ESM-2 Sequence Classifier
This is a small sequence classifier that assigns protein sequences to one of three categories: enzymes (class 0), receptor_proteins (class 1), and structural_proteins (class 2). It was trained on synthetic data generated by GPT-4 and builds on facebook/esm2_t6_8M_UR50D, one of the ESM-2 models.

Please note that this model is for experimental and educational purposes only, as it has not undergone extensive testing. Use it with caution.
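For reference, the class indices map to labels as follows. This is a minimal sketch based on the description above; if the mapping was stored in the model config at training time, `model.config.id2label` is the authoritative source.

```python
# Label mapping as described above (assumed; prefer model.config.id2label if it is set)
id2label = {0: "enzymes", 1: "receptor_proteins", 2: "structural_proteins"}
```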
## 🚀 Quick Start

### ✨ Features
- Classifies protein sequences into three categories: enzymes, receptor_proteins, and structural_proteins.
- Trained on synthetic data generated by GPT-4.
- Builds on the facebook/esm2_t6_8M_UR50D model.
### 💻 Usage Examples

#### Basic Usage

```python
import torch
from transformers import AutoTokenizer, EsmForSequenceClassification

# Load the fine-tuned classifier and the matching base-model tokenizer
model = EsmForSequenceClassification.from_pretrained("AmelieSchreiber/esm2_t6_8M_UR50D_sequence_classifier_v1")
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
new_sequences_0 = [
"ACGYLKTPKLADPPVLRGDSSVTKAICKPDPVLEK",
"GVALDECKALDYLPGKPLPMDGKVCQCGSKTPLRP",
"VLPGYTCGELDCKPGKPLPKCGADKTQVATPFLRG",
"TCGALVQYPSCADPPVLRGSDSSVKACKKLDPQDK",
"GALCEECKLCPGADYKPMDGDRLPAAATSKTRPVG",
"PAVDCKKALVYLPKPLPMDGKVCRGSKTPKTRPYG",
"VLGYTCGALDCKPGKPLPKCGADKTQVATPFLRGA",
"CGALVQYPSCADPPVLRGSDSSVKACKKLDPQDKT",
"ALCEECKLCPGADYKPMDGDRLPAAATSKTRPVGK",
"AVDCKKALVYLPKPLPMDGKVCRGSKTPKTRPYGR",
]
new_sequences_1 = [
"VGQRFYGGRQKNRHCELSPLPSACRGSVQGALYTD",
"KDQVLTVPTYACRCCPKMDSKGRVPSTLRVKSARS",
"PLAGVACGRGLDYRCPRKMVPGDLQVTPATQRPYG",
"CGVRLGYPGCADVPLRGRSSFAPRACMKKDPRVTR",
"RKGVAYLYECRKLRCRADYKPRGMDGRRLPKASTT",
"RPTGAVNCKQAKVYRGLPLPMMGKVPRVCRSRRPY",
"RLDGGYTCGQALDCKPGRKPPKMGCADLKSTVATP",
"LGTCRKLVRYPQCADPPVMGRSSFRPKACCRQDPV",
"RVGYAMCSPKLCSCRADYKPPMGDGDRLPKAATSK",
"QPKAVNCRKAMVYRPKPLPMDKGVPVCRSKRPRPY",
]
new_sequences_2 = [
"VGKGFRYGSSQKRYLHCQKSALPPSCRRGKGQGSAT",
"KDPTVMTVGTYSCQCPKQDSRGSVQPTSRVKTSRSK",
"PLVGKACGRSSDYKCPGQMVSGGSKQTPASQRPSYD",
"CGKKLVGYPSSKADVPLQGRSSFSPKACKKDPQMTS",
"RKGVASLYCSSKLSCKAQYSKGMSDGRSPKASSTTS",
"RPKSAASCEQAKSYRSLSLPSMKGKVPSKCSRSKRP",
"RSDVSYTSCSQSKDCKPSKPPKMSGSKDSSTVATPS",
"LSTCSKKVAYPSSKADPPSSGRSSFSMKACKKQDPPV",
"RVGSASSEPKSSCSVQSYSKPSMSGDSSPKASSTSK",
"QPSASNCEKMSSYRPSLPSMSKGVPSSRSKSSPPYQ",
]
new_sequences = new_sequences_0 + new_sequences_1 + new_sequences_2

# Tokenize the full batch; padding aligns sequences of different lengths
inputs = tokenizer(new_sequences, return_tensors="pt", padding=True, truncation=True)

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring class for each sequence
predicted_class_ids = torch.argmax(logits, dim=-1)
for sequence, predicted_class in zip(new_sequences, predicted_class_ids):
    print(f"Sequence: {sequence}, Predicted class: {predicted_class.item()}")
```
## 📄 License

This project is licensed under the MIT License.