ProtGPT2 Open-Source Protein Language Model - Freely Generate New Protein Sequences Retaining Key Features

ProtGPT2

Developed by nferruz

ProtGPT2 is a protein language model based on the GPT2 architecture, capable of generating novel protein sequences while retaining key features of natural proteins.

Protein Model

Transformers

Open Source License: Apache-2.0 #Protein Generation #Zero-shot Design #Self-supervised Learning

Downloads 17.99k

Release Time : 3/7/2022

Model Overview

ProtGPT2 is a language model that understands protein language and is used for novel protein design and engineering. The sequences it generates explore uncharted regions of protein space while preserving key characteristics of natural proteins (amino acid propensities, secondary structure content, and globular properties).

Model Features

Protein Sequence Generation

Capable of generating novel protein sequences to explore uncharted regions of protein space.

Preservation of Natural Features

Generated sequences retain key features of natural proteins, such as amino acid propensities, secondary structure content, and globular properties.

Self-supervised Training

Trained using self-supervised learning with a causal modeling objective to predict the next token in a sequence.

Model Capabilities

Protein sequence generation

Protein design

Protein engineering

Use Cases

Protein Design

Zero-shot Generation of Novel Proteins

Generate novel protein sequences starting from methionine (M).

Generated sequences retain key features of natural proteins.

Fine-tuning Based on User Sequences

Fine-tune based on user-provided sequences to generate specific types of protein sequences.

Generated sequences better align with user requirements.

🚀 ProtGPT2

ProtGPT2 is a language model that speaks the protein language, enabling de novo protein design and engineering. It generates sequences that conserve key features of natural proteins while exploring new regions of the protein space.

🚀 Quick Start

ProtGPT2 can be used with the Hugging Face transformers Python package. Detailed installation instructions can be found here: https://huggingface.co/docs/transformers/installation

✨ Features

ProtGPT2 can generate protein sequences in a zero-shot fashion or after finetuning on a user-defined dataset.
It conserves natural proteins' critical features (amino acid propensities, secondary structure content, and globularity) while exploring unseen regions of the protein space.

📦 Installation

Detailed installation instructions can be found here: https://huggingface.co/docs/transformers/installation

💻 Usage Examples

Basic Usage

Example 1: Generating de novo proteins in a zero-shot fashion

# In the example below, the prompt is the "<|endoftext|>" token, so ProtGPT2 generates whole sequences from scratch
# (the sampled sequences shown below all begin with the amino acid 'M').
# Any amino acid, oligomer, fragment, or protein of choice can be used as the prompt instead.
# The model will generate the most probable sequences that follow the input.
# Alternatively, the input field can also be left empty and it will choose the starting tokens.
>>> from transformers import pipeline
>>> protgpt2 = pipeline('text-generation', model="nferruz/ProtGPT2")
# length is expressed in tokens, where each token has an average length of 4 amino acids.
>>> sequences = protgpt2("<|endoftext|>", max_length=100, do_sample=True, top_k=950, repetition_penalty=1.2, num_return_sequences=10, eos_token_id=0)
>>> for seq in sequences:
...     print(seq)
{'generated_text': 'MINDLLDISRIISGKMTLDRAEVNLTAIARQVVEEQRQAAEAKSIQLLCSTPDTNHYVFG\nDFDRLKQTLWNLLSNAVKFTPSGGTVELELGYNAEGMEVYVKDSGIGIDPAFLPYVFDRF\nRQSDAADSRNYGGLGLGLAIVKHLLDLHEGNVSAQSEGFGKGATFTVLLPLKPLKRELAA\nVNRHTAVQQSAPLNDNLAGMKILIVEDRPDTNEMVSYILEEAGAIVETAESGAAALTSLK\nSYSPDLVLSDIGMPMMDGYEMIEYIREWKTTKGG'}
{'generated_text': 'MQGDSSISSSNRMFT\nLCKPLTVANETSTLSTTRNSKSNKRVSKQRVNLAESPERNAPSPASIKTNETEEFSTIKT\nTNNEVLGYEPNYVSYDFVPMEKCNLCNENCSIELASLNEETFVKKTICCHECRKKAIENA\nENNNTKGSAVSNNSVTSSSGRKKIIVSGSQILRNLDSLTSSKSNISTLLNPNHLAKLAKN\nGNLSSLSSLQSSASSISKSSSTSSTPTTSPKVSSPTNSPSSSPINSPTP'}
{'generated_text': 'M\nSTHVSLENTLASLQATFFSLEARHTALETQLLSTRTELAATKQELVRVQAEISRADAQAQ\nDLKAQILTLKEKADQAEVEAAAATQRAEESQAALEAQTAELAQLRLEKQAPQHVAEEGDP\nQPAAPTTQAQSPVTSAAAAASSAASAEPSKPELTFPAYTKRKPPTITHAPKAPTKVALNP\nSTLSTSGSGGGAKADPTPTTPVPSSSAGLIPKALRLPPPVTPAASGAKPAPSARSKLRGP\nDAPLSPSTQS'}
{'generated_text': 'MVLLSTGPLPILFLGPSLAELNQKYQVVSDTLLRFTNTV\nTFNTLKFLGSDS\n'}
{'generated_text': 'M\nNNDEQPFIMSTSGYAGNTTSSMNSTSDFNTNNKSNTWSNRFSNFIAYFSGVGWFIGAISV\nIFFIIYVIVFLSRKTKPSGQKQYSRTERNNRDVDSIKRANYYG\n'}
{'generated_text': 'M\nEAVYSFTITETGTGTVEVTPLDRTISGADIVYPPDTACVPLTVQPVINANGTWTLGSGCT\nGHFSVDTTGHVNCLTGGFGAAGVHTVIYTVETPYSGNSFAVIDVNVTEPSGPGDGGNGNG\nDRGDGPDNGGGNNPGPDPDPSTPPPPGDCSSPLPVVCSDRDCADFDTQAQVQIYLDRYGG\nTCDLDGNHDGTPCENLPNNSGGQSSDSGNGGGNPGTGSTHQVVTGDCLWNIASRNNGQGG\nQAWPALLAANNESITNP'}
{'generated_text': 'M\nGLTTSGGARGFCSLAVLQELVPRPELLFVIDRAFHSGKHAVDMQVVDQEGLGDGVATLLY\nAHQGLYTCLLQAEARLLGREWAAVPALEPNFMESPLIALPRQLLEGLEQNILSAYGSEWS\nQDVAEPQGDTPAALLATALGLHEPQQVAQRRRQLFEAAEAALQAIRASA\n'}
{'generated_text': 'M\nGAAGYTGSLILAALKQNPDIAVYALNRNDEKLKDVCGQYSNLKGQVCDLSNESQVEALLS\nGPRKTVVNLVGPYSFYGSRVLNACIEANCHYIDLTGEVYWIPQMIKQYHHKAVQSGARIV\nPAVGFDSTPAELGSFFAYQQCREKLKKAHLKIKAYTGQSGGASGGTILTMIQHGIENGKI\nLREIRSMANPREPQSDFKHYKEKTFQDGSASFWGVPFVMKGINTPVVQRSASLLKKLYQP\nFDYKQCFSFSTLLNSLFSYIFNAI'}
{'generated_text': 'M\nKFPSLLLDSYLLVFFIFCSLGLYFSPKEFLSKSYTLLTFFGSLLFIVLVAFPYQSAISAS\nKYYYFPFPIQFFDIGLAENKSNFVTSTTILIFCFILFKRQKYISLLLLTVVLIPIISKGN\nYLFIILILNLAVYFFLFKKLYKKGFCISLFLVFSCIFIFIVSKIMYSSGIEGIYKELIFT\nGDNDGRFLIIKSFLEYWKDNLFFGLGPSSVNLFSGAVSGSFHNTYFFIFFQSGILGAFIF\nLLPFVYFFISFFKDNSSFMKLF'}
{'generated_text': 'M\nRRAVGNADLGMEAARYEPSGAYQASEGDGAHGKPHSLPFVALERWQQLGPEERTLAEAVR\nAVLASGQYLLGEAVRRFETAVAAWLGVPFALGVASGTAALTLALRAYGVGPGDEVIVPAI\nTFIATSNAITAAGARPVLVDIDPSTWNMSVASLAARLTPKTKAILAVHLWGQPVDMHPLL\nDIAAQANLAVIEDCAQALGASIAGTKVGTFGDAAAFSFYPTKNMTTGEGGMLVTNARDLA\nQAARMLRSHGQDPPTAYMHSQVGFN'}
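The generated_text values above carry FASTA-style line breaks inside the sequence. The following is a minimal sketch of how such output could be cleaned and written as FASTA, assuming the pipeline returns a list of dictionaries shaped like the ones shown above. The helper names clean_sequence and to_fasta and the record prefix are hypothetical and not part of transformers or ProtGPT2.

```python
def clean_sequence(generated_text):
    """Remove any '<|endoftext|>' marker and all whitespace from one generated_text value."""
    return "".join(generated_text.replace("<|endoftext|>", "").split())


def to_fasta(outputs, prefix="protgpt2_sample", width=60):
    """Format pipeline-style outputs (a list of {'generated_text': ...}) as FASTA text."""
    records = []
    for index, output in enumerate(outputs, start=1):
        sequence = clean_sequence(output["generated_text"])
        if not sequence:
            continue  # skip empty generations
        # Wrap the residues at `width` characters per line.
        lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
        records.append(f">{prefix}_{index}\n" + "\n".join(lines))
    return "\n".join(records) + "\n" if records else ""


# One of the outputs shown above, used as example input.
example_outputs = [
    {"generated_text": "MVLLSTGPLPILFLGPSLAELNQKYQVVSDTLLRFTNTV\nTFNTLKFLGSDS\n"},
]
print(to_fasta(example_outputs))
```

Since max_length counts tokens rather than residues, len(clean_sequence(...)) is the simplest way to check how many amino acids a given sample actually contains.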

Advanced Usage

Example 2: Finetuning on a set of user-defined sequences

# This alternative to zero-shot generation makes it possible to steer the generation process.
# User-defined training and validation files containing the sequences of interest are provided to the model.
# After a short update of the model's weights, ProtGPT2 will generate sequences that follow the properties of the input sequences.
# To create the training and validation files, it is necessary to (1) replace the FASTA header of each sequence with the expression "<|endoftext|>"
# and (2) split the original dataset into training and validation files (this is often done with a ratio of 90/10, 80/20 or 95/5).
# Then, to finetune the model on the input sequences, we can use the example below.
# Here we show a learning rate of 1e-06, but ideally, the learning rate should be optimised in separate runs.
# After training, the finetuned model will be stored in the ./output folder.
# Lastly, ProtGPT2 can generate the tailored sequences as shown in Example 1.
python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ProtGPT2 \
 --do_train --do_eval --output_dir output --learning_rate 1e-06
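The two preparation steps described in the comments (replacing FASTA headers with "<|endoftext|>" and splitting into training and validation files) could be sketched roughly as follows. This is an illustration under assumptions, not the authors' preprocessing script: the function names, the 90/10 default, and the fixed seed are hypothetical, and sequence lines are kept exactly as they are wrapped in the input FASTA.

```python
import random


def fasta_to_records(fasta_text):
    """Turn FASTA text into records whose header line is replaced by '<|endoftext|>'."""
    records, current = [], None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                records.append(current)
            current = ["<|endoftext|>"]  # the header text itself is discarded
        elif current is not None:
            current.append(line)  # sequence lines are kept as wrapped in the input
    if current is not None:
        records.append(current)
    # Drop entries that had a header but no sequence lines.
    return ["\n".join(record) for record in records if len(record) > 1]


def split_records(records, validation_fraction=0.1, seed=0):
    """Shuffle and split records into (training, validation) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Always hold out at least one record for validation.
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def write_records(records, path):
    """Write the records one after another, each starting on its own line."""
    with open(path, "w") as handle:
        handle.write("\n".join(records) + "\n")


# Tiny made-up FASTA input, for illustration only.
demo_fasta = ">seq1 some description\nMKT\nAAA\n>seq2\nMGG\n"
demo_records = fasta_to_records(demo_fasta)
print(demo_records)
```

With a real dataset, the result of split_records could then be passed to write_records twice, once for training.txt and once for validation.txt, which are the file names used in the command above.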

📚 Documentation

Model description

ProtGPT2 is based on the GPT2 Transformer architecture and contains 36 layers with a model dimensionality of 1280, totalling 738 million parameters.

ProtGPT2 is a decoder-only transformer model pre-trained on the protein space, database UniRef50 (version 2021_04). The pre-training was done on the raw sequences without FASTA headers. Details of training and datasets can be found here: https://huggingface.co/datasets/nferruz/UR50_2021_04

ProtGPT2 was trained in a self-supervised fashion, i.e., the raw sequence data was used during training without including the annotation of sequences. In particular, ProtGPT2 was trained using a causal modelling objective, in which the model is trained to predict the next token (or, in this case, oligomer) in the sequence. By doing so, the model learns an internal representation of proteins and is able to speak the protein language.
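To make the causal modelling objective concrete, here is a toy sketch of the quantity being minimised: the mean negative log-likelihood of each true next token, whose exponential is the perplexity. The probabilities are invented for illustration and do not come from ProtGPT2; the function names are hypothetical.

```python
import math


def causal_lm_loss(next_token_probs):
    """Mean negative log-likelihood of the probabilities given to each true next token."""
    return -sum(math.log(p) for p in next_token_probs) / len(next_token_probs)


def perplexity(next_token_probs):
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(causal_lm_loss(next_token_probs))


# Made-up probabilities a model might assign to four consecutive true tokens.
toy_probs = [0.5, 0.25, 0.5, 0.25]
print(round(causal_lm_loss(toy_probs), 4))  # 1.0397
print(round(perplexity(toy_probs), 4))      # 2.8284
```

A model that assigned probability 1 to every true token would reach a loss of 0 and a perplexity of 1, which is why lower values indicate sequences the model finds more plausible.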

How to select the best sequences

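# We've observed that perplexity values correlate with AlphaFold2's pLDDT.
# We recommend computing perplexity for each sequence as follows:
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Setup that the original snippet leaves out; loading through the Auto* classes is an assumption.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2").to(device)
model.eval()

sequence='MGEAMGLTQPAVSRAVARLEERVGIRIFNRTARAITLTDEGRRFYEAVAPLLAGIEMHGYR\nVNVEGVAQLLELYARDILAEGRLVQLLPEWAD'

#Convert the sequence to a string like this
#(note we have to introduce new line characters every 60 amino acids,
#following the FASTA file format).

sequence = "<|endoftext|>\nMGEAMGLTQPAVSRAVARLEERVGIRIFNRTARAITLTDEGRRFYEAVAPLLAGIEMHGY\nRVNVEGVAQLLELYARDILAEGRLVQLLPEWAD\n<|endoftext|>"

# ppl function
def calculatePerplexity(sequence, model, tokenizer):
    # Encode to a (1, n_tokens) tensor and move it to the model's device.
    input_ids = torch.tensor(tokenizer.encode(sequence)).unsqueeze(0)
    input_ids = input_ids.to(device)
    with torch.no_grad():
        # With labels=input_ids the model returns the mean next-token loss.
        outputs = model(input_ids, labels=input_ids)
    loss, logits = outputs[:2]
    return math.exp(loss.item())

#And hence:
ppl = calculatePerplexity(sequence, model, tokenizer)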

Here, ppl is the perplexity of that sequence. We do not yet have a threshold for which perplexity value separates a 'good' sequence from a 'bad' one, but given the fast inference times, the best approach is to sample many sequences, sort them by perplexity, and select those with the lowest values (the lower the better).
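The formatting step (line breaks every 60 amino acids, wrapped in "<|endoftext|>") and the sample-then-sort advice can be sketched as two small helpers. This is a minimal sketch under assumptions: format_for_perplexity and rank_by_perplexity are hypothetical names, and score_fn stands for any callable that returns a perplexity for a formatted string, for example a wrapper around calculatePerplexity.

```python
def format_for_perplexity(raw_sequence, width=60):
    """Re-wrap a sequence at `width` residues per line and add '<|endoftext|>' markers."""
    residues = "".join(raw_sequence.split())  # drop any existing line breaks
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return "<|endoftext|>\n" + "\n".join(lines) + "\n<|endoftext|>"


def rank_by_perplexity(sequences, score_fn, keep=None):
    """Return (score, sequence) pairs sorted from lowest to highest score."""
    scored = [(score_fn(format_for_perplexity(seq)), seq) for seq in sequences]
    scored.sort(key=lambda pair: pair[0])
    return scored if keep is None else scored[:keep]


raw = ("MGEAMGLTQPAVSRAVARLEERVGIRIFNRTARAITLTDEGRRFYEAVAPLLAGIEMHGYR\n"
       "VNVEGVAQLLELYARDILAEGRLVQLLPEWAD")
formatted = format_for_perplexity(raw)
print(formatted)

# Placeholder scorer for demonstration only: string length is NOT a perplexity.
ranked = rank_by_perplexity(["MKTAYIAK", "MKT"], score_fn=len)
print(ranked[0][1])
```

In practice score_fn would be something like lambda s: calculatePerplexity(s, model, tokenizer), and keep would be set to the number of candidates to carry forward.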

Training specs

The model was trained on 128 NVIDIA A100 GPUs for 50 epochs, using a block size of 512 and a total batch size of 1024 (65,536 tokens per batch). The optimizer used was Adam (beta1 = 0.9, beta2 = 0.999) with a learning rate of 1e-3.

📄 License

This project is licensed under the Apache 2.0 license.
