# Ankh3 Protein Language Model
Ankh3 is a protein language model designed for protein feature extraction. It jointly optimizes two objectives to understand and generate protein sequences effectively.
## Quick Start
The following sections provide details on how to use Ankh3 for embedding extraction and sequence completion.
## Features
Ankh3 is jointly optimized on two objectives:
- Masked language modeling with multiple masking probabilities
- Protein sequence completion
### 1. Masked Language Modeling
- This task intentionally 'corrupts' an input protein sequence by masking a certain percentage (X%) of its individual tokens (amino acids), then trains the model to reconstruct the original sequence.
- Example of a protein sequence before and after corruption:
- Original protein sequence: MKAYVLINSRGP
- This sequence will be masked/corrupted using sentinel tokens as shown below:
- Sequence after corruption: `M <extra_id_0> A Y <extra_id_1> L I <extra_id_2> S R G <extra_id_3>`
- The decoder learns to map each sentinel token back to the amino acid that was masked.
- In this example, `<extra_id_0> K` means that `<extra_id_0>` corresponds to the amino acid "K", and so on.
- Decoder output: `<extra_id_0> K <extra_id_1> V <extra_id_2> N <extra_id_3> P` (see the sketch after this list).
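The corruption format above can be written out explicitly. The sketch below reproduces the example by hand; the masked positions are picked manually to match the example, whereas actual training samples them at the chosen masking rate (this is an illustration, not Ankh3's training code):

```python
# Illustrative span corruption on the example sequence.
# Positions 1, 4, 7, and 11 (K, V, N, P) are masked by hand to mirror the
# example above; real training samples masked positions at a given rate.
sequence = list("MKAYVLINSRGP")
masked_positions = {1, 4, 7, 11}

encoder_input, decoder_target, sentinel = [], [], 0
for i, aa in enumerate(sequence):
    if i in masked_positions:
        encoder_input.append(f"<extra_id_{sentinel}>")
        decoder_target += [f"<extra_id_{sentinel}>", aa]
        sentinel += 1
    else:
        encoder_input.append(aa)

print(" ".join(encoder_input))   # M <extra_id_0> A Y <extra_id_1> L I <extra_id_2> S R G <extra_id_3>
print(" ".join(decoder_target))  # <extra_id_0> K <extra_id_1> V <extra_id_2> N <extra_id_3> P
```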
### 2. Protein Sequence Completion
- This task cuts the input sequence into two segments: the first segment is fed to the encoder, and the decoder is trained to autoregressively generate the second segment, conditioned on the encoder's representation of the first.
- Example of protein sequence completion (see the sketch after this list):
- Original sequence: MKAYVLINSRGP
- The first part, "MKAYVL", is passed to the encoder; given the encoder's representation of that part, the decoder is trained to output the second part: "INSRGP".
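In code, the split looks like this (the 50/50 split point mirrors the usage example below; the split point used during training may vary):

```python
# Cut the example sequence into an encoder segment and a completion target.
sequence = "MKAYVLINSRGP"
split = len(sequence) // 2
encoder_segment = sequence[:split]    # "MKAYVL" -> fed to the encoder
completion_target = sequence[split:]  # "INSRGP" -> generated by the decoder
```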
## Installation
The code snippets below rely on the transformers library, with PyTorch as the backend. Install both with:

```bash
pip install transformers torch
```
## Usage Examples
### Basic Usage: Embedding Extraction
```python
from transformers import T5EncoderModel, T5Tokenizer
import torch

sequence = "MDTAYPREDTRAPTPSKAGAHTALTLGAPHPPPRDHLIWSVFSTLYLNLCCLGFLALAYSIKARDQKVVGDLEAARRFGSKAKCYNILAAMWTLVPPLLLLGLVVTGALHLARLAKDSAAFFSTKFDDADYD"
ckpt = "ElnaggarLab/ankh3-xl"
tokenizer = T5Tokenizer.from_pretrained(ckpt)
# Only the encoder is needed for embedding extraction.
encoder_model = T5EncoderModel.from_pretrained(ckpt).eval()

# Prefix the sequence with the [NLU] task token used for embedding extraction.
nlu_sequence = "[NLU]" + sequence
encoded_nlu_sequence = tokenizer(nlu_sequence, add_special_tokens=True, return_tensors="pt", is_split_into_words=False)

# Run the encoder without tracking gradients.
with torch.no_grad():
    embedding = encoder_model(**encoded_nlu_sequence)
```
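The encoder call returns a standard transformers output whose `last_hidden_state` holds one vector per token. A minimal sketch of turning this into a fixed-size, per-protein embedding via mean pooling (the pooling strategy is an illustrative choice, not something the model card prescribes):

```python
# Per-residue embeddings: shape (batch_size, sequence_length, hidden_size).
per_residue = embedding.last_hidden_state

# Mean-pool over non-padding tokens to get one vector per protein.
mask = encoded_nlu_sequence["attention_mask"].unsqueeze(-1)
per_protein = (per_residue * mask).sum(dim=1) / mask.sum(dim=1)
print(per_residue.shape, per_protein.shape)
```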
### Advanced Usage: Sequence Completion
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers.generation import GenerationConfig

sequence = "MDTAYPREDTRAPTPSKAGAHTALTLGAPHPPPRDHLIWSVFSTLYLNLCCLGFLALAYSIKARDQKVVGDLEAARRFGSKAKCYNILAAMWTLVPPLLLLGLVVTGALHLARLAKDSAAFFSTKFDDADYD"
ckpt = "ElnaggarLab/ankh3-xl"
tokenizer = T5Tokenizer.from_pretrained(ckpt)
model = T5ForConditionalGeneration.from_pretrained(ckpt).eval()

# Prefix the first half of the sequence with the [S2S] task token used for completion.
half_length = int(len(sequence) * 0.5)
s2s_sequence = "[S2S]" + sequence[:half_length]
encoded_s2s_sequence = tokenizer(s2s_sequence, add_special_tokens=True, return_tensors="pt", is_split_into_words=False)

# Greedy decoding that generates exactly the length of the missing half.
gen_config = GenerationConfig(min_length=half_length + 1, max_length=half_length + 1, do_sample=False, num_beams=1)
generated_sequence = model.generate(input_ids=encoded_s2s_sequence["input_ids"], generation_config=gen_config)

# Decode, dropping special tokens and any spaces the tokenizer may insert between amino acids.
completion = tokenizer.batch_decode(generated_sequence, skip_special_tokens=True)[0].replace(" ", "")
predicted_sequence = sequence[:half_length] + completion
```
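Since the full sequence is known in this example, the generated completion can be checked against the true second half:

```python
# Compare the model's completion with the actual second half of the sequence.
print("generated:", predicted_sequence[half_length:])
print("actual:   ", sequence[half_length:])
```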
## Documentation
### Model Information

| Property | Details |
|----------|---------|
| Library Name | transformers |
| Model Type | Protein Language Model |
| License | cc-by-nc-sa-4.0 |
| Pipeline Tag | feature-extraction |
| Tags | protein language model |
| Datasets | UniRef50 |
## License
This model is released under the CC BY-NC-SA 4.0 license.