# Ankh3 Protein Language Model
Ankh3 is a protein language model designed for protein feature extraction. It jointly optimizes two objectives to understand and generate protein sequences effectively.
## Quick Start
The following sections provide details on how to use Ankh3 for embedding extraction and sequence completion.
## Features
Ankh3 is jointly optimized on two objectives:
- Masked language modeling with multiple masking probabilities
- Protein sequence completion
### 1. Masked Language Modeling
- This task intentionally 'corrupts' an input protein sequence by masking a certain percentage (X%) of its individual tokens (amino acids), then trains the model to reconstruct the original sequence.
- Example of a protein sequence before and after corruption:
- Original protein sequence: MKAYVLINSRGP
- This sequence will be masked/corrupted using sentinel tokens as shown below:
- Sequence after corruption: `M <extra_id_0> A Y <extra_id_1> L I <extra_id_2> S R G <extra_id_3>`
- The decoder learns to map each sentinel token back to the amino acid that was masked.
- In this example, `<extra_id_0> K` means that `<extra_id_0>` corresponds to the amino acid "K", and so on.
- Decoder output: `<extra_id_0> K <extra_id_1> V <extra_id_2> N <extra_id_3> P` (see the sketch after this list).
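The corruption format above can be written out explicitly. The sketch below reproduces the example by hand; the masked positions are picked manually to match the example, whereas actual training samples them at the chosen masking rate (this is an illustration, not Ankh3's training code):

```python
# Illustrative span corruption on the example sequence.
# Positions 1, 4, 7, and 11 (K, V, N, P) are masked by hand to mirror the
# example above; real training samples masked positions at a given rate.
sequence = list("MKAYVLINSRGP")
masked_positions = {1, 4, 7, 11}

encoder_input, decoder_target, sentinel = [], [], 0
for i, aa in enumerate(sequence):
    if i in masked_positions:
        encoder_input.append(f"<extra_id_{sentinel}>")
        decoder_target += [f"<extra_id_{sentinel}>", aa]
        sentinel += 1
    else:
        encoder_input.append(aa)

print(" ".join(encoder_input))   # M <extra_id_0> A Y <extra_id_1> L I <extra_id_2> S R G <extra_id_3>
print(" ".join(decoder_target))  # <extra_id_0> K <extra_id_1> V <extra_id_2> N <extra_id_3> P
```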
### 2. Protein Sequence Completion
- This task cuts the input sequence into two segments: the first segment is fed to the encoder, and the decoder is trained to autoregressively generate the second segment, conditioned on the encoder's representation of the first.
- Example of protein sequence completion (see the sketch after this list):
- Original sequence: MKAYVLINSRGP
- The first part, "MKAYVL", is passed to the encoder; given the encoder's representation of that part, the decoder is trained to output the second part: "INSRGP".
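In code, the split looks like this (the 50/50 split point mirrors the usage example below; the split point used during training may vary):

```python
# Cut the example sequence into an encoder segment and a completion target.
sequence = "MKAYVLINSRGP"
split = len(sequence) // 2
encoder_segment = sequence[:split]    # "MKAYVL" -> fed to the encoder
completion_target = sequence[split:]  # "INSRGP" -> generated by the decoder
```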
## Installation
The code snippets below rely on the transformers library, with PyTorch as the backend. Install both with:

```bash
pip install transformers torch
```
## Usage Examples
### Basic Usage: Embedding Extraction
```python
from transformers import T5EncoderModel, T5Tokenizer
import torch

sequence = "MDTAYPREDTRAPTPSKAGAHTALTLGAPHPPPRDHLIWSVFSTLYLNLCCLGFLALAYSIKARDQKVVGDLEAARRFGSKAKCYNILAAMWTLVPPLLLLGLVVTGALHLARLAKDSAAFFSTKFDDADYD"
ckpt = "ElnaggarLab/ankh3-xl"
tokenizer = T5Tokenizer.from_pretrained(ckpt)
# Only the encoder is needed for embedding extraction.
encoder_model = T5EncoderModel.from_pretrained(ckpt).eval()

# Prefix the sequence with the [NLU] task token used for embedding extraction.
nlu_sequence = "[NLU]" + sequence
encoded_nlu_sequence = tokenizer(nlu_sequence, add_special_tokens=True, return_tensors="pt", is_split_into_words=False)

# Run the encoder without tracking gradients.
with torch.no_grad():
    embedding = encoder_model(**encoded_nlu_sequence)
```
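The encoder call returns a standard transformers output whose `last_hidden_state` holds one vector per token. A minimal sketch of turning this into a fixed-size, per-protein embedding via mean pooling (the pooling strategy is an illustrative choice, not something the model card prescribes):

```python
# Per-residue embeddings: shape (batch_size, sequence_length, hidden_size).
per_residue = embedding.last_hidden_state

# Mean-pool over non-padding tokens to get one vector per protein.
mask = encoded_nlu_sequence["attention_mask"].unsqueeze(-1)
per_protein = (per_residue * mask).sum(dim=1) / mask.sum(dim=1)
print(per_residue.shape, per_protein.shape)
```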
### Advanced Usage: Sequence Completion
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers.generation import GenerationConfig

sequence = "MDTAYPREDTRAPTPSKAGAHTALTLGAPHPPPRDHLIWSVFSTLYLNLCCLGFLALAYSIKARDQKVVGDLEAARRFGSKAKCYNILAAMWTLVPPLLLLGLVVTGALHLARLAKDSAAFFSTKFDDADYD"
ckpt = "ElnaggarLab/ankh3-xl"
tokenizer = T5Tokenizer.from_pretrained(ckpt)
model = T5ForConditionalGeneration.from_pretrained(ckpt).eval()

# Prefix the first half of the sequence with the [S2S] task token used for completion.
half_length = int(len(sequence) * 0.5)
s2s_sequence = "[S2S]" + sequence[:half_length]
encoded_s2s_sequence = tokenizer(s2s_sequence, add_special_tokens=True, return_tensors="pt", is_split_into_words=False)

# Greedy decoding that generates exactly the length of the missing half.
gen_config = GenerationConfig(min_length=half_length + 1, max_length=half_length + 1, do_sample=False, num_beams=1)
generated_sequence = model.generate(input_ids=encoded_s2s_sequence["input_ids"], generation_config=gen_config)

# Decode, dropping special tokens and any spaces the tokenizer may insert between amino acids.
completion = tokenizer.batch_decode(generated_sequence, skip_special_tokens=True)[0].replace(" ", "")
predicted_sequence = sequence[:half_length] + completion
```
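Since the full sequence is known in this example, the generated completion can be checked against the true second half:

```python
# Compare the model's completion with the actual second half of the sequence.
print("generated:", predicted_sequence[half_length:])
print("actual:   ", sequence[half_length:])
```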
## Documentation
### Model Information

| Property | Details |
|----------|---------|
| Library Name | transformers |
| Model Type | Protein Language Model |
| License | cc-by-nc-sa-4.0 |
| Pipeline Tag | feature-extraction |
| Tags | protein language model |
| Datasets | UniRef50 |
## License
This model is released under the CC BY-NC-SA 4.0 license.