# Protein Matryoshka Embeddings
This model generates embeddings for input proteins. It leverages the Matryoshka loss, allowing shortened embeddings for faster search and other tasks.
## Quick Start
Inputs use IUPAC-IUB codes, where the letters A-Z correspond to amino acids, with residues separated by spaces. For example:
"M A R N W S F R V"
The base model is Rostlab/prot_bert_bfd. A sentence-transformers model was trained on the cosine similarity of embeddings from UniProt. For train/test/validation datasets of embeddings and distances, refer to: https://huggingface.co/datasets/monsoon-nlp/protein-pairs-uniprot-swissprot
## Features
- Generates embeddings for input proteins.
- Trained with Matryoshka loss, so shortened embeddings can be used for faster search and other downstream tasks.
- Can be used in protein-related tasks such as classification and regression.
## Installation
Install the required dependencies:

```bash
pip install -U sentence-transformers datasets
```
## Usage Examples
### Basic Usage
```python
from sentence_transformers import SentenceTransformer

sequences = ["M S L E Q K...", "M A R N W S F R V..."]

model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')
embeddings = model.encode(sequences)
print(embeddings)
```
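Because the model was trained with a Matryoshka loss, a prefix of each embedding can stand in for the full vector. The sketch below continues from the snippet above, truncating to 128 dimensions and re-normalizing before computing cosine similarities; 128 matches the validation section below, but other prefix lengths are possible.

```python
import numpy as np

# Matryoshka-style truncation: keep only the leading 128 dimensions.
short = np.asarray(embeddings)[:, :128]

# Re-normalize so that dot products are cosine similarities.
short = short / np.linalg.norm(short, axis=1, keepdims=True)

# Pairwise cosine similarity between the truncated embeddings.
print(short @ short.T)
```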
## Documentation
### Training + Code
- Colab notebook: https://colab.research.google.com/drive/1uBk-jHOAPhIiUPPunfK7bMC8GnzpwmBy?usp=sharing
- Results on 1,000 protein pairs from the validation dataset during training:
| steps | cosine_pearson | cosine_spearman |
| --- | --- | --- |
| 3000 | 0.8598688660086558 | 0.8666855900999677 |
| 6000 | 0.8692703523988448 | 0.8615673651584274 |
| 9000 | 0.8779733537629968 | 0.8754158959780602 |
| 12000 | 0.8877422045031667 | 0.8881492475969834 |
| 15000 | 0.9027359688395733 | 0.899106724739699 |
| 18000 | 0.9046675789738002 | 0.9044183600191271 |
| 21000 | 0.9165801536390973 | 0.9061381997421003 |
| 24000 | 0.9128046401341833 | 0.9076748537082228 |
| 27000 | 0.918547416546341 | 0.9127677526055185 |
| 30000 | 0.9239429677657788 | 0.9187051589781693 |
### Validation
Scatter plots comparing the full and 128-dim embeddings to the original embeddings, using pairs from the test set: https://colab.research.google.com/drive/1hm4IIMXaLt_7QYRNvkiXl5BqmsHdC1Ue?usp=sharing
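A rough local check of the same idea is to correlate pairwise similarities from full and truncated embeddings. This is a sketch under assumptions (toy sequences, `scipy` installed), not the notebook's exact code:

```python
from itertools import combinations
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')

# Toy sequences; in practice use pairs from the linked test set.
seqs = ["M A R N W S F R V", "M S L E Q K K G D", "M K T A Y I A K Q R"]
emb = model.encode(seqs, normalize_embeddings=True)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare pairwise similarities from full vs. 128-dim embeddings.
full_sims, short_sims = [], []
for i, j in combinations(range(len(seqs)), 2):
    full_sims.append(cosine(emb[i], emb[j]))
    short_sims.append(cosine(emb[i][:128], emb[j][:128]))

print("pearson: ", pearsonr(full_sims, short_sims)[0])
print("spearman:", spearmanr(full_sims, short_sims)[0])
```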
### Finetuning / Tasks
- One of the more popular evaluations is [Tasks Assessing Protein Embeddings (TAPE)](https://github.com/songlab-cal/tape)
- Example using scikit-learn to train on Fluorescence, a regression task from TAPE: https://colab.research.google.com/drive/1cH9jOBSC56mqJHU_6ztQPp6qWJguNjAn?usp=sharing
- Example using scikit-learn to train on a classification task from [greenbeing-binary](https://huggingface.co/datasets/monsoon-nlp/greenbeing-binary): https://colab.research.google.com/drive/1MCTn8f3oeIKpB6n_8mPumet3ukm7GD8a?usp=sharing (a generic sketch of the embed-then-fit pattern follows this list)
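The general pattern in those notebooks is to embed sequences once and then fit a standard scikit-learn estimator on the embeddings. A hedged sketch with placeholder data (the sequences and labels below are made up, and Ridge is just one reasonable choice of regressor):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge

model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')

# Placeholder training data: space-separated sequences with a numeric label
# (e.g. a fluorescence value from TAPE).
train_seqs = ["M A R N W S F R V", "M S L E Q K K G D", "M K T A Y I A K Q R"]
train_y = [1.2, 0.4, 3.1]

# Embed once, then fit any scikit-learn estimator on the embeddings.
reg = Ridge().fit(model.encode(train_seqs), train_y)

# Predict for unseen proteins.
print(reg.predict(model.encode(["M G D V E K G K K"])))
```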
### Future
- This page will be updated when there are examples of using it on protein classification tasks.
- The author is interested in whether [embedding quantization](https://huggingface.co/blog/embedding-quantization) could be more efficient.
- Collaboration requests for future projects or offers of resources for longer training on more embeddings are welcome.
## License
This project is released under a Creative Commons (CC) license.