# Protein Matryoshka Embeddings
This model generates embeddings for input proteins. It leverages the Matryoshka loss, allowing shortened embeddings for faster search and other tasks.
## Quick Start
Inputs use IUPAC-IUB codes, where the letters A-Z correspond to amino acids, with residues separated by spaces. For example:
"M A R N W S F R V"
The base model is Rostlab/prot_bert_bfd. A sentence-transformers model was trained on the cosine similarity of embeddings from UniProt. For train/test/validation datasets of embeddings and distances, refer to: https://huggingface.co/datasets/monsoon-nlp/protein-pairs-uniprot-swissprot
## Features
- Generates embeddings for input proteins.
- Trained with Matryoshka loss, so shortened embeddings can be used for faster search and other downstream tasks.
- Can be used in protein-related tasks such as classification and regression.
## Installation
Install the required dependencies:

```bash
pip install -U sentence-transformers datasets
```
## Usage Examples
### Basic Usage
```python
from sentence_transformers import SentenceTransformer

sequences = ["M S L E Q K...", "M A R N W S F R V..."]

model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')
embeddings = model.encode(sequences)
print(embeddings)
```
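Because the model was trained with a Matryoshka loss, a prefix of each embedding can stand in for the full vector. The sketch below continues from the snippet above, truncating to 128 dimensions and re-normalizing before computing cosine similarities; 128 matches the validation section below, but other prefix lengths are possible.

```python
import numpy as np

# Matryoshka-style truncation: keep only the leading 128 dimensions.
short = np.asarray(embeddings)[:, :128]

# Re-normalize so that dot products are cosine similarities.
short = short / np.linalg.norm(short, axis=1, keepdims=True)

# Pairwise cosine similarity between the truncated embeddings.
print(short @ short.T)
```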
## Documentation
### Training + Code
- Colab notebook: https://colab.research.google.com/drive/1uBk-jHOAPhIiUPPunfK7bMC8GnzpwmBy?usp=sharing
- Results on 1,000 protein pairs from the validation dataset during training:
| steps | cosine_pearson | cosine_spearman |
| --- | --- | --- |
| 3000 | 0.8598688660086558 | 0.8666855900999677 |
| 6000 | 0.8692703523988448 | 0.8615673651584274 |
| 9000 | 0.8779733537629968 | 0.8754158959780602 |
| 12000 | 0.8877422045031667 | 0.8881492475969834 |
| 15000 | 0.9027359688395733 | 0.899106724739699 |
| 18000 | 0.9046675789738002 | 0.9044183600191271 |
| 21000 | 0.9165801536390973 | 0.9061381997421003 |
| 24000 | 0.9128046401341833 | 0.9076748537082228 |
| 27000 | 0.918547416546341 | 0.9127677526055185 |
| 30000 | 0.9239429677657788 | 0.9187051589781693 |
### Validation
Scatter plots comparing the full and 128-dim embeddings to the original embeddings, using pairs from the test set: https://colab.research.google.com/drive/1hm4IIMXaLt_7QYRNvkiXl5BqmsHdC1Ue?usp=sharing
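A rough local check of the same idea is to correlate pairwise similarities from full and truncated embeddings. This is a sketch under assumptions (toy sequences, `scipy` installed), not the notebook's exact code:

```python
from itertools import combinations
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')

# Toy sequences; in practice use pairs from the linked test set.
seqs = ["M A R N W S F R V", "M S L E Q K K G D", "M K T A Y I A K Q R"]
emb = model.encode(seqs, normalize_embeddings=True)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare pairwise similarities from full vs. 128-dim embeddings.
full_sims, short_sims = [], []
for i, j in combinations(range(len(seqs)), 2):
    full_sims.append(cosine(emb[i], emb[j]))
    short_sims.append(cosine(emb[i][:128], emb[j][:128]))

print("pearson: ", pearsonr(full_sims, short_sims)[0])
print("spearman:", spearmanr(full_sims, short_sims)[0])
```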
### Finetuning / Tasks
- One of the more popular evaluations is [Tasks Assessing Protein Embeddings (TAPE)](https://github.com/songlab-cal/tape)
- Example using scikit-learn to train on Fluorescence, a regression task from TAPE: https://colab.research.google.com/drive/1cH9jOBSC56mqJHU_6ztQPp6qWJguNjAn?usp=sharing
- Example using scikit-learn to train on a classification task from [greenbeing-binary](https://huggingface.co/datasets/monsoon-nlp/greenbeing-binary): https://colab.research.google.com/drive/1MCTn8f3oeIKpB6n_8mPumet3ukm7GD8a?usp=sharing (a generic sketch of the embed-then-fit pattern follows this list)
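The general pattern in those notebooks is to embed sequences once and then fit a standard scikit-learn estimator on the embeddings. A hedged sketch with placeholder data (the sequences and labels below are made up, and Ridge is just one reasonable choice of regressor):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge

model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')

# Placeholder training data: space-separated sequences with a numeric label
# (e.g. a fluorescence value from TAPE).
train_seqs = ["M A R N W S F R V", "M S L E Q K K G D", "M K T A Y I A K Q R"]
train_y = [1.2, 0.4, 3.1]

# Embed once, then fit any scikit-learn estimator on the embeddings.
reg = Ridge().fit(model.encode(train_seqs), train_y)

# Predict for unseen proteins.
print(reg.predict(model.encode(["M G D V E K G K K"])))
```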
### Future
- This page will be updated when there are examples of using it on protein classification tasks.
- The author is interested in whether [embedding quantization](https://huggingface.co/blog/embedding-quantization) could be more efficient.
- Collaboration requests for future projects or offers of resources for longer training on more embeddings are welcome.
## License
This project is released under a Creative Commons (CC) license.