# ESM2 Protein Function Caller
An Evolutionary-scale Model (ESM) for predicting protein functions from amino acid sequences using the Gene Ontology (GO). It offers insights into a protein's molecular function, biological process, and cellular location.
## Quick Start
The model is based on the ESM2 Transformer architecture, pre-trained on UniRef50, and fine-tuned on the AmiGO dataset to predict the GO subgraph for a given protein sequence.
Note: This version only models the *molecular function* subgraph of the Gene Ontology.
## Features
- **Library**: `transformers`
- **Tags**: `gene-ontology`, `proteomics`
- **Datasets**: AmiGO
- **Base Model**: `facebook/esm2_t30_150M_UR50D`
- **Pipeline Tag**: `text-classification`
## Documentation
### What are GO terms?
"The Gene Ontology (GO) is a concept hierarchy that describes the biological function of genes and gene products at different levels of abstraction (Ashburner et al., 2000). It is a good model to describe the multi - faceted nature of protein function."
"GO is a directed acyclic graph. The nodes in this graph are functional descriptors (terms or classes) connected by relational ties between them (is_a, part_of, etc.). For example, terms 'protein binding activity' and 'binding activity' are related by an is_a relationship; however, the edge in the graph is often reversed to point from binding towards protein binding. This graph contains three subgraphs (subontologies): Molecular Function (MF), Biological Process (BP), and Cellular Component (CC), defined by their root nodes. Biologically, each subgraph represent a different aspect of the protein's function: what it does on a molecular level (MF), which biological processes it participates in (BP) and where in the cell it is located (CC)."
From [CAFA 5 Protein Function Prediction](https://www.kaggle.com/competitions/cafa-5-protein-function-prediction/data)
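Because GO is a directed acyclic graph, a predicted term implies all of its ancestors (the so-called true-path rule). The sketch below illustrates this with a hypothetical `parents` mapping over a few MF terms; real term identifiers and edges would come from an ontology release, not this toy dictionary.

```python
# A toy slice of GO's DAG: each term maps to its is_a parents.
# These names are illustrative only, not real GO identifiers.
parents = {
    "protein binding": ["binding"],
    "binding": ["molecular_function"],
    "molecular_function": [],  # root of the MF subgraph
}

def ancestors(term: str) -> set[str]:
    """Collect every term reachable by following is_a edges to the root."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

print(ancestors("protein binding"))  # {'binding', 'molecular_function'}
```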
### Code Repository
You can access the code repository at [https://github.com/andrewdalpino/esm2-function-classifier](https://github.com/andrewdalpino/esm2-function-classifier).
### Model Specs
| Property | Details |
|---|---|
| Vocabulary Size | 33 |
| Embedding Dimensions | 640 |
| Attention Heads | 20 |
| Encoder Layers | 30 |
| Context Length | 1026 |
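These properties can be cross-checked against the published model configuration. A quick sketch, assuming the standard `transformers` ESM config attribute names:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("andrewdalpino/ESM2-35M-Protein-Molecular-Function")

print(config.vocab_size)               # vocabulary size
print(config.hidden_size)              # embedding dimensions
print(config.num_attention_heads)      # attention heads
print(config.num_hidden_layers)        # encoder layers
print(config.max_position_embeddings)  # context length
```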
## Usage Examples
### Basic Usage
For a basic demonstration, we can rank the GO terms for a particular sequence. For a more advanced example, see the [predict-subgraph.py](https://github.com/andrewdalpino/esm2-function-classifier/blob/master/predict-subgraph.py) source file.
```python
import torch

from transformers import EsmTokenizer, EsmForSequenceClassification

model_name = "andrewdalpino/ESM2-35M-Protein-Molecular-Function"

tokenizer = EsmTokenizer.from_pretrained(model_name)
model = EsmForSequenceClassification.from_pretrained(model_name)

model.eval()

sequence = "MCNAWYISVDFEKNREDKSKCIHTRRNSGPKLLEHVMYEVLRDWYCLEGENVYMM"

top_k = 10

out = tokenizer(sequence)

# Add a batch dimension to the token IDs.
input_ids = torch.tensor(out["input_ids"], dtype=torch.int64).unsqueeze(0)

with torch.no_grad():
    outputs = model(input_ids)

# GO term prediction is multi-label, so each logit gets an independent sigmoid.
probabilities = torch.sigmoid(outputs.logits.squeeze(0))

probabilities, indices = torch.topk(probabilities, top_k)

probabilities = probabilities.tolist()

terms = [model.config.id2label[index] for index in indices.tolist()]

print(f"Top {top_k} GO Terms:")

for term, probability in zip(terms, probabilities):
    print(f"{probability:.4f}: {term}")
```
## References
- A. Rives, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences, 2021.
- Z. Lin, et al. Evolutionary-scale prediction of atomic level protein structure with a language model, 2022.
- G. A. Merino, et al. Hierarchical deep learning for predicting GO annotations by integrating protein knowledge, 2022.
- M. Ashburner, et al. Gene Ontology: tool for the unification of biology, 2000.