# GO-Language model
This GO-Language model encodes the Gene Ontology definition of a protein as a vector representation. It is trained on gene-ontology terms from model organisms and can be used to predict novel gene functions.
## Quick Start
The GO-Language model is a BERT-style masked language model. You can use it to predict the most likely token at a masked position:
```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="damlab/GO-language")
unmasker("involved_in [MASK] involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372")
```
```
[{'score': 0.1040298342704773,
  'token': 103,
  'token_str': 'GO:0002250',
  'sequence': 'involved_in GO:0002250 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.018045395612716675,
  'token': 21,
  'token_str': 'GO:0005576',
  'sequence': 'involved_in GO:0005576 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.015035462565720081,
  'token': 50,
  'token_str': 'GO:0000139',
  'sequence': 'involved_in GO:0000139 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.01181247178465128,
  'token': 37,
  'token_str': 'GO:0007165',
  'sequence': 'involved_in GO:0007165 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.01000668853521347,
  'token': 14,
  'token_str': 'GO:0005737',
  'sequence': 'involved_in GO:0005737 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'}]
```
## Features
- Encodes the Gene Ontology definition of a protein as a vector representation.
- Allows exploration of gene-level similarities and comparisons between functional terms.
- Can be used as a translation model between PROT-BERT and GO-Language for predicting novel gene functions.
## Installation
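The model is distributed through the Hugging Face Hub, so the `transformers` library with a backend such as PyTorch is all that is needed; a minimal setup, assuming a standard Python environment:

```bash
pip install transformers torch
```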
## Usage Examples
### Basic Usage
```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="damlab/GO-language")
unmasker("involved_in [MASK] involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372")
```
### Advanced Usage
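A minimal sketch of extracting a vector representation for a set of GO annotations, which is the model's stated purpose. The mean-pooling step is an assumption for illustration, not part of the published recipe:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")
model = AutoModel.from_pretrained("damlab/GO-language")

sequence = "involved_in GO:0002250 involved_in GO:0007165 located_in GO:0042470"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)

# Mean-pool over tokens to get one vector per annotation set
# (the pooling choice is an assumption).
embedding = hidden.mean(dim=1)
print(embedding.shape)  # torch.Size([1, hidden_size])
```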
## Documentation
### Summary
This model was built to encode the Gene Ontology definition of a protein as a vector representation. It was trained on a collection of gene-ontology terms from model organisms. Each function was sorted by its ID number and combined with its annotation description (e.g. `is_a`, `enables`, `located_in`). The model is tokenized such that each description and GO term is its own token. It is intended to be used as a translation model between PROT-BERT and GO-Language; that type of translation model will be useful for predicting the function of novel genes.
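As an illustration of that input format, the sketch below builds a model-ready string from a hypothetical set of (relation, GO term) annotations; the variable names and annotation values are illustrative only:

```python
# Hypothetical annotations for one protein, in arbitrary order.
annotations = [
    ("involved_in", "GO:0070372"),
    ("located_in", "GO:0042470"),
    ("involved_in", "GO:0007165"),
]

# Sort terms by GO ID number and pair each with its relation descriptor,
# matching the format described above.
sequence = " ".join(
    f"{relation} {go_id}"
    for relation, go_id in sorted(annotations, key=lambda x: x[1])
)
print(sequence)
# involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372
```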
### Model Description
This model was trained using the damlab/uniprot dataset on the `go` field, with 256-token chunks and a 15% mask rate.
### Intended Uses & Limitations
This model is a useful encapsulation of gene-ontology functions. It allows both exploration of gene-level similarities and comparisons between functional terms.
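For example, one way to compare two annotation sets is cosine similarity over pooled embeddings; a sketch, where the `feature-extraction` pipeline and the mean-pooling step are both illustrative choices rather than documented usage:

```python
import numpy as np
from transformers import pipeline

extractor = pipeline("feature-extraction", model="damlab/GO-language")

def embed(annotation_string):
    # Mean-pool the per-token vectors into one embedding
    # (pooling choice is an assumption).
    return np.mean(extractor(annotation_string)[0], axis=0)

vec_a = embed("involved_in GO:0002250 involved_in GO:0007165")
vec_b = embed("located_in GO:0042470 involved_in GO:0070372")

cosine = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(f"cosine similarity: {cosine:.3f}")
```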
### Training Data
The model was trained on the damlab/uniprot dataset starting from a randomly initialized model. The Gene Ontology functions were sorted by ID number along with their annotating terms.
### Training Procedure
#### Preprocessing
All strings were concatenated and chunked into 256-token chunks for training. A random 20% of the chunks was held out for validation.
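A minimal sketch of that chunking and split, assuming the token IDs have already been produced by the model's tokenizer (the function names are illustrative):

```python
import random

def chunk_tokens(token_ids, chunk_size=256):
    """Split one concatenated token stream into fixed-size training chunks."""
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]

def train_val_split(chunks, val_fraction=0.2, seed=42):
    """Hold out a random 20% of chunks for validation, as described above."""
    chunks = chunks[:]
    random.Random(seed).shuffle(chunks)
    n_val = int(len(chunks) * val_fraction)
    return chunks[n_val:], chunks[:n_val]
```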
#### Training
Training was performed with the Hugging Face training module using the MaskedLM data loader with a 15% masking rate. The learning rate was set at E-5 with 50K warm-up steps and a `cosine_with_restarts` learning-rate schedule, and training continued until 3 consecutive epochs failed to improve the loss on the held-out dataset.
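A hedged sketch of that setup with the `transformers` `Trainer`. The learning-rate value ("E-5" is read here as 1e-5), the output directory, and the `train_chunks`/`val_chunks` datasets are all assumptions, not part of the published configuration:

```python
from transformers import (
    AutoConfig,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")

# Start from a randomly initialized model, as described under Training Data.
config = AutoConfig.from_pretrained("damlab/GO-language")
model = AutoModelForMaskedLM.from_config(config)

# Mask 15% of tokens, matching the reported masking rate.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="go-language-mlm",       # assumed name
    learning_rate=1e-5,                 # "E-5" in the model card; 1e-5 assumed
    warmup_steps=50_000,
    lr_scheduler_type="cosine_with_restarts",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="loss",
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_chunks,         # hypothetical pre-chunked datasets
    eval_dataset=val_chunks,
    # Stop after 3 consecutive epochs without improvement on held-out loss.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```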
### BibTeX Entry and Citation Info
[More Information Needed]
## Technical Details
The model is a BERT-style masked language model. It processes the Gene Ontology data by sorting functions by ID number, combining them with annotation descriptions, and tokenizing each description and GO term as an individual token. During training it uses the MaskedLM data loader with a 15% masking rate, a learning rate of E-5, 50K warm-up steps, and a `cosine_with_restarts` learning-rate schedule.
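A quick way to inspect that tokenization behavior; the printed output is what the one-token-per-term scheme described above implies, not a verified transcript:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")
print(tokenizer.tokenize("involved_in GO:0007165 located_in GO:0042470"))
# Expected under the one-token-per-term scheme:
# ['involved_in', 'GO:0007165', 'located_in', 'GO:0042470']
```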
## License
The model is released under the MIT license.
| Property | Details |
|----------|---------|
| Model Type | GO-Language model |
| Training Data | damlab/uniprot |