# GO-Language model
This GO-Language model encodes the Gene Ontology definition of a protein as a vector representation. It is trained on gene-ontology terms from model organisms and can be used to predict novel gene functions.
## Quick Start
The GO-Language model is a BERT-style masked language model. You can use it to predict the most likely token at a masked position:
```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="damlab/GO-language")
unmasker("involved_in [MASK] involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372")
```
```
[{'score': 0.1040298342704773,
  'token': 103,
  'token_str': 'GO:0002250',
  'sequence': 'involved_in GO:0002250 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.018045395612716675,
  'token': 21,
  'token_str': 'GO:0005576',
  'sequence': 'involved_in GO:0005576 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.015035462565720081,
  'token': 50,
  'token_str': 'GO:0000139',
  'sequence': 'involved_in GO:0000139 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.01181247178465128,
  'token': 37,
  'token_str': 'GO:0007165',
  'sequence': 'involved_in GO:0007165 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.01000668853521347,
  'token': 14,
  'token_str': 'GO:0005737',
  'sequence': 'involved_in GO:0005737 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'}]
```
## Features
- Encodes the Gene Ontology definition of a protein as a vector representation.
- Allows exploration of gene-level similarities and comparisons between functional terms.
- Can be used as a translation model between PROT-BERT and GO-Language for predicting novel gene functions.
## Installation
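The model is distributed through the Hugging Face Hub, so the `transformers` library with a backend such as PyTorch is all that is needed; a minimal setup, assuming a standard Python environment:

```bash
pip install transformers torch
```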
## Usage Examples
### Basic Usage
```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="damlab/GO-language")
unmasker("involved_in [MASK] involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372")
```
### Advanced Usage
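A minimal sketch of extracting a vector representation for a set of GO annotations, which is the model's stated purpose. The mean-pooling step is an assumption for illustration, not part of the published recipe:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")
model = AutoModel.from_pretrained("damlab/GO-language")

sequence = "involved_in GO:0002250 involved_in GO:0007165 located_in GO:0042470"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)

# Mean-pool over tokens to get one vector per annotation set
# (the pooling choice is an assumption).
embedding = hidden.mean(dim=1)
print(embedding.shape)  # torch.Size([1, hidden_size])
```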
## Documentation
### Summary
This model was built to encode the Gene Ontology definition of a protein as a vector representation. It was trained on a collection of gene-ontology terms from model organisms. Each function was sorted by its ID number and combined with its annotation description (e.g. `is_a`, `enables`, `located_in`). The model is tokenized such that each description and GO term is its own token. It is intended to be used as a translation model between PROT-BERT and GO-Language; that type of translation model will be useful for predicting the function of novel genes.
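As an illustration of that input format, the sketch below builds a model-ready string from a hypothetical set of (relation, GO term) annotations; the variable names and annotation values are illustrative only:

```python
# Hypothetical annotations for one protein, in arbitrary order.
annotations = [
    ("involved_in", "GO:0070372"),
    ("located_in", "GO:0042470"),
    ("involved_in", "GO:0007165"),
]

# Sort terms by GO ID number and pair each with its relation descriptor,
# matching the format described above.
sequence = " ".join(
    f"{relation} {go_id}"
    for relation, go_id in sorted(annotations, key=lambda x: x[1])
)
print(sequence)
# involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372
```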
### Model Description
This model was trained using the damlab/uniprot dataset on the `go` field, with 256-token chunks and a 15% mask rate.
### Intended Uses & Limitations
This model is a useful encapsulation of gene-ontology functions. It allows both exploration of gene-level similarities and comparisons between functional terms.
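For example, one way to compare two annotation sets is cosine similarity over pooled embeddings; a sketch, where the `feature-extraction` pipeline and the mean-pooling step are both illustrative choices rather than documented usage:

```python
import numpy as np
from transformers import pipeline

extractor = pipeline("feature-extraction", model="damlab/GO-language")

def embed(annotation_string):
    # Mean-pool the per-token vectors into one embedding
    # (pooling choice is an assumption).
    return np.mean(extractor(annotation_string)[0], axis=0)

vec_a = embed("involved_in GO:0002250 involved_in GO:0007165")
vec_b = embed("located_in GO:0042470 involved_in GO:0070372")

cosine = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(f"cosine similarity: {cosine:.3f}")
```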
### Training Data
The model was trained on the damlab/uniprot dataset starting from a randomly initialized model. The Gene Ontology functions were sorted by ID number along with their annotating terms.
### Training Procedure
#### Preprocessing
All strings were concatenated and chunked into 256-token chunks for training. A random 20% of the chunks was held out for validation.
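A minimal sketch of that chunking and split, assuming the token IDs have already been produced by the model's tokenizer (the function names are illustrative):

```python
import random

def chunk_tokens(token_ids, chunk_size=256):
    """Split one concatenated token stream into fixed-size training chunks."""
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]

def train_val_split(chunks, val_fraction=0.2, seed=42):
    """Hold out a random 20% of chunks for validation, as described above."""
    chunks = chunks[:]
    random.Random(seed).shuffle(chunks)
    n_val = int(len(chunks) * val_fraction)
    return chunks[n_val:], chunks[:n_val]
```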
#### Training
Training was performed with the Hugging Face training module using the MaskedLM data loader with a 15% masking rate. The learning rate was set at E-5 with 50K warm-up steps and a `cosine_with_restarts` learning-rate schedule, and training continued until 3 consecutive epochs failed to improve the loss on the held-out dataset.
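A hedged sketch of that setup with the `transformers` `Trainer`. The learning-rate value ("E-5" is read here as 1e-5), the output directory, and the `train_chunks`/`val_chunks` datasets are all assumptions, not part of the published configuration:

```python
from transformers import (
    AutoConfig,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")

# Start from a randomly initialized model, as described under Training Data.
config = AutoConfig.from_pretrained("damlab/GO-language")
model = AutoModelForMaskedLM.from_config(config)

# Mask 15% of tokens, matching the reported masking rate.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="go-language-mlm",       # assumed name
    learning_rate=1e-5,                 # "E-5" in the model card; 1e-5 assumed
    warmup_steps=50_000,
    lr_scheduler_type="cosine_with_restarts",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="loss",
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_chunks,         # hypothetical pre-chunked datasets
    eval_dataset=val_chunks,
    # Stop after 3 consecutive epochs without improvement on held-out loss.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```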
### BibTeX Entry and Citation Info
[More Information Needed]
## Technical Details
The model is a BERT-style masked language model. It processes the Gene Ontology data by sorting functions by ID number, combining them with annotation descriptions, and tokenizing each description and GO term as an individual token. During training it uses the MaskedLM data loader with a 15% masking rate, a learning rate of E-5, 50K warm-up steps, and a `cosine_with_restarts` learning-rate schedule.
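A quick way to inspect that tokenization behavior; the printed output is what the one-token-per-term scheme described above implies, not a verified transcript:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")
print(tokenizer.tokenize("involved_in GO:0007165 located_in GO:0042470"))
# Expected under the one-token-per-term scheme:
# ['involved_in', 'GO:0007165', 'located_in', 'GO:0042470']
```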
## License
The model is released under the MIT license.
| Property | Details |
|----------|---------|
| Model Type | GO-Language model |
| Training Data | damlab/uniprot |