HIV_BERT model
The HIV_BERT model is a refined version of the ProtBert-BFD model, specifically tailored for HIV-centric tasks. It was pre-trained on whole viral genomes from the Los Alamos HIV Sequence Database, which is crucial for HIV-related transfer-learning tasks because the original BFD database contains few viral proteins.
Quick Start
You can use the following code to run the model:
```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="damlab/HIV_FLT")
unmasker("C T R P N [MASK] N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C")
```
Features
- HIV-centric refinement: Trained as a refinement of the ProtBert-BFD model for HIV-specific tasks.
- Whole-viral-genome pre-training: Utilizes whole viral genomes from the Los Alamos HIV Sequence Database, improving performance on HIV-related tasks.
Installation
No specific installation steps are provided in the original document. The usage examples only require the Hugging Face transformers library (e.g. `pip install transformers`).
Usage Examples
Basic Usage
```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="damlab/HIV_FLT")
unmasker("C T R P N [MASK] N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C")
```
This returns the most likely residues at the masked position:

```
[
  {
    "score": 0.9581968188285828,
    "token": 17,
    "token_str": "N",
    "sequence": "C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  },
  {
    "score": 0.022986575961112976,
    "token": 12,
    "token_str": "K",
    "sequence": "C T R P N K N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  },
  {
    "score": 0.003997281193733215,
    "token": 14,
    "token_str": "D",
    "sequence": "C T R P N D N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  },
  {
    "score": 0.003636382520198822,
    "token": 15,
    "token_str": "T",
    "sequence": "C T R P N T N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  },
  {
    "score": 0.002701344434171915,
    "token": 10,
    "token_str": "S",
    "sequence": "C T R P N S N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  }
]
```
Documentation
Model Description
The HIV-BERT model encodes each amino acid as an individual token. It was trained using Masked Language Modeling with a 15% mask rate on the damlab/HIV_FLT dataset, in 256-amino-acid chunks.
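As a quick illustration of the per-residue tokenization, the sketch below encodes a short spaced sequence; the special tokens shown in the comment are assumptions based on ProtBert-style tokenizers, not values stated in the original document.

```python
from transformers import AutoTokenizer

# Minimal sketch: ProtBert-style tokenizers expect space-separated residues,
# so each amino acid becomes its own token.
tokenizer = AutoTokenizer.from_pretrained("damlab/HIV_FLT")
encoded = tokenizer("C T R P N N N T R K S")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Expected shape of the output (special tokens assumed):
# ['[CLS]', 'C', 'T', 'R', 'P', 'N', 'N', 'N', 'T', 'R', 'K', 'S', '[SEP]']
```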
Intended Uses & Limitations
- Intended uses: Can be used to predict expected mutations via a masking approach and as a base for transfer learning in HIV-specific classification tasks (see the sketch after this list).
- Limitations: Not specified in the original document.
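For the transfer-learning use case, a minimal sketch of loading HIV-BERT as the base encoder for a downstream classifier could look like the following; the binary task and the `num_labels` value are illustrative assumptions, not part of the original document.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical fine-tuning starting point: reuse the HIV-BERT weights as the
# encoder and attach a freshly initialized classification head.
tokenizer = AutoTokenizer.from_pretrained("damlab/HIV_FLT")
model = AutoModelForSequenceClassification.from_pretrained(
    "damlab/HIV_FLT",
    num_labels=2,  # illustrative: e.g. a binary HIV-specific phenotype label
)
```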
Training Data
The damlab/HIV_FLT dataset was used to refine the original rostlab/Prot-bert-bfd model. It contains 1790 full HIV genomes from around the world, which translate to approximately 3.9 million amino-acid tokens.
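If the dataset is hosted on the Hugging Face Hub under the same identifier, it could presumably be loaded with the datasets library as sketched below; the splits and column names are not specified in the original document.

```python
from datasets import load_dataset

# Assumes the dataset is available on the Hub as "damlab/HIV_FLT";
# inspect the returned object to see its actual splits and columns.
dataset = load_dataset("damlab/HIV_FLT")
print(dataset)
```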
Training Procedure
Preprocessing
As with the rostlab/Prot-bert-bfd model, the rare amino acids U, Z, O, and B were converted to X, and spaces were added between each amino acid. All strings were concatenated and chunked into 256-token chunks for training, with 20% of the chunks held out for validation.
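A minimal sketch of this preprocessing, assuming plain Python strings as input and a simple tail split for the 20% validation hold-out (the original document does not say how the split was drawn):

```python
import re

def preprocess(sequences, chunk_len=256, val_fraction=0.2):
    # Map the rare amino acids U, Z, O, B to X, as in rostlab/Prot-bert-bfd.
    cleaned = [re.sub(r"[UZOB]", "X", seq.upper()) for seq in sequences]
    # Concatenate everything, cut into fixed-length chunks, and
    # space-separate residues so each amino acid is one token.
    concatenated = "".join(cleaned)
    chunks = [
        " ".join(concatenated[i : i + chunk_len])
        for i in range(0, len(concatenated), chunk_len)
    ]
    # Hold out a fraction of the chunks for validation (split method assumed).
    n_train = int(len(chunks) * (1 - val_fraction))
    return chunks[:n_train], chunks[n_train:]
```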
Training
Training was performed using the HuggingFace training module with the MaskedLM data loader at a 15% masking rate. The learning rate was set at E-5, with 50K warm-up steps and a cosine_with_restarts learning-rate schedule. Training continued until the loss on the held-out dataset failed to improve for 3 consecutive epochs.
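A hedged sketch of how this setup could be expressed with the transformers Trainer API follows; the output directory, the placeholder input sequences, the interpretation of "E-5" as 1e-5, and the use of EarlyStoppingCallback to implement the 3-epoch stopping rule are all assumptions, not details confirmed by the original document.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Base checkpoint identifier as written in this document.
tokenizer = AutoTokenizer.from_pretrained("rostlab/Prot-bert-bfd")
model = AutoModelForMaskedLM.from_pretrained("rostlab/Prot-bert-bfd")

sequences = [...]  # placeholder: amino-acid strings from damlab/HIV_FLT
train_texts, val_texts = preprocess(sequences)  # from the preprocessing sketch
train_ds = [tokenizer(t) for t in train_texts]
eval_ds = [tokenizer(t) for t in val_texts]

# 15% masking rate, as described above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="hiv_bert",            # assumed
    learning_rate=1e-5,               # "E-5" interpreted as 1e-5
    warmup_steps=50_000,
    lr_scheduler_type="cosine_with_restarts",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    # Stop once eval loss has not improved for 3 consecutive epochs.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```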
Evaluation Results
No evaluation results are provided in the original document.
BibTeX Entry and Citation Info
[More Information Needed]
License
This project is licensed under the MIT license.
| Property | Details |
|----------|---------|
| Model Type | Refined version of ProtBert-BFD for HIV-centric tasks |
| Training Data | damlab/HIV_FLT, containing 1790 full HIV genomes with about 3.9 million amino-acid tokens |