HIV_BERT model
The HIV_BERT model is a refined version of the ProtBert-BFD model, specifically tailored for HIV-centric tasks. It was pre-trained on whole viral genomes from the Los Alamos HIV Sequence Database, which is crucial for HIV-related transfer-learning tasks because the original BFD database contains few viral proteins.
Quick Start
You can use the following code to run the model:
```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="damlab/HIV_FLT")
unmasker("C T R P N [MASK] N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C")
```
Features
- HIV-centric refinement: Trained as a refinement of the ProtBert-BFD model for HIV-specific tasks.
- Whole-viral-genome pre-training: Utilizes whole viral genomes from the Los Alamos HIV Sequence Database, improving performance on HIV-related tasks.
Installation
No specific installation steps are provided in the original document. The usage examples only require the Hugging Face transformers library (e.g. `pip install transformers`).
Usage Examples
Basic Usage
```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="damlab/HIV_FLT")
unmasker("C T R P N [MASK] N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C")
```
This returns the most likely residues at the masked position:

```
[
  {
    "score": 0.9581968188285828,
    "token": 17,
    "token_str": "N",
    "sequence": "C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  },
  {
    "score": 0.022986575961112976,
    "token": 12,
    "token_str": "K",
    "sequence": "C T R P N K N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  },
  {
    "score": 0.003997281193733215,
    "token": 14,
    "token_str": "D",
    "sequence": "C T R P N D N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  },
  {
    "score": 0.003636382520198822,
    "token": 15,
    "token_str": "T",
    "sequence": "C T R P N T N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  },
  {
    "score": 0.002701344434171915,
    "token": 10,
    "token_str": "S",
    "sequence": "C T R P N S N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  }
]
```
Documentation
Model Description
The HIV-BERT model encodes each amino acid as an individual token. It was trained using Masked Language Modeling with a 15% mask rate on the damlab/HIV_FLT dataset, in 256-amino-acid chunks.
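As a quick illustration of the per-residue tokenization, the sketch below encodes a short spaced sequence; the special tokens shown in the comment are assumptions based on ProtBert-style tokenizers, not values stated in the original document.

```python
from transformers import AutoTokenizer

# Minimal sketch: ProtBert-style tokenizers expect space-separated residues,
# so each amino acid becomes its own token.
tokenizer = AutoTokenizer.from_pretrained("damlab/HIV_FLT")
encoded = tokenizer("C T R P N N N T R K S")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Expected shape of the output (special tokens assumed):
# ['[CLS]', 'C', 'T', 'R', 'P', 'N', 'N', 'N', 'T', 'R', 'K', 'S', '[SEP]']
```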
Intended Uses & Limitations
- Intended uses: Can be used to predict expected mutations via a masking approach and as a base for transfer learning in HIV-specific classification tasks (see the sketch after this list).
- Limitations: Not specified in the original document.
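For the transfer-learning use case, a minimal sketch of loading HIV-BERT as the base encoder for a downstream classifier could look like the following; the binary task and the `num_labels` value are illustrative assumptions, not part of the original document.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical fine-tuning starting point: reuse the HIV-BERT weights as the
# encoder and attach a freshly initialized classification head.
tokenizer = AutoTokenizer.from_pretrained("damlab/HIV_FLT")
model = AutoModelForSequenceClassification.from_pretrained(
    "damlab/HIV_FLT",
    num_labels=2,  # illustrative: e.g. a binary HIV-specific phenotype label
)
```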
Training Data
The damlab/HIV_FLT dataset was used to refine the original rostlab/Prot-bert-bfd model. It contains 1790 full HIV genomes from around the world, which translate to approximately 3.9 million amino-acid tokens.
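If the dataset is hosted on the Hugging Face Hub under the same identifier, it could presumably be loaded with the datasets library as sketched below; the splits and column names are not specified in the original document.

```python
from datasets import load_dataset

# Assumes the dataset is available on the Hub as "damlab/HIV_FLT";
# inspect the returned object to see its actual splits and columns.
dataset = load_dataset("damlab/HIV_FLT")
print(dataset)
```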
Training Procedure
Preprocessing
As with the rostlab/Prot-bert-bfd model, the rare amino acids U, Z, O, and B were converted to X, and spaces were added between each amino acid. All strings were concatenated and chunked into 256-token chunks for training, with 20% of the chunks held out for validation.
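A minimal sketch of this preprocessing, assuming plain Python strings as input and a simple tail split for the 20% validation hold-out (the original document does not say how the split was drawn):

```python
import re

def preprocess(sequences, chunk_len=256, val_fraction=0.2):
    # Map the rare amino acids U, Z, O, B to X, as in rostlab/Prot-bert-bfd.
    cleaned = [re.sub(r"[UZOB]", "X", seq.upper()) for seq in sequences]
    # Concatenate everything, cut into fixed-length chunks, and
    # space-separate residues so each amino acid is one token.
    concatenated = "".join(cleaned)
    chunks = [
        " ".join(concatenated[i : i + chunk_len])
        for i in range(0, len(concatenated), chunk_len)
    ]
    # Hold out a fraction of the chunks for validation (split method assumed).
    n_train = int(len(chunks) * (1 - val_fraction))
    return chunks[:n_train], chunks[n_train:]
```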
Training
Training was performed using the HuggingFace training module with the MaskedLM data loader at a 15% masking rate. The learning rate was set at E-5, with 50K warm-up steps and a cosine_with_restarts learning-rate schedule. Training continued until the loss on the held-out dataset failed to improve for 3 consecutive epochs.
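A hedged sketch of how this setup could be expressed with the transformers Trainer API follows; the output directory, the placeholder input sequences, the interpretation of "E-5" as 1e-5, and the use of EarlyStoppingCallback to implement the 3-epoch stopping rule are all assumptions, not details confirmed by the original document.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Base checkpoint identifier as written in this document.
tokenizer = AutoTokenizer.from_pretrained("rostlab/Prot-bert-bfd")
model = AutoModelForMaskedLM.from_pretrained("rostlab/Prot-bert-bfd")

sequences = [...]  # placeholder: amino-acid strings from damlab/HIV_FLT
train_texts, val_texts = preprocess(sequences)  # from the preprocessing sketch
train_ds = [tokenizer(t) for t in train_texts]
eval_ds = [tokenizer(t) for t in val_texts]

# 15% masking rate, as described above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="hiv_bert",            # assumed
    learning_rate=1e-5,               # "E-5" interpreted as 1e-5
    warmup_steps=50_000,
    lr_scheduler_type="cosine_with_restarts",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    # Stop once eval loss has not improved for 3 consecutive epochs.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```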
Evaluation Results
No evaluation results are provided in the original document.
BibTeX Entry and Citation Info
[More Information Needed]
License
This project is licensed under the MIT license.
| Property | Details |
|----------|---------|
| Model Type | Refined version of ProtBert-BFD for HIV-centric tasks |
| Training Data | damlab/HIV_FLT, containing 1790 full HIV genomes with about 3.9 million amino-acid tokens |