GENERator-3b-base: an open-source genomic model trained on eukaryotic data, built for analysis of very long base-pair sequences

Generator Eukaryote 3b Base

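Developed by GenerTeam

GENERator is a generative genomic foundation model with a context length of 98k base pairs and 3 billion parameters, trained on an expanded dataset of eukaryotic DNA.

Protein model

Transformers

License: MIT #long-sequence generation #cross-species genomics #98k base-pair context

Downloads: 1,599

Released: 2/11/2025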

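Model Overview

A foundation model focused on generating and analyzing genomic sequences, with enhanced cross-species understanding and generation capabilities.

Model Features

Long-context processing

Supports a context length of up to 98k base pairs.

Cross-species understanding

Trained on a diverse eukaryotic DNA dataset, which enables analysis across species.

Large-scale pretraining

Pretrained on 386 billion base pairs of DNA sequence.

Model Capabilities

DNA sequence generation

Genomic sequence analysis

Sequence embedding representation

Use Cases

Genomic research

Gene sequence generation

Generates new DNA sequences conditioned on an input sequence.

Can generate DNA sequence fragments consistent with biological characteristics.

Sequence feature extraction

Obtains embedding representations of DNA sequences.

The embeddings can be used in downstream analysis tasks such as gene classification or function prediction.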
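
As a rough illustration of the downstream use mentioned above, the sketch below classifies a sequence embedding by comparing it with per-class centroids using cosine similarity. Everything in it is hypothetical: the three-dimensional vectors are made-up stand-ins for real model embeddings, the labels `promoter` and `enhancer` are placeholders, and nearest-centroid classification is only one simple option, not a method described by the authors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def classify(embedding, centroids):
    """Return the label whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine(embedding, centroids[label]))

# Toy 3-dimensional "embeddings" with made-up values and placeholder labels.
train = {
    "promoter": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "enhancer": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
centroids = {label: centroid(vectors) for label, vectors in train.items()}

print(classify([0.7, 0.3, 0.0], centroids))  # promoter
```

In practice the vectors would come from the model's hidden states, and a trained classifier would usually replace the centroid rule.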

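🚀 GENERator-eukaryote-3b-base

GENERator is a generative genomic foundation model with a context length of 98k base pairs and 3 billion parameters. It was trained on a large dataset of 386 billion base pairs of eukaryotic DNA. The breadth and diversity of the pretraining data give GENERator stronger understanding and generation capabilities across different organisms.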
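
To make the context-length figure concrete, here is a small arithmetic sketch relating tokens to base pairs. It assumes non-overlapping 6-mer tokens and a 16,384-token window; neither number is stated on this page, so both are assumptions to check against `config.max_position_embeddings` and the paper.

```python
# Assumed values (not stated on this page): 6-mer tokens, 16,384-token window.
K_MER = 6
MAX_TOKENS = 16_384

def tokens_to_base_pairs(n_tokens: int, k: int = K_MER) -> int:
    """Base pairs covered by n_tokens non-overlapping k-mer tokens."""
    return n_tokens * k

def base_pairs_to_tokens(n_bp: int, k: int = K_MER) -> int:
    """Tokens needed to cover n_bp base pairs, rounded up."""
    return -(-n_bp // k)

print(tokens_to_base_pairs(MAX_TOKENS))  # 98304
print(base_pairs_to_tokens(24))          # 4
```

Under these assumptions the window covers 98,304 base pairs, which matches the "98k" figure quoted above.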

🚀 Quick Start

For more technical details, see our paper, GENERator: A Long-Context Generative Genomic Foundation Model. The code and implementation details are available on GitHub: https://github.com/GenerTeam/GENERator.

💻 Usage Examples

Basic usage

Example 1: Generation

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")
config = model.config

max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

# Process the sequences
sequences = [tokenizer.bos_token + sequence for sequence in sequences]

# Tokenize the sequences
tokenizer.padding_side = "left"
inputs = tokenizer(
    sequences,
    add_special_tokens=False,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Generate the sequences with greedy decoding (always pick the most likely token).
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Decode the generated sequences
decoded_sequences = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Print the decoded sequences
print(decoded_sequences)

# Nonsensical output (e.g., 'AAAAAA') is expected here:
# the input sequences are too short to provide sufficient context.
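
Before tokenizing, it can help to normalize the raw DNA strings. The sketch below is a hypothetical helper, `prepare_dna`, that upper-cases a sequence, rejects characters other than A, C, G, and T, and trims bases from the left so the length is a multiple of `k`. The default `k=6` assumes a 6-mer tokenizer, which this page does not state; whether the real tokenizer requires this trimming is also an assumption.

```python
def prepare_dna(sequence: str, k: int = 6) -> str:
    """Upper-case, validate, and left-trim a DNA string to a multiple of k.

    Hypothetical helper: k=6 is an assumption about the tokenizer.
    """
    seq = sequence.strip().upper()
    invalid = set(seq) - set("ACGT")
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    remainder = len(seq) % k
    # Drop bases from the left so the end of the prompt,
    # where generation continues, is kept intact.
    return seq[remainder:]

print(prepare_dna("acgtacg"))  # CGTACG
```

Both example prompts above (24 and 36 bases) are already multiples of 6, so this helper would return them unchanged.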

Example 2: Embedding

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")

config = model.config
max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

# Tokenize the sequences with add_special_tokens=True to automatically add special tokens,
# such as the BOS and EOS tokens, at the appropriate positions.
tokenizer.padding_side = "right"
inputs = tokenizer(
    sequences,
    add_special_tokens=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Perform a forward pass through the model to obtain the outputs, including hidden states.
with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)

# Retrieve the hidden states from the last layer.
hidden_states = outputs.hidden_states[-1]  # Shape: (batch_size, sequence_length, hidden_size)

# Use the attention_mask to determine the index of the last token in each sequence.
# Since add_special_tokens=True is used, the last token is typically the EOS token.
attention_mask = inputs["attention_mask"]
last_token_indices = attention_mask.sum(dim=1) - 1  # Index of the last token for each sequence

# Extract the embedding of the last token (typically EOS) for each sequence
# with advanced indexing; the result has shape (batch_size, hidden_size).
batch_indices = torch.arange(hidden_states.size(0), device=last_token_indices.device)
seq_embeddings = hidden_states[batch_indices, last_token_indices]

print("Sequence Embeddings:", seq_embeddings)
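
The last-token lookup above relies on right padding. The plain-Python sketch below reproduces the `attention_mask.sum(dim=1) - 1` arithmetic on toy masks to show why: with left padding the same formula returns an index that no longer points at the final real token. The masks are made-up examples, not tokenizer output.

```python
def last_token_indices(attention_mask):
    """Index of the last real token per row, valid only for RIGHT padding."""
    return [sum(row) - 1 for row in attention_mask]

# 1 marks a real token, 0 marks padding.
right_padded = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
left_padded = [
    [0, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
]

print(last_token_indices(right_padded))  # [3, 5]
# Same numbers, but row 0's last real token is actually at index 5.
print(last_token_indices(left_padded))   # [3, 5]
```

This is why the embedding example sets `tokenizer.padding_side = "right"`, while the generation example uses left padding so that new tokens follow the prompt directly.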

📄 License

This project is released under the MIT License.

📚 Citation

@misc{wu2025generator,
      title={GENERator: A Long-Context Generative Genomic Foundation Model}, 
      author={Wei Wu and Qiuyi Li and Mingyang Li and Kun Fu and Fuli Feng and Jieping Ye and Hui Xiong and Zheng Wang},
      year={2025},
      eprint={2502.07272},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07272}, 
}