GENERator-3b-base: an open-source genomic model trained on eukaryotic data, with stronger analysis of very long base-pair sequences


Generator Eukaryote 3b Base

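Developed by GenerTeam

GENERator is a generative genomic foundation model with a context length of 98k base pairs and 3 billion parameters, trained on an expanded dataset of eukaryotic DNA.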

Protein model

Transformers

License: MIT #long-sequence generation #cross-species genomics #98k-base context

Downloads: 1,599

Released: 2/11/2025

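Model overview

A foundation model focused on generating and analyzing genomic sequences, with improved understanding and generation across species.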

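Model features

Long-context processing

Supports context lengths of up to 98k base pairs

Cross-species understanding

Trained on a diverse eukaryotic DNA dataset, which enables cross-species analysis

Large-scale pretraining

Pretrained on 386 billion base pairs of DNA sequence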

Model capabilities

DNA sequence generation

Genomic sequence analysis

Sequence embeddings

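Use cases

Genomics research

Gene sequence generation

Generates new DNA sequences from an input sequence

Can produce DNA sequence fragments that are consistent with biological characteristics

Sequence feature extraction

Obtains embedding representations of DNA sequences

The embeddings can feed downstream analysis tasks such as gene classification or function prediction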

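🚀 GENERator-eukaryote-3b-base

GENERator is a generative genomic foundation model with a context length of 98k base pairs and 3 billion parameters. It is trained on a large dataset of 386 billion base pairs of eukaryotic DNA, and this rich, varied pretraining data gives GENERator stronger understanding and generation capabilities across different organisms.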

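🚀 Quick start

In this repository, we present GENERator, a generative genomic foundation model with a context length of 98k base pairs and 3 billion parameters. It is trained on a large-scale dataset of 386 billion base pairs of eukaryotic DNA. The broad and diverse pretraining data gives GENERator stronger understanding and generation capabilities across a wide range of organisms.

For more technical details, please refer to our paper, GENERator: A Long-Context Generative Genomic Foundation Model. The code and implementation details are available on GitHub: https://github.com/GenerTeam/GENERator.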
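To make the 98k figure concrete, here is a rough sketch of how a context budget in base pairs could be derived and used to split a long sequence into windows. It rests on two assumptions that are not stated on this page: that the tokenizer maps 6 bases to one token, and that `config.max_position_embeddings` is 16384. The helper names are hypothetical, and special tokens such as BOS are ignored for simplicity.

```python
from typing import List

# Assumptions (not stated on this page): 6 bases per token and a
# max_position_embeddings of 16384. Check model.config before relying on them.
ASSUMED_BASES_PER_TOKEN = 6
ASSUMED_MAX_TOKENS = 16384


def context_in_base_pairs(max_tokens: int = ASSUMED_MAX_TOKENS,
                          bases_per_token: int = ASSUMED_BASES_PER_TOKEN) -> int:
    """Approximate context budget in base pairs, ignoring special tokens."""
    return max_tokens * bases_per_token


def split_into_windows(sequence: str, window_bp: int) -> List[str]:
    """Cut a long sequence into consecutive, non-overlapping windows."""
    return [sequence[i:i + window_bp] for i in range(0, len(sequence), window_bp)]


budget = context_in_base_pairs()
print(budget)  # 98304

toy_genome = "ACGT" * 50_000  # 200,000 bases of placeholder sequence
windows = split_into_windows(toy_genome, budget)
print([len(w) for w in windows])  # [98304, 98304, 3392]
```

Non-overlapping windows are the simplest choice; overlapping windows may be preferable when features can straddle a window boundary.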

💻 Usage examples

Basic usage

Example 1: Generation

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")
config = model.config

max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

# Process the sequences
sequences = [tokenizer.bos_token + sequence for sequence in sequences]

# Tokenize the sequences
tokenizer.padding_side = "left"
inputs = tokenizer(
    sequences,
    add_special_tokens=False,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Generate the sequences with greedy decoding (always pick the most likely next token)
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Decode the generated sequences
decoded_sequences = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Print the decoded sequences
print(decoded_sequences)

# Expect nonsensical decoded sequences here (e.g., 'AAAAAA'):
# the input sequences are too short to provide sufficient context.
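Before tokenizing real data, it can help to normalize the raw strings. The sketch below uppercases a sequence, rejects anything other than A, C, G, T, and drops bases from the left so the length is a multiple of the k-mer size. The 6-mer size is an assumption taken from outside this page, `prepare_sequence` is a hypothetical helper, and whether the tokenizer accepts other symbols such as N is not covered here.

```python
import re

# Assumed k-mer size of the tokenizer (an assumption, not stated on this page).
ASSUMED_K = 6


def prepare_sequence(raw: str, k: int = ASSUMED_K) -> str:
    """Uppercase, validate, and left-truncate a DNA string to a multiple of k."""
    seq = re.sub(r"\s+", "", raw).upper()
    if not re.fullmatch(r"[ACGT]*", seq):
        raise ValueError("sequence contains characters other than A, C, G, T")
    # Drop the leading remainder so the bases closest to the generation
    # point (the right-hand end) are kept.
    return seq[len(seq) % k:]


# 24 bases: already a multiple of 6, so only the case changes.
print(prepare_sequence("atgaggtggcaagaaatgggctac"))

# 38 bases: the first two are dropped, leaving 36.
print(len(prepare_sequence("GAATTCCATGAGGCTATAGAATAATCTAAGAGAAATAC")))
```

Truncating on the left matches the left-padding used in the generation example, where the end of the prompt is what the model continues from.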

Example 2: Embedding

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")

config = model.config
max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

# Tokenize the sequences with add_special_tokens=True to automatically add special tokens,
# such as the BOS and EOS tokens, at the appropriate positions.
tokenizer.padding_side = "right"
inputs = tokenizer(
    sequences,
    add_special_tokens=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Perform a forward pass through the model to obtain the outputs, including hidden states.
with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)

# Retrieve the hidden states from the last layer.
hidden_states = outputs.hidden_states[-1]  # Shape: (batch_size, sequence_length, hidden_size)

# Use the attention_mask to determine the index of the last token in each sequence.
# Since add_special_tokens=True is used, the last token is typically the EOS token.
attention_mask = inputs["attention_mask"]
last_token_indices = attention_mask.sum(dim=1) - 1  # Index of the last token for each sequence

# Extract the embedding corresponding to the EOS token for each sequence.
seq_embeddings = []
for i, token_index in enumerate(last_token_indices):
    # Fetch the embedding for the last token (EOS token).
    seq_embedding = hidden_states[i, token_index, :]
    seq_embeddings.append(seq_embedding)

# Stack the embeddings into a tensor with shape (batch_size, hidden_size)
seq_embeddings = torch.stack(seq_embeddings)

print("Sequence Embeddings:", seq_embeddings)
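One simple downstream use of these embeddings is comparing sequences by cosine similarity. The sketch below is a minimal, dependency-free illustration on toy vectors; `cosine_similarity` is a hypothetical helper, and the toy values stand in for rows of the embedding tensor (for example, obtained via `seq_embeddings.tolist()`).

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for per-sequence embeddings (illustrative values only).
emb_a = [1.0, 0.0, 2.0]
emb_b = [2.0, 0.0, 4.0]  # same direction as emb_a
emb_c = [0.0, 3.0, 0.0]  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))  # close to 1.0
print(cosine_similarity(emb_a, emb_c))  # 0.0
```

For classification or function prediction, the same vectors would more typically be passed to a trained classifier; similarity search is just the smallest self-contained example.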

📄 License

This project is released under the MIT License.

📚 Citation

@misc{wu2025generator,
      title={GENERator: A Long-Context Generative Genomic Foundation Model}, 
      author={Wei Wu and Qiuyi Li and Mingyang Li and Kun Fu and Fuli Feng and Jieping Ye and Hui Xiong and Zheng Wang},
      year={2025},
      eprint={2502.07272},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07272}, 
}