GENERator-3b-base: an open-source genomic model trained on eukaryotic data, with stronger analysis of very long base-pair sequences

Generator Eukaryote 3b Base

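Developed by GenerTeam

GENERator is a generative genomic foundation model with a context length of 98k base pairs and 3 billion parameters, trained on an expansive dataset of eukaryotic DNA.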

Protein model

Transformers

Open-source license: MIT #long-sequence-generation #cross-species-genomics #98k-base-context

Downloads: 1,599

Release date: 2/11/2025

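Model overview

This is a foundation model specialized for generating and analyzing genomic sequences, with strengthened understanding and generation capabilities across species.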

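Model features

Long-context processing

Supports a context length of up to 98k base pairs

Cross-species understanding

Trained on a diverse eukaryotic DNA dataset, giving it cross-species analysis capability

Large-scale pretraining

Pretrained on 386 billion base pairs of DNA sequence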
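
The 98k base-pair context length quoted above can be related to a token budget with a little arithmetic. The sketch below assumes a 6-mer tokenizer, that is, six bases per token; this page does not state the tokenization scheme, so treat `bases_per_token=6` and the token count `HYPOTHETICAL_MAX_TOKENS` as assumptions and read the real limit from `config.max_position_embeddings`.

```python
def bp_capacity(max_tokens: int, bases_per_token: int = 6) -> int:
    """Return how many base pairs fit into a context of `max_tokens` tokens."""
    if max_tokens < 0 or bases_per_token <= 0:
        raise ValueError("max_tokens must be >= 0 and bases_per_token must be > 0")
    return max_tokens * bases_per_token


def tokens_needed(sequence_length_bp: int, bases_per_token: int = 6) -> int:
    """Return the number of tokens needed to cover a sequence (ceiling division)."""
    if sequence_length_bp < 0 or bases_per_token <= 0:
        raise ValueError("sequence_length_bp must be >= 0 and bases_per_token must be > 0")
    return -(-sequence_length_bp // bases_per_token)


# Hypothetical token limit, chosen only because 16,384 * 6 is roughly 98k.
HYPOTHETICAL_MAX_TOKENS = 16_384

print(bp_capacity(HYPOTHETICAL_MAX_TOKENS))  # 98304
print(tokens_needed(1000))                   # 167
```

If the tokenizer groups bases differently, only `bases_per_token` needs to change.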

Model capabilities

DNA sequence generation

Genomic sequence analysis

Sequence embedding representations

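Use cases

Genomic research

Gene sequence generation

Generates new DNA sequences from an input sequence

Can generate DNA sequence fragments consistent with biological properties

Sequence feature extraction

Obtains embedding representations of DNA sequences

Usable for downstream analysis tasks such as gene classification and function prediction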
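
As one illustration of a downstream task built on sequence embeddings, the sketch below classifies an embedding by its nearest class centroid. It is a toy stand-in, not a method described by the model authors: the three-dimensional vectors and the labels `"promoter"` and `"enhancer"` are made up for illustration, and real embeddings would have the model's hidden size.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Return the element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit_centroids(embeddings, labels):
    """Group embeddings by label and compute one centroid per label."""
    groups = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        groups[label].append(vector)
    return {label: centroid(vectors) for label, vectors in groups.items()}


def predict(centroids, embedding):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], embedding))


# Toy data: hypothetical embeddings and labels, for illustration only.
train_embeddings = [
    [0.9, 0.1, 0.0],
    [1.0, 0.0, 0.1],
    [0.0, 0.9, 1.0],
    [0.1, 1.0, 0.9],
]
train_labels = ["promoter", "promoter", "enhancer", "enhancer"]

centroids = fit_centroids(train_embeddings, train_labels)
print(predict(centroids, [0.8, 0.2, 0.1]))  # promoter
```

A trained classifier would normally replace the centroid rule; the point is only that each sequence becomes a fixed-length vector that standard tools can consume.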

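🚀 GENERator-eukaryote-3b-base model

This is a long-context generative genomic foundation model trained on eukaryotic DNA, with understanding and generation capabilities that span a wide range of organisms.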

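🚀 Quick start

This repository introduces GENERator, a generative genomic foundation model. It has a context length of 98k base pairs and 3 billion parameters, and is trained on a large dataset of 386 billion base pairs of eukaryotic DNA. The breadth and diversity of the pretraining data improve GENERator's understanding and generation capabilities across organisms.

For technical details, see the paper GENERator: A Long-Context Generative Genomic Foundation Model. The code and implementation details are available on GitHub: https://github.com/GenerTeam/GENERator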

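✨ Key features

Supports a long context length of 98k base pairs
Trained on 386 billion base pairs of eukaryotic DNA
Understands and generates sequences for a wide range of organisms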

📦 Installation

This model uses the transformers library. Install the required dependencies before use.
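
The page does not list exact package names or versions. As a minimal sketch, assuming the two packages imported by the usage examples (`torch` and `transformers`) are the only requirements, the following check reports which of them are not yet importable:

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found on this system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed requirements, taken from the imports used in the usage examples.
REQUIRED = ["torch", "transformers"]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages; try: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```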

💻 Usage examples

Basic usage

Generation example

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")
config = model.config

max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

# Process the sequences
sequences = [tokenizer.bos_token + sequence for sequence in sequences]

# Tokenize the sequences
tokenizer.padding_side = "left"
inputs = tokenizer(
    sequences,
    add_special_tokens=False,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Generate the sequences with greedy decoding (always pick the most likely token).
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Decode the generated sequences
decoded_sequences = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Print the decoded sequences
print(decoded_sequences)

# Expect nonsensical output (e.g., 'AAAAAA') from this example:
# the input sequences are too short to provide sufficient context.
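
The generation example passes raw strings straight to the tokenizer. A small cleaning step beforehand can catch malformed input. The sketch below is a hypothetical helper, not part of the GENERator code: it assumes inputs should contain only A, C, G and T, and that a k-mer tokenizer with six bases per token is used, so it drops leading bases until the length is a multiple of six. Both assumptions should be checked against the official repository.

```python
VALID_BASES = set("ACGT")


def clean_sequence(raw: str, bases_per_token: int = 6) -> str:
    """Normalize a DNA string and left-trim it to a multiple of `bases_per_token`.

    Hypothetical policy: whitespace is removed, letters are upper-cased, and any
    character outside A/C/G/T raises ValueError.
    """
    seq = "".join(raw.split()).upper()
    invalid = set(seq) - VALID_BASES
    if invalid:
        raise ValueError(f"unexpected characters in sequence: {sorted(invalid)}")
    # Drop bases from the start so the end of the sequence, where generation
    # continues, is preserved.
    remainder = len(seq) % bases_per_token
    return seq[remainder:]


print(clean_sequence("atgaggtggc aagaaatggg ctac"))  # ATGAGGTGGCAAGAAATGGGCTAC
```

Both example inputs above (24 and 36 bases) are already multiples of six, so this helper leaves them unchanged.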

Embedding example

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")

config = model.config
max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

# Tokenize the sequences with add_special_tokens=True to automatically add special tokens,
# such as the BOS and EOS tokens, at the appropriate positions.
tokenizer.padding_side = "right"
inputs = tokenizer(
    sequences,
    add_special_tokens=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Perform a forward pass through the model to obtain the outputs, including hidden states.
with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)

# Retrieve the hidden states from the last layer.
hidden_states = outputs.hidden_states[-1]  # Shape: (batch_size, sequence_length, hidden_size)

# Use the attention_mask to determine the index of the last token in each sequence.
# Since add_special_tokens=True is used, the last token is typically the EOS token.
attention_mask = inputs["attention_mask"]
last_token_indices = attention_mask.sum(dim=1) - 1  # Index of the last token for each sequence

# Extract the embedding corresponding to the EOS token for each sequence.
seq_embeddings = []
for i, token_index in enumerate(last_token_indices):
    # Fetch the embedding for the last token (EOS token).
    seq_embedding = hidden_states[i, token_index, :]
    seq_embeddings.append(seq_embedding)

# Stack the embeddings into a tensor with shape (batch_size, hidden_size)
seq_embeddings = torch.stack(seq_embeddings)

print("Sequence Embeddings:", seq_embeddings)
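
Once extracted, the embeddings can be compared with one another. As a minimal sketch, assuming each row of `seq_embeddings` has been converted to a plain Python list (for example with `seq_embeddings.tolist()`), cosine similarity gives a scale-independent measure of how alike two sequences look to the model. The three short vectors below are made-up stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for rows of seq_embeddings.tolist().
emb_a = [1.0, 2.0, 3.0]
emb_b = [2.0, 4.0, 6.0]   # same direction as emb_a
emb_c = [-3.0, 0.0, 1.0]  # orthogonal to emb_a

print(round(cosine_similarity(emb_a, emb_b), 6))  # 1.0
print(round(cosine_similarity(emb_a, emb_c), 6))  # 0.0
```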

📚 Documentation

For technical details, see the paper GENERator: A Long-Context Generative Genomic Foundation Model.

📄 License

This project is released under the MIT License.

📚 Citation

@misc{wu2025generator,
      title={GENERator: A Long-Context Generative Genomic Foundation Model}, 
      author={Wei Wu and Qiuyi Li and Mingyang Li and Kun Fu and Fuli Feng and Jieping Ye and Hui Xiong and Zheng Wang},
      year={2025},
      eprint={2502.07272},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07272}, 
}