SegmentNT: an open-source DNA segmentation model that predicts the locations of various genomic elements in a sequence, free to use

Segment Nt

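Developed by InstaDeepAI

SegmentNT is a DNA segmentation model based on the Nucleotide Transformer that predicts the locations of diverse genomic elements in a sequence at single-nucleotide resolution.

Molecular model

Transformers

#single-nucleotide-resolution #genome-segmentation #DNA-foundation-model

Downloads: 546

Release date: 3/4/2024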

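Model overview

SegmentNT is a segmentation model built on a DNA foundation model. It predicts the locations of 14 different classes of genomic elements, including genes and regulatory elements, in human-genome input sequences of up to 30 kb.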

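Model features

High-resolution segmentation

Predicts the locations of genomic elements at single-nucleotide resolution

Long-sequence processing

Handles DNA sequences of up to 30 kb (extendable to 50 kb)

Multi-element prediction

Predicts 14 different classes of genomic elements, including genes and regulatory elements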

Model capabilities

DNA sequence segmentation

Genomic element prediction

Long-sequence processing

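Use cases

Genomics research

Gene structure prediction

Predicts gene-structure elements such as protein-coding genes and non-coding RNAs

High-accuracy predictions at single-nucleotide resolution

Regulatory element identification

Identifies regulatory elements such as promoters and enhancers

Distinguishes tissue-specific from tissue-invariant regulatory elements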

🚀 SegmentNT

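SegmentNT is a segmentation model that leverages the Nucleotide Transformer (NT) DNA foundation model to predict the locations of several types of genomic elements in a sequence at single-nucleotide resolution. It was trained on 14 classes of human genomic elements in input sequences of up to 30 kb. These include gene elements (protein-coding genes, lncRNAs, 5’UTR, 3’UTR, exons, introns, splice acceptor and donor sites) and regulatory elements (polyA signal, tissue-invariant and tissue-specific promoters and enhancers, and CTCF-bound sites).

Developed by: InstaDeep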

🚀 Quick start

Model sources

Repository: Nucleotide Transformer
Paper: Segmenting the genome at single-nucleotide resolution with DNA foundation models

How to use

Until its next release, the transformers library must be installed from source to use the model. Use the following command:

pip install --upgrade git+https://github.com/huggingface/transformers.git

The short snippet below retrieves logits and embeddings from dummy DNA sequences.

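⚠️ The maximum sequence length defaults to the training length of 30,000 nucleotides, or 5001 tokens (including the CLS token). However, SegmentNT-multi-species has been shown to generalize to sequences of up to 50,000 bp. To run inference on sequences between 30 kbp and 50 kbp, change the rescaling_factor of the Rotary Embedding layer in the esm model to num_dna_tokens_inference / max_num_tokens_nt, where num_dna_tokens_inference is the number of tokens at inference (for example, 6669 for a sequence of 40008 base pairs) and max_num_tokens_nt is the maximum number of tokens the backbone nucleotide-transformer was trained on, i.e. 2048.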
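The arithmetic in the note above can be checked with a few lines of Python. This is a minimal sketch: the helper names `num_tokens_inference` and `rescaling_factor` are hypothetical, the sketch assumes the sequence length is a multiple of 6, and it only computes the value. It does not show how to set it on the model, which the notebook mentioned below covers.

```python
# Constants taken from the note above; the helper names are hypothetical.
KMER = 6                    # nucleotides per token (6-mer tokenizer)
MAX_NUM_TOKENS_NT = 2048    # max tokens the backbone was trained on
DOWNSAMPLING_FACTOR = 4     # DNA token count must be divisible by 2**2


def num_tokens_inference(sequence_length_bp: int) -> int:
    """Token count for a sequence, including the prepended CLS token."""
    if sequence_length_bp % KMER != 0:
        raise ValueError("this sketch assumes a length that is a multiple of 6")
    num_dna_tokens = sequence_length_bp // KMER
    if num_dna_tokens % DOWNSAMPLING_FACTOR != 0:
        raise ValueError("the number of DNA tokens must be divisible by 4")
    return num_dna_tokens + 1  # +1 for the CLS token


def rescaling_factor(sequence_length_bp: int) -> float:
    """num_dna_tokens_inference / max_num_tokens_nt, as defined in the note."""
    return num_tokens_inference(sequence_length_bp) / MAX_NUM_TOKENS_NT


print(num_tokens_inference(40008))        # 6669
print(round(rescaling_factor(40008), 4))  # 3.2563
```

A length such as 40008 bp works because it yields 6668 DNA tokens, which is divisible by 4.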

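./inference_segment_nt.ipynb can be run in Google Colab by clicking its Colab icon. The notebook shows how to handle inference both for sequence lengths that require changing the rescaling factor and for those that do not. Running it reproduces Fig. 1 and Fig. 3 of the SegmentNT paper.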

# Load model and tokenizer
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt", trust_remote_code=True)
model = AutoModel.from_pretrained("InstaDeepAI/segment_nt", trust_remote_code=True)

# Choose the length to which the input sequences are padded. By default, the
# model max length is chosen, but feel free to decrease it, as the time taken to
# obtain the embeddings increases significantly with it.
# The number of DNA tokens (excluding the prepended CLS token) needs to be divisible by
# 2 to the power of the number of downsampling blocks, i.e. 4.
max_length = 12 + 1

assert (max_length - 1) % 4 == 0, (
    "The number of DNA tokens (excluding the prepended CLS token) needs to be divisible by "
    "2 to the power of the number of downsampling blocks, i.e. 4.")

# Create dummy DNA sequences and tokenize them
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
tokens = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"]

# Infer
attention_mask = tokens != tokenizer.pad_token_id
outs = model(
    tokens,
    attention_mask=attention_mask,
    output_hidden_states=True
)

# Obtain the logits over the genomic features
logits = outs.logits.detach()
# Convert them to probabilities
probabilities = torch.nn.functional.softmax(logits, dim=-1)
print(f"Probabilities shape: {probabilities.shape}")

# Get probabilities associated with intron
idx_intron = model.config.features.index("intron")
probabilities_intron = probabilities[:,:,idx_intron]
print(f"Intron probabilities shape: {probabilities_intron.shape}")
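A common next step, which the snippet above does not cover, is turning per-nucleotide probabilities into intervals. The sketch below assumes you have already extracted a one-dimensional sequence of probabilities that a single feature is present at each position. The function name and the 0.5 threshold are illustrative choices, not taken from the model card.

```python
from typing import List, Sequence, Tuple


def probabilities_to_intervals(
    probs: Sequence[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals where probs >= threshold."""
    intervals = []
    start = None  # start of the interval currently being extended, if any
    for position, p in enumerate(probs):
        if p >= threshold and start is None:
            start = position                      # open a new interval
        elif p < threshold and start is not None:
            intervals.append((start, position))   # close the current interval
            start = None
    if start is not None:
        intervals.append((start, len(probs)))     # interval runs to the end
    return intervals


toy_probs = [0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.6, 0.9]
print(probabilities_to_intervals(toy_probs))  # [(2, 5), (6, 8)]
```

Positions here are 0-based offsets within the input sequence; mapping them to genomic coordinates requires adding the start coordinate of the window.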

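📚 Documentation

Training data

The SegmentNT model was trained on all human chromosomes except chromosomes 20 and 21, held out as the test set, and chromosome 22, used as the validation set. During training, sequences are randomly sampled from the genome together with their annotations. The validation and test sequences, however, are kept fixed by sliding a window of length 30,000 over chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.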
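The split and the fixed evaluation windows can be sketched as follows. The `chr20`-style chromosome names, the function names, and the non-overlapping stride (equal to the window length) are assumptions made for illustration; the text above does not state the stride that was used.

```python
from typing import Iterator, Tuple

TEST_CHROMOSOMES = {"chr20", "chr21"}   # held out as the test set
VALIDATION_CHROMOSOMES = {"chr22"}      # used for monitoring and early stopping
WINDOW = 30_000                         # window length in nucleotides


def assign_split(chromosome: str) -> str:
    """Map a chromosome name to 'train', 'validation' or 'test'."""
    if chromosome in TEST_CHROMOSOMES:
        return "test"
    if chromosome in VALIDATION_CHROMOSOMES:
        return "validation"
    return "train"


def sliding_windows(
    chromosome_length: int, window: int = WINDOW, stride: int = WINDOW
) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) windows that fit inside the chromosome.

    The default stride equals the window length (an assumption).
    """
    for start in range(0, chromosome_length - window + 1, stride):
        yield (start, start + window)


print(assign_split("chr1"), assign_split("chr20"), assign_split("chr22"))
print(list(sliding_windows(100_000)))
# [(0, 30000), (30000, 60000), (60000, 90000)]
```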

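Training procedure

Preprocessing

DNA sequences are tokenized with the Nucleotide Transformer Tokenizer, which splits sequences into 6-mer tokens as described in the Tokenization section of the associated repository. The tokenizer has a vocabulary size of 4105. The model inputs then take the following form: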

<CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>
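To make the input format concrete, here is a simplified pure-Python sketch of non-overlapping 6-mer tokenization. It is not the Nucleotide Transformer Tokenizer itself: the function name is hypothetical, leftover nucleotides at the end are assumed to become single-nucleotide tokens, and special handling of `N` characters is ignored.

```python
from typing import List


def tokenize_6mers(sequence: str, k: int = 6) -> List[str]:
    """Split a DNA sequence into non-overlapping k-mers, prefixed by <CLS>."""
    sequence = sequence.upper()
    n_full = len(sequence) // k
    tokens = ["<CLS>"]
    # Full k-mers, read left to right without overlap
    tokens += [f"<{sequence[i * k:(i + 1) * k]}>" for i in range(n_full)]
    # Assumption: any leftover nucleotides are tokenized one at a time
    tokens += [f"<{nt}>" for nt in sequence[n_full * k:]]
    return tokens


print(" ".join(tokenize_6mers("ACGTGTACGTGCACGGACGACTAGTCAGCA")))
# <CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>
```

A 30,000-nucleotide input therefore yields 5,000 6-mer tokens plus the CLS token.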

Training

The model was trained on a DGX H100 node with 8 GPUs on a total of 23B tokens for 3 days. It was trained on sequences of 3 kb, 10 kb, 20 kb and finally 30 kb, with an effective batch size of 256 sequences at each length.

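Architecture

The model is composed of the nucleotide-transformer-v2-500m-multi-species encoder, from which the language model head was removed and replaced by a 1-dimensional U-Net segmentation head [4] made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these blocks is made of 2 convolutional layers with 1,024 and 2,048 kernels respectively. This additional segmentation head accounts for 53 million parameters, bringing the total number of parameters to 562M.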
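The two downsampling blocks are why the usage snippet requires the number of DNA tokens to be divisible by 4. The sketch below traces the token-axis length through the head, assuming each downsampling block halves the length and each upsampling block doubles it. That assumption is consistent with the divisibility requirement, but it is not a description of the actual layer implementation, and the function name is hypothetical.

```python
from typing import List


def unet_token_lengths(num_dna_tokens: int, num_blocks: int = 2) -> List[int]:
    """Token-axis length at the input, after each down block, then each up block."""
    factor = 2 ** num_blocks
    if num_dna_tokens % factor != 0:
        raise ValueError(
            f"number of DNA tokens must be divisible by {factor}, got {num_dna_tokens}"
        )
    lengths = [num_dna_tokens]
    for _ in range(num_blocks):
        lengths.append(lengths[-1] // 2)  # assumed: each down block halves the length
    for _ in range(num_blocks):
        lengths.append(lengths[-1] * 2)   # assumed: each up block doubles the length
    return lengths


print(unet_token_lengths(12))    # [12, 6, 3, 6, 12]
print(unet_token_lengths(5000))  # [5000, 2500, 1250, 2500, 5000]
```

With a token count that is not divisible by 4, the length would not return to its original value after upsampling, so the head could not produce one output per input token.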

BibTeX entry and citation info

@article{de2024segmentnt,
  title={SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models},
  author={de Almeida, Bernardo P and Dalla-Torre, Hugo and Richard, Guillaume and Blum, Christopher and Hexemer, Lorenz and Gelard, Maxence and Pandey, Priyanka and Laurent, Stefan and Laterre, Alexandre and Lang, Maren and others},
  journal={bioRxiv},
  pages={2024--03},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}