ModernPubMedBERT: an open-source sentence-transformer model with free support for multi-dimensional biomedical text processing

ModernPubMedBERT

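Developed by lokeshch19

A sentence-transformer model trained on the PubMed dataset. It supports multiple embedding dimensions and is suited to biomedical text processing.

Text embedding. Open-source license: Apache-2.0. Tags: #biomedical text embedding, #multi-dimensional vector representation, #medical semantic similarity

Downloads: 380

Release date: 4/16/2025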

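Model overview

This is a sentence-transformer model trained on the PubMed dataset. Using Matryoshka (nested) representation learning, it maps sentences and paragraphs into a dense vector space that can be used at several embedding dimensions. It is suited to tasks such as semantic textual similarity, semantic search, and paraphrase mining.

Model features

Multiple embedding dimensions

Supports embedding dimensions of 768, 512, 384, 256, and 128, so you can pick the size that fits your application.

Long-sequence support

Supports a maximum sequence length of 8192 tokens, which makes it suitable for long texts.

Biomedical optimization

Trained on the PubMed dataset, so it is particularly suited to biomedical and clinical text.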

Model capabilities

Semantic textual similarity

Semantic search

Paraphrase mining

Text classification

Clustering

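Use cases

Biomedical literature processing

Medical literature similarity analysis

Computes the semantic similarity between medical papers, helping researchers find related literature quickly.

Clinical diagnosis support

Analyzes clinical text to support physicians' diagnostic decisions.

Text mining

Clustering of medical texts

Clusters large volumes of medical text to surface latent topics and patterns.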

🚀 ModernPubMedBERT

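This is a sentence-transformers model trained on the PubMed dataset. It uses Matryoshka Representation Learning to map sentences and paragraphs into a dense vector space at several embedding dimensions (768, 512, 384, 256, 128). You can therefore choose the embedding size that fits your application while maintaining high performance on tasks such as semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering.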
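With Matryoshka-style embeddings, a smaller dimension is typically obtained by keeping the leading components of the full vector and rescaling the result to unit length. The following is a minimal, dependency-free sketch of that step. The helper name `truncate_and_normalize` and the 8-dimensional toy vector are illustrative assumptions and are not part of the model or the library.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale them to unit L2 norm.

    Hypothetical helper for illustration; not part of sentence-transformers.
    """
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return list(head)
    return [x / norm for x in head]


# Toy 8-dimensional vector standing in for a 768-dimensional model output.
full = [0.5, -0.5, 0.5, -0.5, 0.1, 0.2, -0.1, 0.3]
small = truncate_and_normalize(full, 4)
print(len(small))
```

Recent Sentence Transformers releases also accept a `truncate_dim` argument on `SentenceTransformer(...)`, which should apply the truncation at encode time. Check the documentation for your installed version before relying on it.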

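✨ Key features

Model details

Property	Details
Model type	Sentence Transformer
Maximum sequence length	8192 tokens
Output dimensionality	768 dimensions
Similarity function	Cosine similarity
Language	en
License	apache-2.0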

Full model architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
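The pooling configuration above enables only `pooling_mode_cls_token`, so the sentence embedding is the vector of the first ([CLS]) token rather than an average over tokens. The sketch below contrasts the two on toy data. The function names and the 2-dimensional token vectors are assumptions made for illustration; the library's `Pooling` module operates on batched tensors instead.

```python
def cls_pooling(token_embeddings):
    """Return the first token's vector ([CLS]) as the sentence embedding."""
    if not token_embeddings:
        raise ValueError("need at least one token")
    return list(token_embeddings[0])


def mean_pooling(token_embeddings, attention_mask):
    """Average the vectors of unmasked tokens (shown for contrast only;
    this mode is disabled in the configuration above)."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m]
    if not kept:
        raise ValueError("attention mask removes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy sequence: three real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]

cls_vec = cls_pooling(tokens)
mean_vec = mean_pooling(tokens, mask)
print(cls_vec, mean_vec)
```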

📦 Installation

First, install the Sentence Transformers library.

pip install -U sentence-transformers

💻 Usage examples

Basic usage

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("lokeshch19/ModernPubMedBERT")
# Run inference
sentences = [
    "The patient was diagnosed with type 2 diabetes mellitus",
    "The individual shows symptoms of hyperglycemia and insulin resistance",
    "Metastatic cancer requires aggressive treatment approaches"
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
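The similarity scores computed above are the basis for semantic search: score a query embedding against each document embedding and sort by score. Below is a small standard-library sketch of that ranking step using cosine similarity, the similarity function listed for this model. The helper names and the 3-dimensional toy vectors are assumptions for illustration; in practice the vectors would come from `model.encode`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, corpus):
    """Return (index, score) pairs sorted by descending cosine similarity."""
    scores = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real sentence embeddings.
query = [1.0, 0.0, 0.0]
corpus = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]

ranking = rank_by_similarity(query, corpus)
for index, score in ranking:
    print(index, round(score, 4))
```

Because cosine similarity ignores vector length, the second corpus vector ranks first even though its magnitude differs from the query's.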

Advanced usage

# Loss function configuration
# Loss: MatryoshkaLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with the following parameters
{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [
        768,
        512,
        384,
        256,
        128
    ],
    "matryoshka_weights": [
        1,
        1,
        1,
        1,
        1
    ],
    "n_dims_per_step": -1
}
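Conceptually, this configuration applies an in-batch-negatives ranking loss to the embeddings truncated at each listed dimension and adds the results using the given weights. With `n_dims_per_step` set to -1, every dimension is used at each step. The sketch below illustrates that idea in plain Python. It is not the library's implementation, and the function names, the cosine scoring with a scale of 20, and the toy vectors are assumptions chosen for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positive in the batch acts as a negative."""
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [scale * sum(x * y for x, y in zip(anchor, pos)) for pos in p]
        top = max(logits)
        # Numerically stable log-sum-exp.
        log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum - logits[i]
    return total / len(a)


def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the ranking loss over truncated embeddings."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        total += weight * in_batch_ranking_loss(
            [v[:dim] for v in anchors], [v[:dim] for v in positives]
        )
    return total


# Toy 4-dimensional batch; the real configuration uses 768 down to 128.
anchors = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
positives = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

matched = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
swapped = matryoshka_loss(anchors, positives[::-1], dims=[4, 2], weights=[1, 1])
print(matched < swapped)
```

Correctly paired batches give a loss near zero at every truncation, while mismatched pairs are penalized at each dimension. This is what pushes the leading components of the embedding to carry useful information on their own.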

Framework versions

Component	Version
Python	3.10.10
Sentence Transformers	4.1.0
Transformers	4.51.3
PyTorch	2.7.0+cu128
Accelerate	1.6.0
Datasets	3.5.1
Tokenizers	0.21.1

📚 Documentation

Citation

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}