BioLORD-STAMB2-v1: a free, open-source model for semantic representations of clinical sentences and biomedical concepts

Biolord STAMB2 V1

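Developed by FremyCompany

BioLORD is a new pre-training strategy for producing meaningful representations of clinical sentences and biomedical concepts; this model was trained with it.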

Text embedding

PyTorch

Language: English. Open-source license: Other. #biomedical semantic embedding #clinical term similarity #ontology representation learning

Downloads: 49

Released: 10/20/2022

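Model overview

The model anchors concept representations to definitions and to short descriptions derived from biomedical ontologies. The resulting semantic representations follow the hierarchical structure of those ontologies, which makes the model well suited to medical documents such as electronic health records (EHR) and clinical notes.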

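Model features

Semantic representation generation

Uses concept definitions and ontology-derived descriptions as anchors to produce semantic representations that match the hierarchical structure of biomedical ontologies.

Biomedical domain optimization

Fine-tuned specifically for the biomedical domain, so it handles clinical documents and medical terminology well.

Multi-task support

Supports similarity computation for both clinical sentences and biomedical concepts.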

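Model capabilities

Sentence similarity computation

Biomedical concept representation generation

Clinical document feature extraction

Text clustering

Semantic search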

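Use cases

Clinical medicine

Medical term matching

Identifies terms that are worded differently but refer to the same medical concept.

Reported state-of-the-art results on the MayoSRS dataset at the time of publication (2022).

Electronic health record analysis

Extracts and links relevant medical concepts from clinical notes.

Biomedical research

Biomedical ontology alignment

Helps integrate biomedical ontology data from different sources.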

🚀 FremyCompany/BioLORD-STAMB2-v1

This model was trained using BioLORD, a new pre-training strategy for producing meaningful representations for clinical sentences and biomedical concepts.

⚠️ Important note

This model was introduced in 2022. Newer versions have been published since then.
For most use cases, BioLORD-2023, the latest generation of BioLORD models, is the better choice.

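State-of-the-art methodologies operate by maximizing the similarity in representation of names referring to the same concept, and preventing collapse through contrastive learning. However, because biomedical names are not always self-explanatory, this sometimes results in non-semantic representations.

BioLORD overcomes this issue by grounding its concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph consisting of biomedical ontologies. Thanks to this grounding, our model produces more semantic concept representations that match more closely the hierarchical structure of ontologies. At publication, BioLORD established a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).

This model is based on sentence-transformers/all-mpnet-base-v2 and was further fine-tuned on the BioLORD-Dataset.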

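🚀 Quick start

This model was developed to produce meaningful representations for clinical sentences and biomedical concepts. The sections below describe its features and how to use it.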

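✨ Main features

This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.
It has been fine-tuned for the biomedical domain. It remains good at producing embeddings for general-purpose text, but it is more useful for medical documents such as EHR records and clinical notes.
It can embed both sentences and phrases in the same latent space.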
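As an illustration of the semantic search use mentioned above, here is a minimal, dependency-free sketch that ranks corpus vectors by cosine similarity to a query vector. The helper names `l2_normalize` and `semantic_search` are hypothetical and not part of sentence-transformers, and the toy 3-dimensional vectors stand in for the 768-dimensional embeddings that `model.encode(...)` would return.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def semantic_search(
    query_vec: Sequence[float],
    corpus_vecs: Sequence[Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (corpus index, cosine score) pairs, best match first."""
    q = l2_normalize(query_vec)
    scored = []
    for idx, vec in enumerate(corpus_vecs):
        c = l2_normalize(vec)
        # For unit-length vectors the dot product equals the cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, c))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy stand-ins for real embeddings (assumption: illustration only).
corpus = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [1.0, 0.0, 0.0]
hits = semantic_search(query, corpus, top_k=2)
print(hits)
```

With real data, the corpus vectors would typically be computed once and cached, since only the query needs to be encoded at search time.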

📦 Installation

Install sentence-transformers to use this model easily:

pip install -U sentence-transformers

💻 Usage examples

Basic usage (Sentence-Transformers)

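from sentence_transformers import SentenceTransformer

# Terms to embed: related but distinct biomedical concepts
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

# Downloads the model from the Hugging Face Hub on first use
model = SentenceTransformer('FremyCompany/BioLORD-STAMB2-v1')

# encode() returns one 768-dimensional vector per input
embeddings = model.encode(sentences)
print(embeddings)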
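Once embeddings are available, a common next step is to compare them pairwise. The following is a minimal sketch using only the standard library so that it runs without downloading the model. `cosine_similarity` and `similarity_matrix` are illustrative helper names, not sentence-transformers functions, and the toy vectors are assumptions standing in for the rows of the embedding matrix.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise cosine similarities between all vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy 3-dimensional vectors (assumption: illustration only).
toy = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
matrix = similarity_matrix(toy)
for row in matrix:
    print([round(value, 3) for value in row])
```

The same functions accept the rows of the array returned by `model.encode(sentences)`, since they only iterate over the values.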

Advanced usage (HuggingFace Transformers)

To use the model without sentence-transformers, first pass the input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('FremyCompany/BioLORD-STAMB2-v1')
model = AutoModel.from_pretrained('FremyCompany/BioLORD-STAMB2-v1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:")
print(sentence_embeddings)
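To make the pooling step concrete, the sketch below reproduces masked mean pooling on a tiny hand-made example in plain Python. `masked_mean` is a hypothetical helper written for illustration, and the token values are invented. It shows why the attention mask matters: padding positions would otherwise distort the average.

```python
from typing import List


def masked_mean(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask is 1; padding (mask 0) is ignored."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            for i, value in enumerate(vec):
                summed[i] += value
    # Same guard against division by zero as torch.clamp(..., min=1e-9).
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]


# Two real tokens followed by one padding position (invented values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
# A naive mean over all positions lets the padding row dominate the result.
naive = [sum(column) / len(tokens) for column in zip(*tokens)]
print(pooled, naive)
```

Here only the first two rows contribute to `pooled`, whereas `naive` also averages in the padding row.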

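📚 Documentation

General purpose

This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search. It has been fine-tuned for the biomedical domain. It remains good at producing embeddings for general-purpose text, but it is more useful for medical documents.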

Citation

This model accompanies the paper BioLORD: Learning Ontological Representations from Definitions, published in Findings of EMNLP 2022. If you use this model, please cite the original paper as follows:

@inproceedings{remy-etal-2022-biolord,
    title = "{B}io{LORD}: Learning Ontological Representations from Definitions for Biomedical Concepts and their Textual Descriptions",
    author = "Remy, François  and
      Demuynck, Kris  and
      Demeester, Thomas",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.104",
    pages = "1454--1465",
    abstract = "This work introduces BioLORD, a new pre-training strategy for producing meaningful representations for clinical sentences and biomedical concepts. State-of-the-art methodologies operate by maximizing the similarity in representation of names referring to the same concept, and preventing collapse through contrastive learning. However, because biomedical names are not always self-explanatory, it sometimes results in non-semantic representations. BioLORD overcomes this issue by grounding its concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph consisting of biomedical ontologies. Thanks to this grounding, our model produces more semantic concept representations that match more closely the hierarchical structure of ontologies. BioLORD establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).",
}

You may also want to look at our MWE 2023 paper:

Detecting Idiomatic Multiword Expressions in Clinical Terminology using Definition-Based Representation Learning

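📄 License

My own contributions to this model are covered by the MIT license. However, because the data used to train this model originates from UMLS, you need to make sure you hold a proper UMLS license before using this model. UMLS is free of charge in most countries, but you may have to create an account and report on your data usage yearly to keep a valid license.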

Additional information

Attribute	Details
Pipeline tag	Sentence similarity
Tags	sentence-transformers, feature-extraction, sentence-similarity
Language	English
License	Other
Datasets	FremyCompany/BioLORD-Dataset