🚀 kamalkraj/BioSimCSE-BioLinkBERT-BASE
This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.
🚀 Quick Start
📦 Installation
Using this model becomes easy when you have sentence-transformers installed:
pip install -U sentence-transformers
💻 Usage Examples
Basic Usage
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model from the Hugging Face Hub and encode the sentences
model = SentenceTransformer('kamalkraj/BioSimCSE-BioLinkBERT-BASE')
embeddings = model.encode(sentences)
print(embeddings)
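The returned embeddings can be compared directly for semantic search or clustering. A minimal sketch using sentence_transformers.util; the biomedical example sentences are illustrative and not taken from the model card:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('kamalkraj/BioSimCSE-BioLinkBERT-BASE')

# Two related biomedical sentences plus one unrelated sentence
sentences = [
    "Aspirin inhibits platelet aggregation.",
    "Acetylsalicylic acid reduces the clumping of platelets.",
    "The hospital cafeteria opens at noon.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities; the related pair should score highest
scores = util.cos_sim(embeddings, embeddings)
print(scores)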
Advanced Usage
Without sentence-transformers, you can use the model as follows: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('kamalkraj/BioSimCSE-BioLinkBERT-BASE')
model = AutoModel.from_pretrained('kamalkraj/BioSimCSE-BioLinkBERT-BASE')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Apply mean pooling to get one fixed-size vector per sentence
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
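If you want unit-length vectors, so that a plain dot product equals cosine similarity, you can L2-normalize the pooled output. A short continuation of the snippet above:

import torch.nn.functional as F

# L2-normalize so that dot products equal cosine similarities
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print(sentence_embeddings @ sentence_embeddings.T)  # pairwise cosine similarity matrix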
📚 Documentation
Evaluation Results
For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
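As one way to run such an evaluation locally, the model can be plugged into the MTEB harness. This is a hedged sketch: the choice of the biomedical STS task BIOSSES is our own illustration, not something specified by the model card.

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('kamalkraj/BioSimCSE-BioLinkBERT-BASE')

# BIOSSES (biomedical sentence similarity) chosen purely as an illustration
evaluation = MTEB(tasks=["BIOSSES"])
evaluation.run(model, output_folder="results/BioSimCSE-BioLinkBERT-BASE")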
Training
The model was trained with the following parameters:
DataLoader:
torch.utils.data.dataloader.DataLoader of length 7708 with the following parameters:
{'batch_size': 128, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
Loss:
sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with the following parameters:
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
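For intuition: with this loss, each (anchor, positive) pair in a batch treats all other positives as in-batch negatives. The scaled cosine-similarity matrix is scored with cross-entropy against the matching index. A rough numerical sketch of that computation, not the library's internal code:

import torch
import torch.nn.functional as F

def mnr_loss(anchor_emb, positive_emb, scale=20.0):
    # Cosine similarity between every anchor and every in-batch candidate
    a = F.normalize(anchor_emb, p=2, dim=1)
    p = F.normalize(positive_emb, p=2, dim=1)
    scores = a @ p.T * scale  # shape: (batch, batch)
    # The i-th anchor's positive is the i-th candidate; all others act as negatives
    labels = torch.arange(scores.size(0))
    return F.cross_entropy(scores, labels)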
Parameters of the fit() method:
{
    "epochs": 1,
    "evaluation_steps": 0,
    "evaluator": "NoneType",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 5e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 771,
    "weight_decay": 0.01
}
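Put together, a training run with these settings would look roughly like the sketch below. The (anchor, positive) training pairs are hypothetical stand-ins, since the model card does not include the data-loading code, and the starting checkpoint is likewise an assumption:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumption: training starts from a BioLinkBERT-base style checkpoint
model = SentenceTransformer('michiyasunaga/BioLinkBERT-base')

# Hypothetical (anchor, positive) pairs; the real training corpus is not shown in this card
train_examples = [
    InputExample(texts=["Aspirin inhibits platelet aggregation.",
                        "Acetylsalicylic acid reduces platelet clumping."]),
    # ... enough pairs to give the reported DataLoader length of 7708 at batch_size=128
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=128)
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    scheduler="WarmupLinear",
    warmup_steps=771,
    optimizer_params={"lr": 5e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)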
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
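The same two-module stack can be assembled explicitly from the sentence-transformers building blocks, which makes the mean-pooling configuration above visible in code; a minimal sketch:

from sentence_transformers import SentenceTransformer, models

# Module (0): the transformer encoder, truncating inputs at 128 tokens
word_embedding_model = models.Transformer('kamalkraj/BioSimCSE-BioLinkBERT-BASE',
                                          max_seq_length=128)

# Module (1): mean pooling over the 768-dimensional token embeddings
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])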
Citing & Authors
@inproceedings{kanakarajan-etal-2022-biosimcse,
    title = "{B}io{S}im{CSE}: {B}io{M}edical Sentence Embeddings using Contrastive learning",
    author = "Kanakarajan, Kamal raj and
      Kundumani, Bhuvana and
      Abraham, Abhijith and
      Sankarasubbu, Malaikannan",
    booktitle = "Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.louhi-1.10",
    pages = "81--86",
    abstract = "Sentence embeddings in the form of fixed-size vectors that capture the information in the sentence as well as the context are critical components of Natural Language Processing systems. With transformer model based sentence encoders outperforming the other sentence embedding methods in the general domain, we explore the transformer based architectures to generate dense sentence embeddings in the biomedical domain. In this work, we present BioSimCSE, where we train sentence embeddings with domain specific transformer based models with biomedical texts. We assess our model{'}s performance with zero-shot and fine-tuned settings on Semantic Textual Similarity (STS) and Recognizing Question Entailment (RQE) tasks. Our BioSimCSE model using BioLinkBERT achieves state of the art (SOTA) performance on both tasks.",
}