🚀 kamalkraj/BioSimCSE-BioLinkBERT-BASE
This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.
🚀 Quick Start
📦 Installation
Using this model becomes easy when you have sentence-transformers installed:
pip install -U sentence-transformers
💻 Usage Examples
Basic Usage
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model from the Hugging Face Hub and encode the sentences
model = SentenceTransformer('kamalkraj/BioSimCSE-BioLinkBERT-BASE')
embeddings = model.encode(sentences)
print(embeddings)
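The returned embeddings can be compared directly for semantic search or clustering. A minimal sketch using sentence_transformers.util; the biomedical example sentences are illustrative and not taken from the model card:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('kamalkraj/BioSimCSE-BioLinkBERT-BASE')

# Two related biomedical sentences plus one unrelated sentence
sentences = [
    "Aspirin inhibits platelet aggregation.",
    "Acetylsalicylic acid reduces the clumping of platelets.",
    "The hospital cafeteria opens at noon.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities; the related pair should score highest
scores = util.cos_sim(embeddings, embeddings)
print(scores)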
Advanced Usage
Without sentence-transformers, you can use the model as follows: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('kamalkraj/BioSimCSE-BioLinkBERT-BASE')
model = AutoModel.from_pretrained('kamalkraj/BioSimCSE-BioLinkBERT-BASE')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Apply mean pooling to get one fixed-size vector per sentence
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
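If you want unit-length vectors, so that a plain dot product equals cosine similarity, you can L2-normalize the pooled output. A short continuation of the snippet above:

import torch.nn.functional as F

# L2-normalize so that dot products equal cosine similarities
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print(sentence_embeddings @ sentence_embeddings.T)  # pairwise cosine similarity matrix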
📚 Documentation
Evaluation Results
For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
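As one way to run such an evaluation locally, the model can be plugged into the MTEB harness. This is a hedged sketch: the choice of the biomedical STS task BIOSSES is our own illustration, not something specified by the model card.

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('kamalkraj/BioSimCSE-BioLinkBERT-BASE')

# BIOSSES (biomedical sentence similarity) chosen purely as an illustration
evaluation = MTEB(tasks=["BIOSSES"])
evaluation.run(model, output_folder="results/BioSimCSE-BioLinkBERT-BASE")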
Training
The model was trained with the following parameters:
DataLoader:
torch.utils.data.dataloader.DataLoader of length 7708 with the following parameters:
{'batch_size': 128, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
Loss:
sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with the following parameters:
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
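For intuition: with this loss, each (anchor, positive) pair in a batch treats all other positives as in-batch negatives. The scaled cosine-similarity matrix is scored with cross-entropy against the matching index. A rough numerical sketch of that computation, not the library's internal code:

import torch
import torch.nn.functional as F

def mnr_loss(anchor_emb, positive_emb, scale=20.0):
    # Cosine similarity between every anchor and every in-batch candidate
    a = F.normalize(anchor_emb, p=2, dim=1)
    p = F.normalize(positive_emb, p=2, dim=1)
    scores = a @ p.T * scale  # shape: (batch, batch)
    # The i-th anchor's positive is the i-th candidate; all others act as negatives
    labels = torch.arange(scores.size(0))
    return F.cross_entropy(scores, labels)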
Parameters of the fit() method:
{
    "epochs": 1,
    "evaluation_steps": 0,
    "evaluator": "NoneType",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 5e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 771,
    "weight_decay": 0.01
}
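Put together, a training run with these settings would look roughly like the sketch below. The (anchor, positive) training pairs are hypothetical stand-ins, since the model card does not include the data-loading code, and the starting checkpoint is likewise an assumption:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumption: training starts from a BioLinkBERT-base style checkpoint
model = SentenceTransformer('michiyasunaga/BioLinkBERT-base')

# Hypothetical (anchor, positive) pairs; the real training corpus is not shown in this card
train_examples = [
    InputExample(texts=["Aspirin inhibits platelet aggregation.",
                        "Acetylsalicylic acid reduces platelet clumping."]),
    # ... enough pairs to give the reported DataLoader length of 7708 at batch_size=128
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=128)
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    scheduler="WarmupLinear",
    warmup_steps=771,
    optimizer_params={"lr": 5e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)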
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
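The same two-module stack can be assembled explicitly from the sentence-transformers building blocks, which makes the mean-pooling configuration above visible in code; a minimal sketch:

from sentence_transformers import SentenceTransformer, models

# Module (0): the transformer encoder, truncating inputs at 128 tokens
word_embedding_model = models.Transformer('kamalkraj/BioSimCSE-BioLinkBERT-BASE',
                                          max_seq_length=128)

# Module (1): mean pooling over the 768-dimensional token embeddings
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])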
Citing & Authors
@inproceedings{kanakarajan-etal-2022-biosimcse,
    title = "{B}io{S}im{CSE}: {B}io{M}edical Sentence Embeddings using Contrastive learning",
    author = "Kanakarajan, Kamal raj and
      Kundumani, Bhuvana and
      Abraham, Abhijith and
      Sankarasubbu, Malaikannan",
    booktitle = "Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.louhi-1.10",
    pages = "81--86",
    abstract = "Sentence embeddings in the form of fixed-size vectors that capture the information in the sentence as well as the context are critical components of Natural Language Processing systems. With transformer model based sentence encoders outperforming the other sentence embedding methods in the general domain, we explore the transformer based architectures to generate dense sentence embeddings in the biomedical domain. In this work, we present BioSimCSE, where we train sentence embeddings with domain specific transformer based models with biomedical texts. We assess our model{'}s performance with zero-shot and fine-tuned settings on Semantic Textual Similarity (STS) and Recognizing Question Entailment (RQE) tasks. Our BioSimCSE model using BioLinkBERT achieves state of the art (SOTA) performance on both tasks.",
}