BioSimCSE-BioLinkBERT-BASE: a free, open-source model for biomedical text similarity

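BioSimCSE BioLinkBERT BASE

Developed by kamalkraj

A biomedical sentence embedding model based on BioLinkBERT, designed for computing similarity between biomedical texts

Text Embedding

Transformers

#BiomedicalTextEmbedding #ContrastiveLearning #ScientificLiteratureSimilarity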

Downloads: 774

Released: 12/5/2022

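Model overview

This is a sentence-transformers model that maps biomedical sentences and paragraphs to a 768-dimensional dense vector space. It is suited to tasks such as clustering and semantic search.

Model features

Optimized for the biomedical domain

Trained specifically on biomedical text. The accompanying paper reports state-of-the-art results on biomedical Semantic Textual Similarity (STS) and Recognizing Question Entailment (RQE) tasks.

Contrastive learning

Trained with MultipleNegativesRankingLoss, a contrastive objective, to improve the quality of the sentence embeddings

Compact vector representation

Converts each sentence into a 768-dimensional dense vector that downstream tasks can consume directly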

Model capabilities

Biomedical text similarity computation

Sentence embedding generation

Semantic search

Text clustering

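Use cases

Biomedical research

Literature retrieval enhancement

Improve biomedical literature retrieval systems with semantic similarity

Higher accuracy when retrieving relevant literature

Comparing research findings

Automatically identify similar or related findings across different studies

Faster literature reviews

Clinical decision support

Case similarity analysis

Match similar cases by comparing vectors of symptom descriptions

Supports clinical decision-making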

🚀 kamalkraj/BioSimCSE-BioLinkBERT-BASE

This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.

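🚀 Quick start

✨ Key features

Maps sentences and paragraphs to a 768-dimensional dense vector space.
Suitable for tasks such as clustering and semantic search.

📦 Installation

To use this model, first install sentence-transformers: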

pip install -U sentence-transformers

💻 Usage examples

Basic usage

With the sentence-transformers library, the model can be called as follows:

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('kamalkraj/BioSimCSE-BioLinkBERT-BASE')
embeddings = model.encode(sentences)
print(embeddings)
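
Once embeddings are available, similarity search usually comes down to comparing vectors by cosine similarity. The following is a minimal sketch in plain Python, not part of the original model card. The helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the 3-dimensional vectors are toy stand-ins for the 768-dimensional vectors that `model.encode()` would return.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, corpus_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors (assumption): real embeddings would have 768 dimensions.
query = [1.0, 0.0, 1.0]
corpus = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]

ranking = rank_by_similarity(query, corpus)
for index, score in ranking:
    print(index, round(score, 3))
```

Cosine similarity ignores vector length, so the second corpus vector, which points in the same direction as the query, ranks first even though its magnitude differs.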

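Advanced usage

Without sentence-transformers, the model can be used in two steps: pass the input through the Transformer model, then apply the correct pooling operation (here, mean pooling) to the contextualized token embeddings.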

from transformers import AutoTokenizer, AutoModel
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('kamalkraj/BioSimCSE-BioLinkBERT-BASE')
model = AutoModel.from_pretrained('kamalkraj/BioSimCSE-BioLinkBERT-BASE')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
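
To see why the attention mask matters for the averaging, here is a small illustration in plain Python, offered as a sketch rather than as the library's implementation. The function `mean_pool` is a hypothetical single-sentence counterpart of `mean_pooling` above, and the token values are invented; the large values at the padding position are deliberately exaggerated to make the effect visible.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping positions where mask == 0."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for j in range(dim):
                summed[j] += vec[j]
    # Guard against division by zero, mirroring the clamp in mean_pooling above.
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]

# Two real tokens followed by one padding position (mask 0). Toy 2-d values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)                                # padding ignored
naive = [sum(column) / len(tokens) for column in zip(*tokens)]  # padding included

print(pooled)  # [2.0, 3.0]
print(naive)
```

The naive average is pulled toward the padding vector, while the masked average depends only on the real tokens. This is the behavior the `attention_mask` argument provides when sentences of different lengths are padded to a common length.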

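📚 Detailed documentation

Evaluation results

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Training

The model was trained with the following parameters:

DataLoader: torch.utils.data.dataloader.DataLoader of length 7708, with parameters: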

{'batch_size': 128, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss: sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss, with parameters:
```
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
```
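
As a rough illustration of what these two parameters do, the sketch below reimplements the idea of the loss in plain Python. It is a simplified stand-in, not the library code: for each anchor, the paired sentence at the same index is treated as the correct match, every other sentence in the batch acts as a negative, and cross-entropy is taken over cosine similarities multiplied by `scale`. The function names and the 2-dimensional toy vectors are assumptions made for this example.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and all other positives[j] in the batch serve as negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over this anchor's logits.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two pairs (assumption: 2-d vectors instead of 768-d embeddings).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each positive is close to its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]  # each positive is close to the other anchor

loss_aligned = multiple_negatives_ranking_loss(anchors, aligned)
loss_swapped = multiple_negatives_ranking_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is near zero when each anchor is most similar to its own pair and large when the pairs are swapped. Because cosine similarity is bounded in [-1, 1], the `scale` factor of 20 widens the gap between logits before the softmax.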

Parameters of the fit() method:

{
    "epochs": 1,
    "evaluation_steps": 0,
    "evaluator": "NoneType",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 5e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 771,
    "weight_decay": 0.01
}
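
The `WarmupLinear` scheduler together with `warmup_steps` determines how the learning rate changes during training. The sketch below shows the usual shape of such a schedule: a linear ramp from 0 to the base rate, then a linear decay back to 0. It is an approximation written for this document, not the library's code. The total of 7708 steps is an assumption derived from 1 epoch over a DataLoader of length 7708, and the function name `warmup_linear_lr` is hypothetical.

```python
def warmup_linear_lr(step, base_lr=5e-05, warmup_steps=771, total_steps=7708):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Sample the schedule at the start, the end of warmup, the midpoint, and the end.
for step in (0, 771, 3854, 7708):
    print(step, warmup_linear_lr(step))
```

With these values the warmup covers about 10% of training (771 of 7708 steps), and the peak rate of 5e-05 is reached at the end of the warmup.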

Full model architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

📄 License

The documentation does not state any license information.

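🔧 Technical details

The model maps sentences and paragraphs to a 768-dimensional dense vector space. Sentence embeddings are obtained by applying mean pooling to the contextualized token embeddings. Training used the DataLoader, loss function, and optimizer parameters listed under Training. For automated evaluation, see the Sentence Embeddings Benchmark.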

📄 Citation and authors

@inproceedings{kanakarajan-etal-2022-biosimcse,
    title = "{B}io{S}im{CSE}: {B}io{M}edical Sentence Embeddings using Contrastive learning",
    author = "Kanakarajan, Kamal raj  and
      Kundumani, Bhuvana  and
      Abraham, Abhijith  and
      Sankarasubbu, Malaikannan",
    booktitle = "Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.louhi-1.10",
    pages = "81--86",
    abstract = "Sentence embeddings in the form of fixed-size vectors that capture the information in the sentence as well as the context are critical components of Natural Language Processing systems. With transformer model based sentence encoders outperforming the other sentence embedding methods in the general domain, we explore the transformer based architectures to generate dense sentence embeddings in the biomedical domain. In this work, we present BioSimCSE, where we train sentence embeddings with domain specific transformer based models with biomedical texts. We assess our model{'}s performance with zero-shot and fine-tuned settings on Semantic Textual Similarity (STS) and Recognizing Question Entailment (RQE) tasks. Our BioSimCSE model using BioLinkBERT achieves state of the art (SOTA) performance on both tasks.",
}