bioformer-8L: an open-source biomedical text mining model, lightweight and fast, with performance comparable to BioBERT

Bioformer 8L

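Developed by bioformers

A lightweight BERT model designed for biomedical text mining. It runs 3 times as fast as BERT-base, with performance comparable or even superior to BioBERT/PubMedBERT.

Large language model

Transformers

Language: English  License: Apache-2.0  #biomedical-text-mining #lightweight-BERT #whole-word-masking

Downloads: 164

Released: 3/2/2022

Model overview

Bioformer-8L is a lightweight BERT model pretrained from scratch on biomedical corpora with a biomedical-specific vocabulary. It is suited to a wide range of biomedical text mining tasks.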

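Model features

Biomedical-specific

Pretrained exclusively on biomedical corpora (PubMed abstracts and PMC full-text articles) with a biomedical-specific vocabulary.

Efficient and lightweight

42.8M parameters. Runs 3 times as fast as BERT-base while maintaining strong performance on downstream tasks.

Whole-word masking

Pretraining uses whole-word masking with a masking rate of 15%.

Domain vocabulary coverage

The vocabulary is trained on biomedical literature, contains 32768 tokens, and covers special symbols used in biomedical text.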

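Model capabilities

Biomedical text understanding

Masked language modeling

Biomedical named entity recognition

Biomedical text classification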

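Use cases

Biomedical research

Disease concept recognition

Identify disease-related concepts in biomedical text.

In the fill-mask example, the model ranks 'Diabetes' as its top prediction.

Literature classification

Multi-label topic classification of biomedical literature.

Achieved the best performance (highest micro-F1 score) in the BioCreative VII COVID-19 multi-label topic classification challenge.

Clinical text processing

Clinical note analysis

Extract key medical information from clinical notes.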

🚀 Bioformer-8L

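Bioformer-8L is a lightweight BERT model for biomedical text mining. It uses a biomedical vocabulary and is pretrained from scratch on biomedical-domain corpora only. Our experiments show that Bioformer-8L is 3 times as fast as BERT-base and achieves comparable or even better performance than BioBERT/PubMedBERT on downstream NLP tasks.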

🚀 Quick start

Bioformer-8L is used in the same way as a standard BERT model. See the BERT documentation for details.
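
As a hedged illustration of what "same as a standard BERT model" can look like in practice, the sketch below loads the checkpoint with the generic `AutoTokenizer` and `AutoModel` classes from Transformers and returns the `[CLS]` vector for each input text. The helper name `embed_sentences` and the choice of the `[CLS]` vector as a sentence representation are assumptions made for illustration, not something this model card prescribes. The `max_length=512` value mirrors the maximum input sequence length used in pretraining.

```python
def embed_sentences(texts, model_name="bioformers/bioformer-8L"):
    """Return one [CLS] embedding per input text (hypothetical helper)."""
    # Imported lazily so that defining the function needs no heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    # Calling from_pretrained downloads the checkpoint from the Hugging Face Hub.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    batch = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**batch)

    # last_hidden_state has shape (batch, seq_len, hidden); position 0 is [CLS].
    return outputs.last_hidden_state[:, 0]
```

A call such as `embed_sentences(["Metformin is used to treat type 2 diabetes."])` would return a tensor with one row per input. For classification or entity recognition, the usual BERT task heads in Transformers can be loaded from the same checkpoint name and fine-tuned.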

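⚠️ Important note

bioformer-cased-v1.0 has been renamed to bioformer-8L. All links to bioformer-cased-v1.0, including Git operations, are redirected to bioformer-8L automatically. To avoid confusion, however, we recommend updating any existing local clones to point to the new repository URL.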

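✨ Key features

Lightweight and efficient: 3 times as fast as BERT-base.
Domain-adapted: uses a biomedical vocabulary and is pretrained on biomedical-domain corpora only.
Strong performance: comparable or even better than BioBERT/PubMedBERT on downstream NLP tasks.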

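📦 Installation

Prerequisites

python3, pytorch, transformers and datasets

We have tested the following commands on Python v3.9.16, PyTorch v1.13.1+cu117, Datasets v2.9.0 and Transformers v4.26.

Installation steps

Install pytorch by following the official PyTorch installation instructions.
Install the transformers and datasets libraries: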

pip install transformers
pip install datasets
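
After installing, it can be useful to compare the local environment against the tested versions listed above. The following is a minimal sketch using only the Python standard library; the helper name `installed_versions` is hypothetical, and `torch` is the distribution name under which PyTorch is published on PyPI.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("torch", "transformers", "datasets")):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print one line per package so mismatches with the tested versions are easy to spot.
for name, ver in installed_versions().items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```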

💻 Usage examples

Basic usage

from transformers import pipeline

# Fill-mask pipeline backed by Bioformer-8L; returns the top predictions for [MASK].
unmasker8L = pipeline('fill-mask', model='bioformers/bioformer-8L')
unmasker8L("[MASK] refers to a group of diseases that affect how the body uses blood sugar (glucose)")

# Same prompt with the larger Bioformer-16L model, for comparison.
unmasker16L = pipeline('fill-mask', model='bioformers/bioformer-16L')
unmasker16L("[MASK] refers to a group of diseases that affect how the body uses blood sugar (glucose)")

Example output

Output of `bioformer-8L`

[{'score': 0.3207533359527588, 
'token': 13473, 
'token_str': 'Diabetes', 
'sequence': 'Diabetes refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}, 

{'score': 0.19234347343444824, 
'token': 17740, 
'token_str': 'Obesity', 
'sequence': 'Obesity refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}, 

{'score': 0.09200277179479599, 
'token': 10778, 
'token_str': 'T2DM', 
'sequence': 'T2DM refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}, 

{'score': 0.08494312316179276, 
'token': 2228, 
'token_str': 'It', 
'sequence': 'It refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}, 

{'score': 0.0412776917219162, 
'token': 22263, 
'token_str': 'Hypertension', 
'sequence': 'Hypertension refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}]

Output of `bioformer-16L`

[{'score': 0.7262957692146301,
'token': 13473,
'token_str': 'Diabetes',
'sequence': 'Diabetes refers to a group of diseases that affect how the body uses blood sugar ( glucose )'},

{'score': 0.124954953789711,
'token': 10778,
'token_str': 'T2DM',
'sequence': 'T2DM refers to a group of diseases that affect how the body uses blood sugar ( glucose )'},

{'score': 0.04062706232070923,
'token': 2228,
'token_str': 'It',
'sequence': 'It refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}, 

{'score': 0.022694870829582214,
'token': 17740,
'token_str': 'Obesity',
'sequence': 'Obesity refers to a group of diseases that affect how the body uses blood sugar ( glucose )'},

{'score': 0.009743048809468746,
'token': 13960,
'token_str': 'T2D',
'sequence': 'T2D refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}]
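
The pipeline returns a list of dictionaries, so downstream code usually reduces it to the fields it needs. The sketch below filters and formats such a list. The helper name `top_predictions` and the `min_score` threshold of 0.05 are illustrative assumptions; the sample scores are copied from the `bioformer-8L` output above, with the `sequence` field omitted for brevity.

```python
def top_predictions(results, min_score=0.05):
    """Return (token_str, rounded score) pairs with score >= min_score, best first."""
    kept = [r for r in results if r["score"] >= min_score]
    kept.sort(key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], round(r["score"], 3)) for r in kept]


# Scores copied from the bioformer-8L output shown earlier.
sample_8l = [
    {"score": 0.3207533359527588, "token": 13473, "token_str": "Diabetes"},
    {"score": 0.19234347343444824, "token": 17740, "token_str": "Obesity"},
    {"score": 0.09200277179479599, "token": 10778, "token_str": "T2DM"},
    {"score": 0.08494312316179276, "token": 2228, "token_str": "It"},
    {"score": 0.0412776917219162, "token": 22263, "token_str": "Hypertension"},
]

print(top_predictions(sample_8l))
# [('Diabetes', 0.321), ('Obesity', 0.192), ('T2DM', 0.092), ('It', 0.085)]
```

In the two outputs above, both models rank 'Diabetes' first, but bioformer-16L assigns it a much higher score (about 0.73 versus 0.32).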

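📚 Documentation

Vocabulary of Bioformer-8L

Bioformer-8L uses a cased WordPiece vocabulary trained on a biomedical corpus comprising all PubMed abstracts (33 million, as of February 1, 2021) and 1 million PMC full-text articles. PMC has 3.6 million articles, but we downsampled them to 1 million so that the total sizes of the PubMed abstracts and the PMC full-text articles are approximately equal. To mitigate the out-of-vocabulary issue and include special symbols used in biomedical literature (for example, the male and female symbols), we trained Bioformer's vocabulary from the Unicode text of the two resources. The vocabulary size of Bioformer-8L is 32768 (2^15), which is similar to that of the original BERT.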
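
To illustrate why a domain-specific WordPiece vocabulary helps, here is a minimal sketch of greedy longest-match-first WordPiece splitting for a single word. Both vocabularies are tiny hypothetical examples made up for this sketch; they are not excerpts from the BERT or Bioformer vocabularies, and the splits shown do not claim to match what either real tokenizer produces.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is out of vocabulary
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies, for illustration only.
general_vocab = {"hyper", "##tens", "##ion", "##ten", "##s"}
biomedical_vocab = general_vocab | {"hypertension", "♂", "♀"}

print(wordpiece("hypertension", general_vocab))     # ['hyper', '##tens', '##ion']
print(wordpiece("hypertension", biomedical_vocab))  # ['hypertension']
print(wordpiece("♂", general_vocab))                # ['[UNK]']
print(wordpiece("♂", biomedical_vocab))             # ['♂']
```

A vocabulary that contains frequent domain terms as single tokens produces shorter sequences, and training it on Unicode text keeps symbols such as ♂ and ♀ from collapsing into the unknown token.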

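Pretraining of Bioformer-8L

Bioformer-8L was pretrained from scratch on the same corpus as the vocabulary (33 million PubMed abstracts + 1 million PMC full-text articles). For the masked language modeling (MLM) objective, we used whole-word masking with a masking rate of 15%. There is debate on whether the next sentence prediction (NSP) objective improves performance on downstream tasks. We included it in our pretraining experiment in case end users need next-sentence prediction. Sentence segmentation of all training text was performed using SciSpacy.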
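
Whole-word masking means that when a word is chosen for masking, all of its WordPiece tokens are masked together. The following is a simplified sketch of that idea, not the pretraining code used for Bioformer. It assumes tokens are already WordPiece-split with `##` marking continuation pieces, always substitutes `[MASK]` (BERT-style pretraining also sometimes keeps the token or swaps in a random one), and does not handle special tokens such as `[CLS]` and `[SEP]`. The function name, the `seed` parameter, and the sample sentence are hypothetical.

```python
import random


def whole_word_mask(tokens, mask_rate=0.15, seed=0, mask_token="[MASK]"):
    """Mask whole words: every WordPiece of a selected word is replaced together."""
    # Group token indices into words; a piece starting with "##" continues the previous word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    rng = random.Random(seed)
    rng.shuffle(words)
    budget = max(1, round(len(tokens) * mask_rate))  # number of tokens to mask

    masked = list(tokens)
    n_masked = 0
    for word in words:
        if n_masked >= budget:
            break
        if n_masked + len(word) > budget:
            continue  # skip words that would exceed the budget
        for i in word:
            masked[i] = mask_token
        n_masked += len(word)
    return masked


tokens = ["hyper", "##tens", "##ion", "is", "common", "in", "dia", "##bet", "##ic",
          "patients", "with", "obesity", "and", "high", "glucose", "levels",
          "after", "meals", "every", "day"]
print(whole_word_mask(tokens))
```

With 20 tokens and a 15% rate the budget is 3 tokens, so the result masks either three single-token words or one three-piece word such as `hyper ##tens ##ion` in full, never a fragment of a word.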

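Pretraining of Bioformer-8L was performed on a single Cloud TPU device (TPUv2, 8 cores, 8GB memory per core). The maximum input sequence length was fixed to 512, and the batch size was set to 256. We pretrained Bioformer-8L for 2 million steps, which took about 8.3 days.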
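
The figures above imply a rough training budget, worked out below. This is back-of-the-envelope arithmetic only: the token count is an upper bound that assumes every sequence is filled to the maximum length of 512, which the source does not state.

```python
STEPS = 2_000_000      # pretraining steps
BATCH_SIZE = 256       # sequences per step
MAX_SEQ_LEN = 512      # maximum tokens per sequence
DAYS = 8.3             # approximate wall-clock time

sequences_seen = STEPS * BATCH_SIZE             # 512,000,000 sequences
max_tokens_seen = sequences_seen * MAX_SEQ_LEN  # upper bound: 262,144,000,000 tokens
steps_per_second = STEPS / (DAYS * 24 * 3600)   # average throughput

print(f"sequences processed: {sequences_seen:,}")
print(f"tokens processed (upper bound): {max_tokens_seen:,}")
print(f"average steps per second: {steps_per_second:.2f}")
```

Under these assumptions the run processed 512 million sequences, at most about 262 billion token positions, at an average of roughly 2.8 steps per second.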

🏆 Awards

Bioformer-8L achieved the best performance (highest micro-F1 score) in the BioCreative VII COVID-19 multi-label topic classification challenge (https://doi.org/10.1093/database/baac069).
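
For readers unfamiliar with the metric, micro-F1 in multi-label classification pools true positives, false positives, and false negatives over all (document, label) pairs before computing F1. The sketch below is a generic implementation of that definition, not the challenge's official scoring script, and the label sets are made-up illustrative data.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (document, label) pairs; inputs are lists of label sets."""
    tp = fp = fn = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        tp += len(true_labels & pred_labels)   # labels predicted and correct
        fp += len(pred_labels - true_labels)   # labels predicted but wrong
        fn += len(true_labels - pred_labels)   # labels missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up example: three documents, each with a set of topic labels.
y_true = [{"Treatment", "Mechanism"}, {"Prevention"}, {"Diagnosis", "Treatment"}]
y_pred = [{"Treatment"}, {"Prevention", "Transmission"}, {"Diagnosis", "Treatment"}]

print(micro_f1(y_true, y_pred))  # 0.8  (tp=4, fp=1, fn=1)
```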

🔗 Related links

Bioformer-16L

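🙏 Acknowledgments

Training and evaluation of Bioformer-8L were supported by the Google TPU Research Cloud (TRC) program, the Intramural Research Program of the National Library of Medicine (NLM), National Institutes of Health (NIH), and NIH/NLM grants LM012895 and 1K99LM014024-01.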

❓ Questions

If you have any questions, please submit an issue here: https://github.com/WGLab/bioformer/issues

You can also send an email to Li Fang (fangli9@mail.sysu.edu.cn, https://fangli80.github.io/).

📚 Citation

You can cite our preprint on arXiv:

Fang L, Chen Q, Wei C-H, Lu Z, Wang K: Bioformer: an efficient transformer language model for biomedical text mining. arXiv preprint arXiv:2302.01588 (2023). DOI: https://doi.org/10.48550/arXiv.2302.01588

BibTeX format:

@ARTICLE{fangli2023bioformer,
    author = {{Fang}, Li and {Chen}, Qingyu and {Wei}, Chih-Hsuan and {Lu}, Zhiyong and {Wang}, Kai},
    title = "{Bioformer: an efficient transformer language model for biomedical text mining}",
    journal = {arXiv preprint arXiv:2302.01588},
    year = {2023}
}