bioformer-8L: an open-source biomedical text mining model that is lightweight and fast, with performance comparable to BioBERT


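Bioformer 8L

Developed by bioformers

A lightweight BERT model designed for biomedical text mining. It runs 3 times as fast as BERT-base and performs comparably to, or better than, BioBERT and PubMedBERT.

Large language model

Transformers

Language: English | License: Apache-2.0 | Tags: #biomedical-text-mining #lightweight-BERT #whole-word-masking

Downloads: 164

Released: 3/2/2022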

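Model overview

Bioformer-8L is a lightweight BERT model pretrained from scratch on biomedical corpora with a biomedical-specific vocabulary. It is suited to a wide range of biomedical text mining tasks.

Model features

Biomedical-specific

Pretrained entirely on biomedical text (PubMed abstracts and PMC full-text articles) with a biomedical-specific vocabulary.

Efficient and lightweight

42.8M parameters; runs 3 times as fast as BERT-base while maintaining high performance on downstream tasks.

Whole-word masking

Pretraining uses whole-word masking with a masking rate of 15%.

Domain vocabulary coverage

The vocabulary is trained on biomedical literature and contains 32768 tokens, including special symbols used in biomedical text.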
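The 42.8M parameter figure can be sanity-checked with a rough count for a BERT-style architecture. The sketch below assumes 8 layers, a hidden size of 512, and a feed-forward size of 2048. Only the layer count (from the model name) and the vocabulary size come from this page, so the other dimensions are assumptions, and `bert_param_count` is a hypothetical helper rather than a library function.

```python
def bert_param_count(vocab_size, hidden, layers, intermediate,
                     max_positions=512, type_vocab=2):
    """Count parameters of a BERT-style model, with and without pretraining heads."""
    layer_norm = 2 * hidden  # scale + bias
    # Word, position, and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> intermediate -> hidden), plus LayerNorm.
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden + layer_norm)
    pooler = hidden * hidden + hidden
    encoder = embeddings + layers * (attention + feed_forward) + pooler
    # MLM head: dense transform + LayerNorm + output bias (the decoder weights
    # are tied to the word embeddings). NSP head: a 2-way linear classifier.
    mlm_head = hidden * hidden + hidden + layer_norm + vocab_size
    nsp_head = 2 * hidden + 2
    return {"encoder": encoder, "with_heads": encoder + mlm_head + nsp_head}


# Assumed Bioformer-8L dimensions: hidden=512 and intermediate=2048 are assumptions.
bioformer_8l = bert_param_count(vocab_size=32768, hidden=512, layers=8, intermediate=2048)
# Standard BERT-base dimensions, for comparison.
bert_base = bert_param_count(vocab_size=30522, hidden=768, layers=12, intermediate=3072)

print(f"Bioformer-8L (assumed config): {bioformer_8l['with_heads'] / 1e6:.1f}M")
print(f"BERT-base encoder:             {bert_base['encoder'] / 1e6:.1f}M")
```

Under these assumptions the count comes to roughly 42.8M, consistent with the figure quoted above. About two fifths of that total sits in the 32768-entry word-embedding table.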

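Model capabilities

Biomedical text understanding

Masked language modeling

Biomedical named entity recognition

Biomedical text classification

Use cases

Biomedical research

Disease concept recognition

Identifies disease-related concepts in biomedical text.

In the fill-mask example, the model correctly predicts medical concepts such as 'Diabetes'.

Literature classification

Multi-label topic classification of biomedical literature.

Achieved the best performance in the BioCreative VII COVID-19 classification challenge.

Clinical text processing

Clinical note analysis

Extracts key medical information from clinical notes.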

🚀 Bioformer-8L

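Bioformer-8L is a lightweight BERT model for biomedical text mining. It uses a biomedical vocabulary and is pretrained from scratch on biomedical-domain corpora only. In our experiments, Bioformer-8L is 3 times as fast as BERT-base and achieves performance comparable to, or even better than, BioBERT and PubMedBERT on downstream NLP tasks.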

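🚀 Quick start

Bioformer-8L is used in the same way as a standard BERT model. Refer to the BERT documentation for details.

⚠️ Important note

bioformer-cased-v1.0 has been renamed to bioformer-8L. All links to bioformer-cased-v1.0, including Git operations, redirect automatically to bioformer-8L. To avoid confusion, we nevertheless recommend updating existing local clones to point to the new repository URL.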

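✨ Key features

Lightweight and efficient: 3 times as fast as BERT-base.
Domain-adapted: uses a biomedical vocabulary and is pretrained only on biomedical-domain corpora.
Strong performance: comparable to, or even better than, BioBERT and PubMedBERT on downstream NLP tasks.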

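📦 Installation

Prerequisites

python3, pytorch, transformers, and datasets

We have tested the following commands with Python v3.9.16, PyTorch v1.13.1+cu117, Datasets v2.9.0, and Transformers v4.26.

Installation steps

Install pytorch by following the official PyTorch installation instructions.
Install the transformers and datasets libraries: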

pip install transformers
pip install datasets
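After installation, it can help to confirm which versions are actually present and compare them with the tested versions listed above. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and `torch` is the distribution name under which PyTorch is published on PyPI.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(["torch", "transformers", "datasets"]).items():
        print(f"{name}: {ver or 'not installed'}")
```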

💻 Usage examples

Basic usage

from transformers import pipeline

# Fill-mask pipeline backed by Bioformer-8L
unmasker8L = pipeline('fill-mask', model='bioformers/bioformer-8L')
unmasker8L("[MASK] refers to a group of diseases that affect how the body uses blood sugar (glucose)")

# Same prompt with Bioformer-16L, for comparison
unmasker16L = pipeline('fill-mask', model='bioformers/bioformer-16L')
unmasker16L("[MASK] refers to a group of diseases that affect how the body uses blood sugar (glucose)")

Example output

Output of `bioformer-8L`

[{'score': 0.3207533359527588, 
'token': 13473, 
'token_str': 'Diabetes', 
'sequence': 'Diabetes refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}, 

{'score': 0.19234347343444824, 
'token': 17740, 
'token_str': 'Obesity', 
'sequence': 'Obesity refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}, 

{'score': 0.09200277179479599, 
'token': 10778, 
'token_str': 'T2DM', 
'sequence': 'T2DM refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}, 

{'score': 0.08494312316179276, 
'token': 2228, 
'token_str': 'It', 
'sequence': 'It refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}, 

{'score': 0.0412776917219162, 
'token': 22263, 
'token_str': 'Hypertension', 
'sequence': 'Hypertension refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}]

Output of `bioformer-16L`

[{'score': 0.7262957692146301,
'token': 13473,
'token_str': 'Diabetes',
'sequence': 'Diabetes refers to a group of diseases that affect how the body uses blood sugar ( glucose )'},

{'score': 0.124954953789711,
'token': 10778,
'token_str': 'T2DM',
'sequence': 'T2DM refers to a group of diseases that affect how the body uses blood sugar ( glucose )'},

{'score': 0.04062706232070923,
'token': 2228,
'token_str': 'It',
'sequence': 'It refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}, 

{'score': 0.022694870829582214,
'token': 17740,
'token_str': 'Obesity',
'sequence': 'Obesity refers to a group of diseases that affect how the body uses blood sugar ( glucose )'},

{'score': 0.009743048809468746,
'token': 13960,
'token_str': 'T2D',
'sequence': 'T2D refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}]
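The fill-mask pipeline returns a list of dictionaries, one per candidate token, so the two models' predictions are easy to compare programmatically. The sketch below works on the top three entries copied from the outputs above (with the `token` and `sequence` fields omitted for brevity); `summarize` and `top1_margin` are hypothetical helper names.

```python
def summarize(predictions, top_k=3):
    """Return (token_str, score) pairs for the top_k highest-scoring predictions."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], round(p["score"], 3)) for p in ranked[:top_k]]


def top1_margin(predictions):
    """Score gap between the best and second-best prediction."""
    scores = sorted((p["score"] for p in predictions), reverse=True)
    return scores[0] - scores[1]


# Top-3 entries copied from the example outputs above.
out_8l = [
    {"score": 0.3207533359527588, "token_str": "Diabetes"},
    {"score": 0.19234347343444824, "token_str": "Obesity"},
    {"score": 0.09200277179479599, "token_str": "T2DM"},
]
out_16l = [
    {"score": 0.7262957692146301, "token_str": "Diabetes"},
    {"score": 0.124954953789711, "token_str": "T2DM"},
    {"score": 0.04062706232070923, "token_str": "It"},
]

print(summarize(out_8l))
print(f"8L margin:  {top1_margin(out_8l):.3f}")
print(f"16L margin: {top1_margin(out_16l):.3f}")
```

Both models rank 'Diabetes' first; in this example the 16L model separates it from the runner-up by a much wider margin (about 0.60 versus about 0.13).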

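📚 Documentation

Vocabulary of Bioformer-8L

Bioformer-8L uses a cased WordPiece vocabulary trained on a biomedical corpus comprising all PubMed abstracts (33 million, as of Feb 1, 2021) and 1 million PMC full-text articles. PMC has 3.6 million articles, but we down-sampled them to 1 million so that the total size of the PubMed abstracts and the PMC full-text articles is approximately equal. To mitigate the out-of-vocabulary issue and to include special symbols that occur in biomedical literature (e.g. the male and female symbols), we trained Bioformer's vocabulary on the Unicode text of the two resources. The vocabulary size of Bioformer-8L is 32768 (2^15), similar to that of the original BERT.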
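To illustrate why a domain-specific WordPiece vocabulary matters, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece tokenizers apply to each word. The two vocabularies are tiny hypothetical sets made up for this example, not excerpts from any real vocabulary file, and `wordpiece` is an illustrative helper rather than the tokenizer shipped with the model.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is out of vocabulary
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabularies for illustration only.
general_vocab = {"T", "##2", "##D", "##M"}
biomedical_vocab = general_vocab | {"T2DM", "♂"}

print(wordpiece("T2DM", general_vocab))     # ['T', '##2', '##D', '##M']
print(wordpiece("T2DM", biomedical_vocab))  # ['T2DM']
print(wordpiece("♂", general_vocab))        # ['[UNK]']
print(wordpiece("♂", biomedical_vocab))     # ['♂']
```

With the domain term and the symbol in the vocabulary, each stays a single token instead of being fragmented or mapped to the unknown token.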

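Pretraining of Bioformer-8L

Bioformer-8L was pretrained from scratch on the same corpus as the vocabulary (33 million PubMed abstracts + 1 million PMC full-text articles). For the masked language modeling (MLM) objective, we used whole-word masking with a masking rate of 15%. Whether the next sentence prediction (NSP) objective improves performance on downstream tasks is debated. We included it in our pretraining experiment in case end users need next-sentence prediction. Sentence segmentation of all training text was performed using SciSpacy.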
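Whole-word masking can be sketched as follows: WordPiece tokens are grouped back into words, and masking is applied to all pieces of a selected word at once. This is a simplified illustration, not the actual pretraining code. It always substitutes `[MASK]` (BERT's original recipe also keeps some selected tokens unchanged or swaps in random tokens), it assumes the input contains no special tokens such as `[CLS]`, and `whole_word_mask` is a hypothetical helper name.

```python
import random


def whole_word_mask(tokens, mask_rate=0.15, mask_token="[MASK]", seed=None):
    """Mask whole words: every WordPiece of a selected word is replaced together."""
    rng = random.Random(seed)
    # Group token indices into words; a "##" prefix continues the previous word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    # Number of tokens allowed to be masked (at least one).
    budget = max(1, round(len(tokens) * mask_rate))
    rng.shuffle(words)
    masked = list(tokens)
    covered = 0
    for word in words:
        if covered + len(word) > budget:
            continue  # skip words that would exceed the masking budget
        for i in word:
            masked[i] = mask_token
        covered += len(word)
    return masked


tokens = ["hyper", "##gly", "##cemia", "is", "common", "in", "T2DM",
          "patients", "with", "obesity", "and", "hyper", "##tension", "today"]
print(whole_word_mask(tokens, seed=0))
```

With 14 tokens and a 15% rate the budget is 2 tokens, so either two single-token words or the two-piece word "hyper ##tension" is masked; a word is never masked partially.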

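Pretraining of Bioformer-8L was performed on a single Cloud TPU device (TPUv2, 8 cores, 8 GB memory per core). The maximum input sequence length was fixed to 512, and the batch size was set to 256. We pretrained Bioformer-8L for 2 million steps, which took about 8.3 days.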
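The figures above imply a training throughput that is easy to work out. The short calculation below uses only the numbers stated in this section; the token count is an upper bound, since it assumes every sequence fills all 512 positions.

```python
steps = 2_000_000
batch_size = 256
max_seq_len = 512
days = 8.3

sequences_seen = steps * batch_size              # sequences processed in total
max_tokens_seen = sequences_seen * max_seq_len   # upper bound on tokens processed
steps_per_second = steps / (days * 24 * 3600)
sequences_per_second = steps_per_second * batch_size

print(f"sequences processed: {sequences_seen:,}")
print(f"tokens processed (upper bound): {max_tokens_seen:,}")
print(f"throughput: {steps_per_second:.2f} steps/s, {sequences_per_second:.0f} sequences/s")
```

This works out to 512 million sequences in total, at about 2.8 steps (roughly 714 sequences) per second.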

🏆 Awards

Bioformer-8L achieved the best performance (highest micro-F1 score) in the BioCreative VII COVID-19 multi-label topic classification challenge (https://doi.org/10.1093/database/baac069).
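For readers unfamiliar with the metric, micro-F1 in a multi-label setting pools true positives, false positives, and false negatives across all documents and labels before computing precision and recall. The following is a minimal sketch; the label names and predictions are invented toy data, not challenge results, and `micro_f1` is a hypothetical helper.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label data given as lists of label sets."""
    tp = fp = fn = 0
    for gold, pred in zip(y_true, y_pred):
        tp += len(gold & pred)   # labels predicted and correct
        fp += len(pred - gold)   # labels predicted but wrong
        fn += len(gold - pred)   # labels missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: three documents, each with a set of topic labels (illustrative only).
gold = [{"Treatment", "Diagnosis"}, {"Prevention"}, {"Treatment"}]
pred = [{"Treatment"}, {"Prevention", "Diagnosis"}, {"Treatment"}]

print(micro_f1(gold, pred))  # 0.75
```

Here the pooled counts are 3 true positives, 1 false positive, and 1 false negative, so precision and recall are both 0.75.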

🔗 Related links

Bioformer-16L

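🙏 Acknowledgments

The training and evaluation of Bioformer-8L were supported by the Google TPU Research Cloud (TRC) program, the Intramural Research Program of the National Library of Medicine (NLM), National Institutes of Health (NIH), and NIH/NLM grants LM012895 and 1K99LM014024-01.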

❓ FAQ

If you have any questions, please open an issue here: https://github.com/WGLab/bioformer/issues

You can also email Li Fang (fangli9@mail.sysu.edu.cn, https://fangli80.github.io/).

📚 Citation

You can cite our preprint on arXiv:

Fang L, Chen Q, Wei C-H, Lu Z, Wang K: Bioformer: an efficient transformer language model for biomedical text mining. arXiv preprint arXiv:2302.01588 (2023). DOI: https://doi.org/10.48550/arXiv.2302.01588

BibTeX format:

@ARTICLE{fangli2023bioformer,
    author = {{Fang}, Li and {Chen}, Qingyu and {Wei}, Chih-Hsuan and {Lu}, Zhiyong and {Wang}, Kai},
    title = "{Bioformer: an efficient transformer language model for biomedical text mining}",
    journal = {arXiv preprint arXiv:2302.01588},
    year = {2023}
}