base-7b-v0.2: an open-source medical language model trained with physicians that scores above the passing threshold on medical QA

Base 7b V0.2

Developed by internistai

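A large medical-domain language model trained with physician involvement, and the first 7B model to score above the 60% passing threshold on MedQA (USMLE)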

Large language model

Transformers

English · License: Apache-2.0 · #MedicalQA #USMLE #ClinicalDecisionSupport

Release date: 4/21/2024

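Model overview

A language model designed for the medical domain. It combines high-quality medical literature with general-domain data so that it keeps its capabilities in other domains, and it is intended mainly for clinical decision support and documentation assistance.

Model features

Physician-in-the-loop training

Training data was carefully curated by physicians to ensure clinical relevance and quality.

Medical-domain optimization

Scores above the 60% passing threshold on MedQA (USMLE), outperforming comparable 7B models.

Cross-domain capability

Combines general-domain and specialized medical data to retain capabilities across domains.

Context length

Supports a context length of 4,096 tokens for processing medical documents.

Capabilities

Medical text generation

Clinical decision support

Medical question answering

Medical documentation assistance

Use cases

Clinical support

Disease characterization

Generates descriptions of the anatomical features and clinical presentation of specific diseases.

Rated comparably to GPT-4 in physician evaluations of description quality.

Diagnostic assistance

Suggests possible diagnoses based on symptoms.

Reaches 60.5% accuracy on MedQA.

Medical education

USMLE exam preparation

Helps medical students prepare for USMLE-style questions.

Scores above the passing threshold on MedQA.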

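🚀 Internist.ai 7b Model Card

Internist.ai 7b is a medical-domain large language model trained by medical professionals. It demonstrates the benefits of a "physician-in-the-loop" approach. The training data was carefully curated by medical professionals to ensure clinical relevance and the quality required for clinical practice.

This 7-billion-parameter model is the first 7B model to score above the 60% passing threshold on MedQA (USMLE). It also outperforms models of similar size on most medical evaluations.

This model is a proof of concept, and we plan to train larger models on a larger corpus of medical literature. If you would like to sponsor compute to speed up training, feel free to contact us.

Advisory notice

This model was designed by medical professionals for medical professionals. It has not been specifically trained to address potential safety issues when used by non-professionals. We strongly recommend against using it in real-world settings until it has been thoroughly evaluated in prospective clinical trials. It also needs additional training to reach the required level of safety.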

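✨ Key Features

- Physician-in-the-loop training: medical professionals carefully selected the training data to ensure clinical relevance and quality.
- Strong medical evaluation results: the first 7B model to score above the 60% passing threshold on MedQA (USMLE), and better than other models of the same size on most medical evaluations.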

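💻 Usage Examples

Basic usage

The model uses the Alpaca prompt format, which is available through the tokenizer's chat template. Example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internistai/base-7b-v0.2"
device = "cuda" if torch.cuda.is_available() else "cpu"  # device to run the model on

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load in bfloat16 (the training precision) to halve memory use compared with float32
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

messages = [
    {"role": "user", "content": "Describe the anatomy of nutcracker syndrome"},
]

# The chat template renders the messages in Alpaca format and appends the response prompt
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(device)

generated_ids = model.generate(input_ids, max_new_tokens=1000, do_sample=True)
# Decode only the newly generated tokens, dropping the prompt and special tokens
decoded = tokenizer.decode(generated_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(decoded)
```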
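If your serving stack takes raw strings instead of chat messages, you can build the prompt yourself. The sketch below assembles a single-turn prompt in the widely used Stanford Alpaca layout. This is an assumption: the card only says the model uses "the Alpaca format". Compare the output with `tokenizer.apply_chat_template(messages, tokenize=False)` before relying on it. The helper name `build_alpaca_prompt` is hypothetical.

```python
# Hypothetical helper: renders a single-turn prompt in the common Stanford Alpaca layout.
# Assumption: the internistai/base-7b-v0.2 chat template may use different wording or
# whitespace; compare with tokenizer.apply_chat_template(messages, tokenize=False).

PREAMBLE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)
PREAMBLE_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request."
)


def build_alpaca_prompt(instruction: str, context: str = "") -> str:
    """Return an Alpaca-style prompt that ends where the model's answer should begin."""
    if context:
        # Alpaca adds an "### Input:" section only when extra context is supplied
        sections = [PREAMBLE_WITH_INPUT, f"### Instruction:\n{instruction}", f"### Input:\n{context}"]
    else:
        sections = [PREAMBLE, f"### Instruction:\n{instruction}"]
    sections.append("### Response:")
    return "\n\n".join(sections)


if __name__ == "__main__":
    print(build_alpaca_prompt("Describe the anatomy of nutcracker syndrome"))
```

The resulting string can be tokenized with `tokenizer(prompt, return_tensors="pt")` and passed to `model.generate` in the same way as the chat-template IDs.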

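📚 Documentation

Model details

| Attribute | Details |
|---|---|
| Developed by | UCLouvain and Cliniques Universitaires Saint-Luc |
| Language(s) | Primarily English |
| Model license | Apache 2.0 |
| Code license | Apache 2.0 |
| Base model | Mistral-7B-v0.1 |
| Context length | 4096 tokens |
| Knowledge cutoff | October 2023 |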
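The 4,096-token context length is shared between the prompt and the generated tokens. A long clinical document must therefore leave room for the prompt template and for `max_new_tokens`. Below is a minimal sketch, assuming you already have the document's token IDs (for example from `tokenizer(text)["input_ids"]`). The helper names and the keep-the-beginning truncation policy are illustrative choices, not part of the model card.

```python
CONTEXT_LENGTH = 4096  # context length from the model details table


def document_token_budget(max_new_tokens: int, template_tokens: int,
                          context_length: int = CONTEXT_LENGTH) -> int:
    """Tokens left for the document once the template and generation budget are reserved."""
    budget = context_length - max_new_tokens - template_tokens
    if budget <= 0:
        raise ValueError("template and max_new_tokens leave no room for the document")
    return budget


def truncate_document(doc_ids: list[int], max_new_tokens: int, template_tokens: int,
                      context_length: int = CONTEXT_LENGTH) -> list[int]:
    """Keep the beginning of the document so that prompt plus output fits the window."""
    return doc_ids[:document_token_budget(max_new_tokens, template_tokens, context_length)]
```

Here is an example. With `max_new_tokens=1000` as in the usage example and a template overhead of 60 tokens, `document_token_budget(1000, 60)` returns 3036. The 60-token figure is made up; measure yours by tokenizing the rendered prompt with an empty document.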

Model sources

- Trainer: Axolotl
- Paper: Impact of High-Quality, Mixed-Domain Data on the Performance of Medical Language Models

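Uses

This model was trained to demonstrate the benefit of combining high-quality, relevant medical literature with general-domain data so that it retains its capabilities in other domains. It was therefore not trained for any specific use, and it has not undergone additional instruction tuning to ensure safety.

In its current state, the model can serve as an assistant to medical professionals for clinical decision support or documentation. We do not recommend use by non-professionals, who may be unable to detect errors in the model's output.

We recommend task-specific training and safety evaluation before using the model in any real-world application.

Out-of-scope use

We do not recommend using this model for natural language generation in a production environment, with or without fine-tuning.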

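Professional evaluation

We created a free-response evaluation dataset of 100 questions and prompted both this model and GPT-4 with them. We then collected the prompt/answer pairs and presented them to 10 medical professionals from different specialties. They rated the answers on a 7-point Likert scale (see the paper for details).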
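The card does not say how the ratings were aggregated; the paper is the authority on that. As an illustration only, here is one common way to summarize paired 7-point Likert ratings. It gives the mean and median per model, plus the share of paired items (same question, same rater) where one model was rated higher. The helper names are hypothetical, and the code assumes that higher scores are better.

```python
from statistics import mean, median

LIKERT_MIN, LIKERT_MAX = 1, 7  # 7-point scale; assumption: higher = better


def summarize_ratings(ratings: list[int]) -> dict:
    """Count, mean and median of one model's Likert ratings."""
    if not ratings:
        raise ValueError("no ratings given")
    if any(not LIKERT_MIN <= r <= LIKERT_MAX for r in ratings):
        raise ValueError("ratings must lie on the 1-7 scale")
    return {"n": len(ratings), "mean": mean(ratings), "median": median(ratings)}


def paired_preference(ratings_a: list[int], ratings_b: list[int]) -> dict:
    """Share of paired items where model A scored higher than, tied with, or lower than B."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and paired one-to-one")
    n = len(ratings_a)
    higher = sum(a > b for a, b in zip(ratings_a, ratings_b))
    ties = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return {"a_higher": higher / n, "tie": ties / n, "b_higher": (n - higher - ties) / n}
```

Reporting medians alongside means is a deliberate choice, because Likert data are ordinal and a mean alone can hide how the ratings are spread.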

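🔧 Technical Details

Training details

Training data

The Internist.ai 7b training data totals 2.3 billion tokens:

- General domain: OpenOrca-GPT4, a state-of-the-art general-domain dataset generated with GPT-4 from Flan prompts.
- Medical guidelines: 11,332 articles from UpToDate, plus domain-specific guidelines provided by physicians to cover the [USMLE Content Outline](https://www.usmle.org/sites/default/files/2021-08/USMLE_Content_Outline.pdf).
- Medical books: 10,376 textbooks sourced from PMC LitArch and our university library.
- Synthetic data: 400 million tokens, generated by prompting a larger model with instructions to transform and adapt extracts from the medical guidelines.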

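Data availability: the datasets contain proprietary information, so we will not release them publicly. As for the synthetic dataset, we show in the paper that a model trained on it alone performs very poorly and does not meet our standards. Because of its poor quality, we decided not to release it.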

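Training procedure

We trained with Axolotl on a server with 4 NVIDIA A100 80GB GPUs, for a total of 450 GPU hours. We used FlashAttention, NEFTune and sample packing, with the following parameters:

| Parameter | Value |
|---|---|
| bf16 | true |
| lr | 6e-6 |
| eps | 1e-5 |
| epochs | 4 |
| betas | [0.9, 0.95] |
| weight decay | 0.1 |
| batch size | 192,000 tokens |
| sequence length | 4096 |
| lr scheduler | cosine |
| min lr | 1e-8 |
| NEFT alpha | 5 |
| warmup iterations | 100 |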
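The sketch below is a rough illustration of how several of these hyperparameters interact. It does three things:

1. Estimates the number of optimizer steps from the 2.3B-token dataset, 4 epochs and the 192,000-token batch. This assumes every packed batch is completely full.
2. Evaluates a learning-rate curve with linear warmup over 100 steps, then cosine decay from 6e-6 down to 1e-8. The warmup follows the `step / warmup_steps` convention of Hugging Face's `get_cosine_schedule_with_warmup`, with a floor at the minimum learning rate added.
3. Draws NEFTune-style embedding noise with alpha = 5, following the NEFTune paper. The noise is uniform in [-1, 1], scaled by alpha / sqrt(L·d) for sequence length L and hidden size d.

Axolotl's actual scheduler and step count may differ, and the function names are hypothetical.

```python
import math
import random

TOTAL_TOKENS = 2_300_000_000  # training data size (2.3B tokens)
EPOCHS = 4
BATCH_TOKENS = 192_000        # batch size in tokens
PEAK_LR, MIN_LR = 6e-6, 1e-8
WARMUP_STEPS = 100
NEFT_ALPHA = 5.0


def estimated_total_steps(total_tokens=TOTAL_TOKENS, epochs=EPOCHS, batch_tokens=BATCH_TOKENS):
    """Rough optimizer-step count, assuming every packed batch is completely full."""
    return math.ceil(total_tokens * epochs / batch_tokens)


def lr_at(step, total_steps, peak_lr=PEAK_LR, min_lr=MIN_LR, warmup_steps=WARMUP_STEPS):
    """Linear warmup from 0 to peak_lr, then cosine decay from peak_lr to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def neftune_noise(seq_len, hidden_size, alpha=NEFT_ALPHA, rng=random):
    """Uniform noise (seq_len x hidden_size) scaled by alpha / sqrt(L * d), added to embeddings."""
    scale = alpha / math.sqrt(seq_len * hidden_size)
    return [[rng.uniform(-1.0, 1.0) * scale for _ in range(hidden_size)]
            for _ in range(seq_len)]


if __name__ == "__main__":
    total = estimated_total_steps()
    for step in (0, 50, 100, total // 2, total):
        print(step, lr_at(step, total))
```

Under these assumptions the run takes roughly 47,917 steps. Mistral-7B has a hidden size of 4,096, so for a full 4,096-token packed sequence the NEFTune noise magnitude is bounded by 5 / 4096 ≈ 0.00122 per embedding coordinate.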

Evaluation

Testing data and metrics

Testing data

Metrics

Accuracy: we ran standardized zero-shot benchmarks with [lm-evaluation-harness](https://github.com/maximegmd/lm-evaluation-harness/tree/big-refactor/lm_eval).
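For multiple-choice tasks, lm-evaluation-harness scores zero-shot by computing the model's log-likelihood of each answer option given the question. An item counts as correct when the highest-scoring option is the gold answer. Below is a minimal sketch of that accuracy computation, assuming you already have per-option log-likelihoods. The helper names are hypothetical, and the harness also reports a length-normalized variant (`acc_norm`) that is not shown here.

```python
def predict(option_logliks: list[float]) -> int:
    """Index of the answer option with the highest log-likelihood (first one on ties)."""
    return max(range(len(option_logliks)), key=lambda i: option_logliks[i])


def accuracy(items) -> float:
    """items: iterable of (option_logliks, gold_index) pairs; returns accuracy in percent."""
    items = list(items)
    if not items:
        raise ValueError("no items to score")
    correct = sum(predict(lls) == gold for lls, gold in items)
    return 100.0 * correct / len(items)
```

The result is expressed in percent, on the same scale as the results table.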

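Results

| Benchmark (accuracy, %) | Internist.ai 7b | PMC LLaMA 7b* | Mistral 7b | Meditron 7b** |
|---|---|---|---|---|
| MedQA | 60.5 | 27.7 (44.7) | 48.7 | 52.0 |
| MedMCQA | 55.8 | 32.2 (51.4) | 45.7 | 59.2 |
| PubMedQA | 79.4 | 67.8 (74.6) | 75.8 | 74.4 |
| MMLU Professional Medicine | 76.1 | 19.5 | 65.8 | 26.6 |
| MMLU Clinical Knowledge | 70.6 | 23.8 | 61.1 | 35.5 |
| MMLU Anatomy | 65.9 | 18.5 | 52.6 | 42.6 |
| MMLU College Medicine | 63.0 | 23.7 | 55.5 | 28.9 |
| MMLU Medical Genetics | 71.0 | 32.0 | 68.0 | 46.0 |

*: PMC LLaMA 7b performed poorly on the benchmarks, likely because of a format mismatch and a lack of instruction tuning. Values in parentheses are the results reported by its authors, where available.

**: Meditron 7b's MMLU results are reported for transparency, but they are inconsistent with the average of 54.2 reported in its paper. Please let us know the per-category details so we can update the table.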
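Footnote ** can be made concrete with the table's own numbers. The sketch computes an unweighted mean over the five MMLU medical subsets listed above. For Meditron 7b this gives about 35.9, well below the 54.2 average it mentions. Unweighted averaging is our assumption. A mean weighted by question count, or a different set of subsets, would give a different figure, so this only illustrates the size of the discrepancy.

```python
# Scores copied from the results table (MMLU medical subsets only), in the order:
# Professional Medicine, Clinical Knowledge, Anatomy, College Medicine, Medical Genetics
MMLU_MEDICAL = {
    "Internist.ai 7b": [76.1, 70.6, 65.9, 63.0, 71.0],
    "PMC LLaMA 7b": [19.5, 23.8, 18.5, 23.7, 32.0],
    "Mistral 7b": [65.8, 61.1, 52.6, 55.5, 68.0],
    "Meditron 7b": [26.6, 35.5, 42.6, 28.9, 46.0],
}


def unweighted_mean(scores: list[float]) -> float:
    """Simple average of subset accuracies, rounded to one decimal like the table."""
    return round(sum(scores) / len(scores), 1)


if __name__ == "__main__":
    for model, scores in MMLU_MEDICAL.items():
        print(f"{model}: {unweighted_mean(scores)}")
```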

📄 License

This model is released under the Apache 2.0 license.

📚 Citation

BibTeX: if you use Internist.ai 7b, please cite us:

```bibtex
@article{10.1093/jamia/ocae120,
    author = {Griot, Maxime and Hemptinne, Coralie and Vanderdonckt, Jean and Yuksel, Demet},
    title = "{Impact of high-quality, mixed-domain data on the performance of medical language models}",
    journal = {Journal of the American Medical Informatics Association},
    volume = {31},
    number = {9},
    pages = {1875-1883},
    year = {2024},
    month = {05},
    abstract = "{To optimize the training strategy of large language models for medical applications, focusing on creating clinically relevant systems that efficiently integrate into healthcare settings, while ensuring high standards of accuracy and reliability. We curated a comprehensive collection of high-quality, domain-specific data and used it to train several models, each with different subsets of this data. These models were rigorously evaluated against standard medical benchmarks, such as the USMLE, to measure their performance. Furthermore, for a thorough effectiveness assessment, they were compared with other state-of-the-art medical models of comparable size. The models trained with a mix of high-quality, domain-specific, and general data showed superior performance over those trained on larger, less clinically relevant datasets (P $<$ .001). Our 7-billion-parameter model Med5 scores 60.5\% on MedQA, outperforming the previous best of 49.3\% from comparable models, and becomes the first of its size to achieve a passing score on the USMLE. Additionally, this model retained its proficiency in general domain tasks, comparable to state-of-the-art general domain models of similar size. Our findings underscore the importance of integrating high-quality, domain-specific data in training large language models for medical purposes. The balanced approach between specialized and general data significantly enhances the model’s clinical relevance and performance. This study sets a new standard in medical language models, proving that a strategically trained, smaller model can outperform larger ones in clinical relevance and general proficiency, highlighting the importance of data quality and expert curation in generative artificial intelligence for healthcare applications.}",
    issn = {1527-974X},
    doi = {10.1093/jamia/ocae120},
    url = {https://doi.org/10.1093/jamia/ocae120},
    eprint = {https://academic.oup.com/jamia/article-pdf/31/9/1875/58868289/ocae120.pdf},
}
```