🚀 BioBERTpt - Portuguese Clinical and Biomedical BERT
BioBERTpt is a neural language model for clinical named entity recognition in Portuguese. Built on the BERT architecture, it was initialized from the multilingual BERT model and trained on clinical notes and biomedical literature, making it well suited to Portuguese clinical and biomedical text.
🚀 Quick Start
Model Loading
You can load BioBERTpt with the transformers library:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("pucpr/biobertpt-all")
model = AutoModel.from_pretrained("pucpr/biobertpt-all")
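Once loaded, the model can be used to produce contextual embeddings for Portuguese clinical text. The sketch below extends the loading snippet above; the example sentence is illustrative, and the hidden size of 768 is assumed from the BERT-Multilingual-Cased base the model was initialized from.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("pucpr/biobertpt-all")
model = AutoModel.from_pretrained("pucpr/biobertpt-all")

# Example Portuguese clinical sentence (illustrative only)
text = "O paciente apresenta dor de cabeça e febre."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per subword token: (batch, seq_len, hidden_size)
emb = outputs.last_hidden_state
print(emb.shape)
```

These token-level embeddings are what a downstream NER head consumes; for sentence-level features, the embedding of the [CLS] token (`emb[:, 0]`) is a common choice.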
📚 Documentation
Model Description
The paper BioBERTpt - A Portuguese Neural Language Model for Clinical Named Entity Recognition presents BERT-based clinical and biomedical models for Portuguese. The models were initialized from the multilingual BERT model (BERT-Multilingual-Cased) and trained on clinical notes and biomedical literature.
This model card describes BioBERTpt(all), the full version trained on both Portuguese clinical narratives and biomedical literature.
More Information
For more details on BioBERTpt and its performance on Portuguese named entity recognition (NER) tasks, see the original paper, BioBERTpt - A Portuguese Neural Language Model for Clinical Named Entity Recognition.
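For NER, the base model is typically fine-tuned with a token-classification head. The sketch below shows the general setup with the transformers library; the label set is hypothetical, and the classification head is randomly initialized here, so real predictions require fine-tuning on an annotated corpus first.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label scheme for illustration; a real setup would use the
# labels of the annotated clinical corpus used for fine-tuning.
labels = ["O", "B-Disease", "I-Disease"]

tokenizer = AutoTokenizer.from_pretrained("pucpr/biobertpt-all")
# Attaches an untrained token-classification head on top of BioBERTpt.
model = AutoModelForTokenClassification.from_pretrained(
    "pucpr/biobertpt-all", num_labels=len(labels)
)

sentence = "Paciente com suspeita de pneumonia."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (batch, seq_len, num_labels)

# Per-token label predictions (meaningless until the head is fine-tuned)
predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(token, labels[pred])
```

Fine-tuning this head on annotated clinical narratives is the setting evaluated in the paper's NER experiments.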
🙏 Acknowledgements
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.
📄 License
Citation
If you use BioBERTpt, please cite it as follows:
@inproceedings{schneider-etal-2020-biobertpt,
title = "{B}io{BERT}pt - A {P}ortuguese Neural Language Model for Clinical Named Entity Recognition",
author = "Schneider, Elisa Terumi Rubel and
de Souza, Jo{\~a}o Vitor Andrioli and
Knafou, Julien and
Oliveira, Lucas Emanuel Silva e and
Copara, Jenny and
Gumiel, Yohan Bonescki and
Oliveira, Lucas Ferro Antunes de and
Paraiso, Emerson Cabrera and
Teodoro, Douglas and
Barra, Cl{\'a}udia Maria Cabral Moro",
booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.7",
pages = "65--72",
abstract = "With the growing number of electronic health record data, clinical NLP tasks have become increasingly relevant to unlock valuable information from unstructured clinical text. Although the performance of downstream NLP tasks, such as named-entity recognition (NER), in English corpus has recently improved by contextualised language models, less research is available for clinical texts in low resource languages. Our goal is to assess a deep contextual embedding model for Portuguese, so called BioBERTpt, to support clinical and biomedical NER. We transfer learned information encoded in a multilingual-BERT model to a corpora of clinical narratives and biomedical-scientific papers in Brazilian Portuguese. To evaluate the performance of BioBERTpt, we ran NER experiments on two annotated corpora containing clinical narratives and compared the results with existing BERT models. Our in-domain model outperformed the baseline model in F1-score by 2.72{\%}, achieving higher performance in 11 out of 13 assessed entities. We demonstrate that enriching contextual embedding models with domain literature can play an important role in improving performance for specific NLP tasks. The transfer learning process enhanced the Portuguese biomedical NER model by reducing the necessity of labeled data and the demand for retraining a whole new model.",
}
❓ FAQ
If you have any questions, please open a GitHub issue in the BioBERTpt repository.