🚀 BioBERTpt - Portuguese Clinical and Biomedical BERT
BioBERTpt is a BERT-based model for clinical and biomedical Portuguese. It was initialized from BERT-Multilingual-Cased and trained on clinical notes and biomedical literature. This page focuses on BioBERTpt(clin), the clinical version of BioBERTpt, which was trained on clinical narratives from electronic health records of Brazilian hospitals.
🚀 Quick Start
How to use the model
The model can be loaded with the `transformers` library:
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("pucpr/biobertpt-clin")
model = AutoModel.from_pretrained("pucpr/biobertpt-clin")
```
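As a quick sanity check, the sketch below runs a Portuguese clinical sentence through the loaded encoder and inspects the resulting contextual embeddings. The example sentence is illustrative only, not taken from the model's training data:

```python
import torch

# Illustrative Portuguese clinical sentence (not from the training data)
text = "Paciente com dor abdominal e febre há três dias."

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per subword token;
# shape is (1, sequence_length, 768) for a BERT-base-sized encoder
embeddings = outputs.last_hidden_state
print(embeddings.shape)
```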
📚 Documentation
For more details and the model's performance on Portuguese named entity recognition (NER) tasks, see the original paper, BioBERTpt - A Portuguese Neural Language Model for Clinical Named Entity Recognition.
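Since BioBERTpt(clin) is distributed as a base encoder, applying it to NER requires adding a token-classification head and fine-tuning it on labeled data. The following is a minimal sketch of such a setup; the label set shown is hypothetical, as the actual entity types come from the annotated corpora used in the paper:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label set for illustration only
labels = ["O", "B-Problem", "I-Problem", "B-Treatment", "I-Treatment"]

tokenizer = AutoTokenizer.from_pretrained("pucpr/biobertpt-clin")
model = AutoModelForTokenClassification.from_pretrained(
    "pucpr/biobertpt-clin",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# The classification head is randomly initialized here and must be
# fine-tuned on labeled NER data (e.g., with the Trainer API) before use.
```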
👏 Acknowledgements
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.
📄 License
Citation
If you use this model, please cite it as follows:
```bibtex
@inproceedings{schneider-etal-2020-biobertpt,
title = "{B}io{BERT}pt - A {P}ortuguese Neural Language Model for Clinical Named Entity Recognition",
author = "Schneider, Elisa Terumi Rubel and
de Souza, Jo{\~a}o Vitor Andrioli and
Knafou, Julien and
Oliveira, Lucas Emanuel Silva e and
Copara, Jenny and
Gumiel, Yohan Bonescki and
Oliveira, Lucas Ferro Antunes de and
Paraiso, Emerson Cabrera and
Teodoro, Douglas and
Barra, Cl{\'a}udia Maria Cabral Moro",
booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.7",
pages = "65--72",
abstract = "With the growing number of electronic health record data, clinical NLP tasks have become increasingly relevant to unlock valuable information from unstructured clinical text. Although the performance of downstream NLP tasks, such as named-entity recognition (NER), in English corpus has recently improved by contextualised language models, less research is available for clinical texts in low resource languages. Our goal is to assess a deep contextual embedding model for Portuguese, so called BioBERTpt, to support clinical and biomedical NER. We transfer learned information encoded in a multilingual-BERT model to a corpora of clinical narratives and biomedical-scientific papers in Brazilian Portuguese. To evaluate the performance of BioBERTpt, we ran NER experiments on two annotated corpora containing clinical narratives and compared the results with existing BERT models. Our in-domain model outperformed the baseline model in F1-score by 2.72{\%}, achieving higher performance in 11 out of 13 assessed entities. We demonstrate that enriching contextual embedding models with domain literature can play an important role in improving performance for specific NLP tasks. The transfer learning process enhanced the Portuguese biomedical NER model by reducing the necessity of labeled data and the demand for retraining a whole new model.",
}
```
❓ Questions?
If you have any questions, please open a GitHub issue in the BioBERTpt repository.