EpiExtract4GARD-v2: an open-source model for identifying epidemiological information in rare disease abstracts

EpiExtract4GARD-v2

Developed by ncats

A named entity recognition model fine-tuned from BioBERT that identifies epidemiological information in rare disease abstracts.

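Task: sequence labeling (token classification)

Library: Transformers

Language: English. License: other. Tags: rare disease epidemiology, biomedical entity recognition, BioBERT fine-tuning

Downloads: 34

Released: 3/2/2022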

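Model overview

The model identifies locations (LOC), epidemiologic types (EPI), and epidemiologic rates (STAT) in text, and is aimed specifically at extracting epidemiological data about rare diseases.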

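Model features

Epidemiological information extraction

Optimized for epidemiological data in the rare disease domain; it identifies key indicators such as incidence and prevalence.

Weakly supervised learning

Trained with a weakly supervised teaching method, which suits settings where labeled data is limited.

Multi-entity recognition

Recognizes three entity types at once: locations, epidemiologic types, and epidemiologic rates.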

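Model capabilities

Identifying epidemiologic types

Extracting epidemiologic rates

Locating the relevant geographic places

Processing rare disease text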

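Use cases

Medical research

Rare disease epidemiology research

Extract rare disease incidence, prevalence, and similar data from the medical literature

Automatically identifies epidemiological data such as "4.05 per 100,000 live births"

Disease surveillance

Track the occurrence of a specific disease in a specific region

Identifies case information such as "27 patients have been diagnosed with PKU in Iceland"

Public health

Disease burden assessment

Assess the disease burden in different regions

Supports comparing incidence across regions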
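
Comparing rates across regions needs a common denominator, because extracted rates appear in different forms, such as "4.05 per 100,000 live births" or "1:50,000". The following is a minimal sketch of one way to normalize such strings to cases per 100,000. The function name `rate_per_100k` and the regular expression are illustrative assumptions, not part of EpiExtract4GARD, and the sketch only handles rates written as a number, a separator (per, in, out of, / or :), and a second number.

```python
import re

# A number with optional thousands separators and decimals, e.g. "100,000" or "4.05".
_NUM = r"(\d[\d,]*(?:\.\d+)?)"
# Hypothetical pattern: <cases> <separator> <population>.
_RATE = re.compile(_NUM + r"\s*(?:per|in|out of|/|:)\s*" + _NUM, re.IGNORECASE)


def rate_per_100k(stat_text):
    """Return cases per 100,000 for the first rate found in stat_text, or None."""
    match = _RATE.search(stat_text)
    if match is None:
        return None
    cases, population = (float(g.replace(",", "")) for g in match.groups())
    if population == 0:
        return None
    return cases / population * 100_000


examples = ["4.05 per 100,000 live births", "1:50,000", "1 per 8,400"]
for text in examples:
    print(f"{text!r}: {rate_per_100k(text):.2f} per 100,000")
```

Percentages and rates spelled out in words are not handled; real abstracts would need a broader parser.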

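🚀 EpiExtract4GARD-v2

EpiExtract4GARD-v2 is a fine-tuned model for named entity recognition of locations (LOC), epidemiologic types (EPI), and epidemiologic rates (STAT). It extracts epidemiological information from rare disease abstracts to support downstream research and analysis.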

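🚀 Quick start

You can try the model through the hosted inference API with this test sentence: "Since 1947, 27 patients have been diagnosed with phenylketonuria (PKU) in Iceland. The incidence for 1972-2008 was 1 per 8,400 live births."

The following example runs named entity recognition (NER) with the Transformers pipeline: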

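from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Model identifier as given in the source (note that it carries no "-v2" suffix).
model = AutoModelForTokenClassification.from_pretrained("ncats/EpiExtract4GARD")
tokenizer = AutoTokenizer.from_pretrained("ncats/EpiExtract4GARD")
# aggregation_strategy='simple' groups adjacent tokens of the same entity type into one span.
NER_pipeline = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy='simple')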

sample = "The live-birth prevalence of mucopolysaccharidoses in Estonia. Previous studies on the prevalence of mucopolysaccharidoses (MPS) in different populations have shown considerable variations. There are, however, few data with regard to the prevalence of MPSs in Fenno-Ugric populations or in north-eastern Europe, except for a report about Scandinavian countries. A retrospective epidemiological study of MPSs in Estonia was undertaken, and live-birth prevalence of MPS patients born between 1985 and 2006 was estimated. The live-birth prevalence for all MPS subtypes was found to be 4.05 per 100,000 live births, which is consistent with most other European studies. MPS II had the highest calculated incidence, with 2.16 per 100,000 live births (4.2 per 100,000 male live births), forming 53% of all diagnosed MPS cases, and was twice as high as in other studied European populations. The second most common subtype was MPS IIIA, with a live-birth prevalence of 1.62 in 100,000 live births. With 0.27 out of 100,000 live births, MPS VI had the third-highest live-birth prevalence. No cases of MPS I were diagnosed in Estonia, making the prevalence of MPS I in Estonia much lower than in other European populations. MPSs are the third most frequent inborn error of metabolism in Estonia after phenylketonuria and galactosemia."
sample2 = "Early Diagnosis of Classic Homocystinuria in Kuwait through Newborn Screening: A 6-Year Experience. Kuwait is a small Arabian Gulf country with a high rate of consanguinity and where a national newborn screening program was expanded in October 2014 to include a wide range of endocrine and metabolic disorders. A retrospective study conducted between January 2015 and December 2020 revealed a total of 304,086 newborns have been screened in Kuwait. Six newborns were diagnosed with classic homocystinuria with an incidence of 1:50,000, which is not as high as in Qatar but higher than the global incidence. Molecular testing for five of them has revealed three previously reported pathogenic variants in the <i>CBS</i> gene, c.969G>A, p.(Trp323Ter); c.982G>A, p.(Asp328Asn); and the Qatari founder variant c.1006C>T, p.(Arg336Cys). This is the first study to review the screening of newborns in Kuwait for classic homocystinuria, starting with the detection of elevated blood methionine and providing a follow-up strategy for positive results, including plasma total homocysteine and amino acid analyses. Further, we have demonstrated an increase in the specificity of the current newborn screening test for classic homocystinuria by including the methionine to phenylalanine ratio along with the elevated methionine blood levels in first-tier testing. Here, we provide evidence that the newborn screening in Kuwait has led to the early detection of classic homocystinuria cases and enabled the affected individuals to lead active and productive lives."
# Sample 1 is from: Krabbi K, Joost K, Zordania R, Talvik I, Rein R, Huijmans JG, Verheijen FV, Õunap K. The live-birth prevalence of mucopolysaccharidoses in Estonia. Genet Test Mol Biomarkers. 2012 Aug;16(8):846-9. doi: 10.1089/gtmb.2011.0307. Epub 2012 Apr 5. PMID: 22480138; PMCID: PMC3422553.
# Sample 2 is from: Alsharhan H, Ahmed AA, Ali NM, Alahmad A, Albash B, Elshafie RM, Alkanderi S, Elkazzaz UM, Cyril PX, Abdelrahman RM, Elmonairy AA, Ibrahim SM, Elfeky YME, Sadik DI, Al-Enezi SD, Salloum AM, Girish Y, Al-Ali M, Ramadan DG, Alsafi R, Al-Rushood M, Bastaki L. Early Diagnosis of Classic Homocystinuria in Kuwait through Newborn Screening: A 6-Year Experience. Int J Neonatal Screen. 2021 Aug 17;7(3):56. doi: 10.3390/ijns7030056. PMID: 34449519; PMCID: PMC8395821.

NER_pipeline(sample)
NER_pipeline(sample2)
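
With aggregation_strategy='simple', the pipeline returns a list of dictionaries, each carrying entity_group, score, word, start, and end keys. The sketch below shows one way to group that output by entity type. The names `group_entities`, `min_score`, and `mock_results` are illustrative assumptions, and `mock_results` is hand-written stand-in data, not real model predictions.

```python
from collections import defaultdict


def group_entities(results, min_score=0.0):
    """Group aggregated NER results by entity type, in document order."""
    grouped = defaultdict(list)
    for ent in sorted(results, key=lambda e: e["start"]):
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Hand-written stand-in for pipeline output (NOT real model predictions).
mock_results = [
    {"entity_group": "EPI", "score": 0.99, "word": "live-birth prevalence", "start": 4, "end": 25},
    {"entity_group": "LOC", "score": 0.98, "word": "Estonia", "start": 55, "end": 62},
    {"entity_group": "STAT", "score": 0.40, "word": "53%", "start": 700, "end": 703},
    {"entity_group": "STAT", "score": 0.97, "word": "4.05 per 100,000 live births", "start": 600, "end": 628},
]

print(group_entities(mock_results, min_score=0.5))
```

In real use, `NER_pipeline(sample)` would take the place of `mock_results`; a suitable score threshold would depend on the application.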

Alternatively, if you have downloaded classify_abs.py, extract_abs.py, and [gard-id-name-synonyms.json](https://github.com/ncats/epi4GARD/blob/master/EpiExtract4GARD/gard-id-name-synonyms.json) from GitHub, you can test with this additional code:

import pandas as pd
import extract_abs
import classify_abs
pd.set_option('display.max_colwidth', None)

NER_pipeline = extract_abs.init_NER_pipeline()
GARD_dict, max_length = extract_abs.load_GARD_diseases()
nlp, nlpSci, nlpSci2, classify_model, classify_tokenizer = classify_abs.init_classify_model()


# Thin wrapper that passes the preloaded pipeline, dictionary, and models on every call.
def search(term, num_results=50):
    return extract_abs.search_term_extraction(term, num_results, NER_pipeline, GARD_dict, max_length, nlp, nlpSci, nlpSci2, classify_model, classify_tokenizer)

a = search(7058)
a

b = search('Santos Mateus Leal syndrome')
b

c = search('Fellman syndrome')
c

d = search('GARD:0009941')
d

e = search('Homocystinuria')
e

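✨ Key features

Fine-tuned: EpiExtract4GARD-v2 is fine-tuned from [BioBERT-base-cased](https://huggingface.co/dmis-lab/biobert-base-cased-v1.1) to recognize locations (LOC), epidemiologic types (EPI), and epidemiologic rates (STAT) as named entities.
Domain-specific: the model is fine-tuned on the EpiSet4NER-v2 dataset and focuses on extracting epidemiological information from rare disease abstracts.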

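📚 Detailed documentation

Model description

EpiExtract4GARD-v2 is a fine-tuned [BioBERT-base-cased](https://huggingface.co/dmis-lab/biobert-base-cased-v1.1) model for named entity recognition of locations (LOC), epidemiologic types (EPI), and epidemiologic rates (STAT). It was fine-tuned on the EpiSet4NER-v2 dataset to extract epidemiological information from rare disease abstracts. See the dataset documentation for details on the weakly supervised teaching method and on dataset biases and limitations. See EpiExtract4GARD on GitHub for details of the entire pipeline.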

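Training data

The model was trained on the EpiSet4NER-v2 dataset. See the dataset documentation for details on the weakly supervised teaching method and on dataset biases and limitations. The training dataset distinguishes between the beginning and the continuation of an entity, so that when two entities of the same type occur back to back, the model can output where the second entity begins. As in the dataset, each token is classified into one of the following classes: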

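Abbreviation	Description
O	Outside of a named entity
B-LOC	Beginning of a location
I-LOC	Inside of a location
B-EPI	Beginning of an epidemiologic type (e.g. "incidence", "prevalence", "occurrence")
I-EPI	Epidemiologic type that is not the beginning token
B-STAT	Beginning of an epidemiologic rate
I-STAT	Inside of an epidemiologic rate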
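
The B-/I- scheme in the table above can be decoded into entity spans roughly as follows. This is a minimal sketch rather than the model's own post-processing: `bio_to_spans` is a hypothetical helper, the tokens and tags are hand-written, and a stray I- tag with no matching open entity is treated as the start of a new entity (one common convention, assumed here).

```python
def bio_to_spans(tokens, tags):
    """Collapse token-level BIO tags into a list of (label, text) spans."""
    spans = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, tag_label = tag.partition("-")
        if prefix == "I" and tag_label == label:
            words.append(token)  # continuation of the open entity
            continue
        if label is not None:  # close whatever entity was open
            spans.append((label, " ".join(words)))
        if prefix in ("B", "I"):  # B- tag, or stray I- tag: start a new entity
            label, words = tag_label, [token]
        else:  # "O": outside any entity
            label, words = None, []
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


# Two adjacent locations: the second B-LOC marks where a new entity begins.
tokens = ["Estonia", "Finland", "prevalence", "of", "4.05", "per", "100,000"]
tags = ["B-LOC", "B-LOC", "B-EPI", "O", "B-STAT", "I-STAT", "I-STAT"]
print(bio_to_spans(tokens, tags))
```

Without the B- prefix on "Finland", the two adjacent locations would merge into a single span, which is the case the separate beginning tag is meant to resolve.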