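🚀 EpiExtract4GARD-v2
EpiExtract4GARD-v2 is a fine-tuned model for named entity recognition of locations (LOC), epidemiologic types (EPI), and epidemiologic rates (STAT). It extracts epidemiologic information from rare disease abstracts to support downstream research and analysis.
🚀 Quick Start
You can use the hosted inference API on the right with this test sentence: "Since 1947, 27 patients in Iceland have been diagnosed with phenylketonuria (PKU). The incidence from 1972 to 2008 was 1 in 8,400 live births."
The following example runs named entity recognition (NER) with a Transformers pipeline: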
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
# Load the fine-tuned token-classification model and its tokenizer.
model = AutoModelForTokenClassification.from_pretrained("ncats/EpiExtract4GARD")
tokenizer = AutoTokenizer.from_pretrained("ncats/EpiExtract4GARD")
# aggregation_strategy='simple' groups tokens that belong to the same entity into one span.
NER_pipeline = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy='simple')
sample = "The live-birth prevalence of mucopolysaccharidoses in Estonia. Previous studies on the prevalence of mucopolysaccharidoses (MPS) in different populations have shown considerable variations. There are, however, few data with regard to the prevalence of MPSs in Fenno-Ugric populations or in north-eastern Europe, except for a report about Scandinavian countries. A retrospective epidemiological study of MPSs in Estonia was undertaken, and live-birth prevalence of MPS patients born between 1985 and 2006 was estimated. The live-birth prevalence for all MPS subtypes was found to be 4.05 per 100,000 live births, which is consistent with most other European studies. MPS II had the highest calculated incidence, with 2.16 per 100,000 live births (4.2 per 100,000 male live births), forming 53% of all diagnosed MPS cases, and was twice as high as in other studied European populations. The second most common subtype was MPS IIIA, with a live-birth prevalence of 1.62 in 100,000 live births. With 0.27 out of 100,000 live births, MPS VI had the third-highest live-birth prevalence. No cases of MPS I were diagnosed in Estonia, making the prevalence of MPS I in Estonia much lower than in other European populations. MPSs are the third most frequent inborn error of metabolism in Estonia after phenylketonuria and galactosemia."
sample2 = "Early Diagnosis of Classic Homocystinuria in Kuwait through Newborn Screening: A 6-Year Experience. Kuwait is a small Arabian Gulf country with a high rate of consanguinity and where a national newborn screening program was expanded in October 2014 to include a wide range of endocrine and metabolic disorders. A retrospective study conducted between January 2015 and December 2020 revealed a total of 304,086 newborns have been screened in Kuwait. Six newborns were diagnosed with classic homocystinuria with an incidence of 1:50,000, which is not as high as in Qatar but higher than the global incidence. Molecular testing for five of them has revealed three previously reported pathogenic variants in the <i>CBS</i> gene, c.969G>A, p.(Trp323Ter); c.982G>A, p.(Asp328Asn); and the Qatari founder variant c.1006C>T, p.(Arg336Cys). This is the first study to review the screening of newborns in Kuwait for classic homocystinuria, starting with the detection of elevated blood methionine and providing a follow-up strategy for positive results, including plasma total homocysteine and amino acid analyses. Further, we have demonstrated an increase in the specificity of the current newborn screening test for classic homocystinuria by including the methionine to phenylalanine ratio along with the elevated methionine blood levels in first-tier testing. Here, we provide evidence that the newborn screening in Kuwait has led to the early detection of classic homocystinuria cases and enabled the affected individuals to lead active and productive lives."
# Sample 1 is from: Krabbi K, Joost K, Zordania R, Talvik I, Rein R, Huijmans JG, Verheijen FV, Õunap K. The live-birth prevalence of mucopolysaccharidoses in Estonia. Genet Test Mol Biomarkers. 2012 Aug;16(8):846-9. doi: 10.1089/gtmb.2011.0307. Epub 2012 Apr 5. PMID: 22480138; PMCID: PMC3422553.
# Sample 2 is from: Alsharhan H, Ahmed AA, Ali NM, Alahmad A, Albash B, Elshafie RM, Alkanderi S, Elkazzaz UM, Cyril PX, Abdelrahman RM, Elmonairy AA, Ibrahim SM, Elfeky YME, Sadik DI, Al-Enezi SD, Salloum AM, Girish Y, Al-Ali M, Ramadan DG, Alsafi R, Al-Rushood M, Bastaki L. Early Diagnosis of Classic Homocystinuria in Kuwait through Newborn Screening: A 6-Year Experience. Int J Neonatal Screen. 2021 Aug 17;7(3):56. doi: 10.3390/ijns7030056. PMID: 34449519; PMCID: PMC8395821.
NER_pipeline(sample)
NER_pipeline(sample2)
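With `aggregation_strategy='simple'`, a Transformers NER pipeline returns a list of dictionaries with the keys `entity_group`, `score`, `word`, `start`, and `end`. The sketch below shows one possible way to collect those predictions by entity type. The helper `group_entities`, its `min_score` parameter, and the mock predictions are illustrative assumptions, not part of the EpiExtract4GARD code or real model output.

```python
from collections import defaultdict
from typing import Dict, List


def group_entities(predictions: List[dict], min_score: float = 0.0) -> Dict[str, List[str]]:
    """Collect predicted entity text by entity type, dropping low-confidence hits."""
    grouped = defaultdict(list)
    for pred in predictions:
        if pred["score"] >= min_score:
            grouped[pred["entity_group"]].append(pred["word"])
    return dict(grouped)


# Hand-written stand-in for pipeline output (illustrative values, not real predictions).
mock_text = "The prevalence in Estonia was 4.05 per 100,000 live births."
mock_predictions = [
    {"entity_group": "EPI", "score": 0.99, "word": "prevalence", "start": 4, "end": 14},
    {"entity_group": "LOC", "score": 0.98, "word": "Estonia", "start": 18, "end": 25},
    {"entity_group": "STAT", "score": 0.60, "word": "4.05 per 100,000 live births", "start": 30, "end": 58},
]

print(group_entities(mock_predictions))
print(group_entities(mock_predictions, min_score=0.9))
```

In practice you would pass the result of `NER_pipeline(sample)` instead of `mock_predictions`.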
Alternatively, if you download classify_abs.py, extract_abs.py, and [gard-id-name-synonyms.json](https://github.com/ncats/epi4GARD/blob/master/EpiExtract4GARD/gard-id-name-synonyms.json) from GitHub, you can test with this additional code:
import pandas as pd
import extract_abs
import classify_abs
pd.set_option('display.max_colwidth', None)
NER_pipeline = extract_abs.init_NER_pipeline()
GARD_dict, max_length = extract_abs.load_GARD_diseases()
nlp, nlpSci, nlpSci2, classify_model, classify_tokenizer = classify_abs.init_classify_model()
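# Wrap extract_abs.search_term_extraction so only the search term and result count need to be passed.
def search(term, num_results=50):
    return extract_abs.search_term_extraction(term, num_results, NER_pipeline, GARD_dict, max_length, nlp, nlpSci, nlpSci2, classify_model, classify_tokenizer)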
a = search(7058)
a
b = search('Santos Mateus Leal syndrome')
b
c = search('Fellman syndrome')
c
d = search('GARD:0009941')
d
e = search('Homocystinuria')
e
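✨ Key Features
- Fine-tuned: EpiExtract4GARD-v2 is fine-tuned from [BioBERT-base-cased](https://huggingface.co/dmis-lab/biobert-base-cased-v1.1) to recognize three named entity types: locations (LOC), epidemiologic types (EPI), and epidemiologic rates (STAT).
- Domain-specific: the model is fine-tuned on the EpiSet4NER-v2 dataset and focuses on extracting epidemiologic information from rare disease abstracts.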
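📚 Documentation
Model description
EpiExtract4GARD-v2 is a fine-tuned [BioBERT-base-cased](https://huggingface.co/dmis-lab/biobert-base-cased-v1.1) model for named entity recognition of locations (LOC), epidemiologic types (EPI), and epidemiologic rates (STAT). It is fine-tuned on the EpiSet4NER-v2 dataset to extract epidemiologic information from rare disease abstracts. See the dataset documentation for details on the weakly supervised teaching method and on dataset biases and limitations. See EpiExtract4GARD on GitHub for details on the entire pipeline.
Training data
The model was trained on the EpiSet4NER dataset. See the dataset documentation for details on the weakly supervised teaching method and on dataset biases and limitations. The training dataset distinguishes between the beginning and the continuation of an entity, so that when two entities of the same type appear back to back, the model can output where the second one begins. As in the dataset, each token is classified into one of the following classes: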
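| Abbreviation | Description |
|---|---|
| O | Outside of a named entity |
| B-LOC | Beginning of a location |
| I-LOC | Inside of a location |
| B-EPI | Beginning of an epidemiologic type (e.g. "incidence", "prevalence", "occurrence") |
| I-EPI | Epidemiologic type token that is not the beginning token |
| B-STAT | Beginning of an epidemiologic rate |
| I-STAT | Inside of an epidemiologic rate |
| +More | Description pending |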
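The B-/I- distinction in the table is what lets back-to-back entities of the same type be separated. As a minimal sketch of how such tags decode into entity spans, the hypothetical helper `bio_to_spans` below walks token/tag pairs. It is a generic BIO decoder written for illustration, not the decoding code used by EpiExtract4GARD, and the tokens and tags are hand-written.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Decode parallel token/tag lists into (entity_type, text) spans."""
    spans = []
    current_type = None
    current_tokens: List[str] = []

    def flush():
        # Emit the entity collected so far, if any.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_type):
            # A B- tag (or a stray I- tag of a different type) starts a new entity.
            flush()
            current_type, current_tokens = label, [token]
        elif prefix == "I":
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # "O" and any other tag are treated as outside an entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


tokens = ["Estonia", "Finland", "prevalence", "of", "4.05", "per", "100,000"]
tags = ["B-LOC", "B-LOC", "B-EPI", "O", "B-STAT", "I-STAT", "I-STAT"]
print(bio_to_spans(tokens, tags))
```

Because "Finland" carries its own B-LOC tag, it is emitted as a separate location rather than being merged with "Estonia".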
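EpiSet Statistics
Beyond any limitations inherited from the EpiSet4NER dataset, this model has limited numeracy because BERT-based models use subword embeddings. Numeracy is crucial for identifying epidemiologic rates, so this weakness limits the entity-level results. Recent numeracy techniques could improve the model's performance without improving the underlying dataset.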
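To illustrate why subword tokenization weakens numeracy, the sketch below implements greedy longest-match-first WordPiece splitting, the scheme used by BERT-style tokenizers. The vocabulary `TOY_VOCAB` is made up for this example and is not BioBERT's real vocabulary, so the exact pieces a real tokenizer produces will differ.

```python
from typing import List, Set

# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB: Set[str] = {"prevalence", "8", "##4", "##0", "##400", "1", "##00"}


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into subword pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


print(wordpiece("prevalence", TOY_VOCAB))
print(wordpiece("8400", TOY_VOCAB))
```

A common word stays whole, while a number is broken into fragments that carry little information about its magnitude.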
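Training procedure
The model was trained on an [AWS EC2 p3.2xlarge](https://aws.amazon.com/ec2/instance-types/) instance, which uses a single Tesla V100 GPU, with the following hyperparameters:
- 4 epochs of training (AdamW weight decay = 0.05) with a batch size of 16.
- Maximum sequence length = 192.
- The model was fed one sentence at a time.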
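Since the model saw one sentence at a time during training, one option at inference time is to split an abstract into sentences, run NER on each, and shift the character offsets back to the full text. This is an assumption about usage, not the documented pipeline (the quick-start example passes whole abstracts). The helpers `split_sentences` and `ner_per_sentence` are hypothetical, the punctuation-based splitter is deliberately naive, and `fake_ner` stands in for a real NER callable.

```python
import re
from typing import Callable, Dict, List, Tuple


def split_sentences(text: str) -> List[Tuple[int, str]]:
    """Return (start_offset, sentence) pairs, splitting on whitespace after . ! or ?"""
    spans = []
    start = 0
    for boundary in re.finditer(r"(?<=[.!?])\s+", text):
        spans.append((start, text[start:boundary.start()]))
        start = boundary.end()
    if start < len(text):
        spans.append((start, text[start:]))
    return spans


def ner_per_sentence(text: str, ner_fn: Callable[[str], List[Dict]]) -> List[Dict]:
    """Run ner_fn on each sentence and map entity offsets back to the full text."""
    results = []
    for offset, sentence in split_sentences(text):
        for entity in ner_fn(sentence):
            shifted = dict(entity)
            shifted["start"] = entity["start"] + offset
            shifted["end"] = entity["end"] + offset
            results.append(shifted)
    return results


def fake_ner(sentence: str) -> List[Dict]:
    """Hypothetical stand-in for an NER callable: tags the literal word 'Estonia'."""
    return [
        {"entity_group": "LOC", "word": m.group(), "start": m.start(), "end": m.end()}
        for m in re.finditer("Estonia", sentence)
    ]


abstract = "Prevalence was 4.05 per 100,000. It was higher in Estonia."
for entity in ner_per_sentence(abstract, fake_ner):
    print(entity["entity_group"], abstract[entity["start"]:entity["end"]])
```

A decimal such as "4.05" is not split because no whitespace follows its period, but abbreviations like "e.g." would be, so a proper sentence segmenter is preferable for real abstracts.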
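🔧 Technical Details
Model limitations
- Numeracy: because BERT-based models use subword embeddings, the model handles numbers poorly. Numeracy is crucial for identifying epidemiologic rates, so this limits the entity-level results.
- Dataset: performance is bounded by the EpiSet4NER dataset, including its weakly supervised teaching method and its biases.
Suggested improvements
Recent numeracy techniques could improve the model's performance without improving the underlying dataset.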
📄 License
This project is released under an "other" license. See the related documentation for the specific license terms.








