EpiExtract4GARD-v2: an open-source model for identifying epidemiological information in rare disease abstracts

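EpiExtract4GARD-v2

Developed by NCATS

A named entity recognition model fine-tuned from BioBERT that identifies epidemiological information in rare disease abstracts.

Token classification

Transformers

Language: English. License: Other. Tags: #RareDiseaseEpidemiology #BiomedicalEntityRecognition #BioBERTFineTuning

Downloads: 34

Released: 3/2/2022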

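Model overview

The model identifies locations (LOC), epidemiologic types (EPI), and epidemiologic rates (STAT) in text, and is aimed at extracting epidemiological data in the rare disease domain.

Model features

Epidemiological information extraction

Optimized for epidemiological data in the rare disease domain; it identifies key measures such as incidence and prevalence.

Weakly supervised learning

Trained with a weakly supervised teaching method, which suits settings where labeled data is limited.

Multi-entity recognition

Recognizes three entity types at once: locations, epidemiologic types, and epidemiologic rates.

Model capabilities

Identify epidemiologic types

Extract epidemiologic rate data

Locate the relevant geographic areas

Process rare disease text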

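Use cases

Medical research

Rare disease epidemiology research

Extract incidence, prevalence, and similar data for rare diseases from the medical literature

Automatically identifies epidemiological figures such as "4.05 per 100,000 live births"

Disease surveillance

Track the occurrence of a specific disease in a specific region

Identifies case information such as "27 patients have been diagnosed with PKU in Iceland"

Public health

Disease burden assessment

Assess the burden of a disease in different regions

Incidence rates can be compared across regions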
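Comparing rates across regions requires putting the extracted STAT strings on a common scale. The following is a minimal sketch of one way to do that, assuming rates appear in forms such as "4.05 per 100,000 live births" or "1:50,000". The helper `rate_per_100k` is hypothetical and not part of EpiExtract4GARD; it ignores the denominator population (live births, male live births, and so on), which a real comparison would have to keep track of.

```python
import re

# Hypothetical helper (not part of EpiExtract4GARD): matches "<cases> per <population>",
# "<cases> in <population>", "<cases> out of <population>", "<cases>:<population>",
# or "<cases>/<population>".
_RATE = re.compile(
    r"(?P<num>\d[\d,]*(?:\.\d+)?)\s*(?:per|in|out of|:|/)\s*(?P<den>\d[\d,]*(?:\.\d+)?)",
    re.IGNORECASE,
)


def rate_per_100k(text):
    """Return the first rate found in `text` as cases per 100,000, or None."""
    match = _RATE.search(text)
    if match is None:
        return None
    # Strip thousands separators before converting to numbers.
    num = float(match.group("num").replace(",", ""))
    den = float(match.group("den").replace(",", ""))
    if den == 0:
        return None
    return num / den * 100_000


# Rate strings taken from the two sample abstracts quoted in this document.
for stat in ["4.05 per 100,000 live births", "1:50,000"]:
    print(stat, "->", round(rate_per_100k(stat), 2), "per 100,000")
```

Strings without a recognizable ratio, such as a bare patient count, return `None` rather than a guessed value.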

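🚀 EpiExtract4GARD-v2

EpiExtract4GARD-v2 is a fine-tuned model for named entity recognition of locations (LOC), epidemiologic types (EPI), and epidemiologic rates (STAT). It extracts epidemiological information from rare disease abstracts to support related research and analysis.

🚀 Quick start

You can try the model with the hosted inference API on the right, using this test sentence: "Since 1947, 27 patients in Iceland have been diagnosed with phenylketonuria (PKU). The incidence for 1972-2008 was 1 per 8,400 live births."

The following code runs named entity recognition (NER) with the Transformers pipeline: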

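from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Repository ID as given in the source. The v2 checkpoint may be published under
# a different ID (assumption; verify on the Hugging Face Hub before use).
model = AutoModelForTokenClassification.from_pretrained("ncats/EpiExtract4GARD")
tokenizer = AutoTokenizer.from_pretrained("ncats/EpiExtract4GARD")
# aggregation_strategy='simple' merges word pieces into whole entity spans.
NER_pipeline = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy='simple')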

sample = "The live-birth prevalence of mucopolysaccharidoses in Estonia. Previous studies on the prevalence of mucopolysaccharidoses (MPS) in different populations have shown considerable variations. There are, however, few data with regard to the prevalence of MPSs in Fenno-Ugric populations or in north-eastern Europe, except for a report about Scandinavian countries. A retrospective epidemiological study of MPSs in Estonia was undertaken, and live-birth prevalence of MPS patients born between 1985 and 2006 was estimated. The live-birth prevalence for all MPS subtypes was found to be 4.05 per 100,000 live births, which is consistent with most other European studies. MPS II had the highest calculated incidence, with 2.16 per 100,000 live births (4.2 per 100,000 male live births), forming 53% of all diagnosed MPS cases, and was twice as high as in other studied European populations. The second most common subtype was MPS IIIA, with a live-birth prevalence of 1.62 in 100,000 live births. With 0.27 out of 100,000 live births, MPS VI had the third-highest live-birth prevalence. No cases of MPS I were diagnosed in Estonia, making the prevalence of MPS I in Estonia much lower than in other European populations. MPSs are the third most frequent inborn error of metabolism in Estonia after phenylketonuria and galactosemia."
sample2 = "Early Diagnosis of Classic Homocystinuria in Kuwait through Newborn Screening: A 6-Year Experience. Kuwait is a small Arabian Gulf country with a high rate of consanguinity and where a national newborn screening program was expanded in October 2014 to include a wide range of endocrine and metabolic disorders. A retrospective study conducted between January 2015 and December 2020 revealed a total of 304,086 newborns have been screened in Kuwait. Six newborns were diagnosed with classic homocystinuria with an incidence of 1:50,000, which is not as high as in Qatar but higher than the global incidence. Molecular testing for five of them has revealed three previously reported pathogenic variants in the <i>CBS</i> gene, c.969G>A, p.(Trp323Ter); c.982G>A, p.(Asp328Asn); and the Qatari founder variant c.1006C>T, p.(Arg336Cys). This is the first study to review the screening of newborns in Kuwait for classic homocystinuria, starting with the detection of elevated blood methionine and providing a follow-up strategy for positive results, including plasma total homocysteine and amino acid analyses. Further, we have demonstrated an increase in the specificity of the current newborn screening test for classic homocystinuria by including the methionine to phenylalanine ratio along with the elevated methionine blood levels in first-tier testing. Here, we provide evidence that the newborn screening in Kuwait has led to the early detection of classic homocystinuria cases and enabled the affected individuals to lead active and productive lives."
# Sample 1 is from: Krabbi K, Joost K, Zordania R, Talvik I, Rein R, Huijmans JG, Verheijen FV, Õunap K. The live-birth prevalence of mucopolysaccharidoses in Estonia. Genet Test Mol Biomarkers. 2012 Aug;16(8):846-9. doi: 10.1089/gtmb.2011.0307. Epub 2012 Apr 5. PMID: 22480138; PMCID: PMC3422553.
# Sample 2 is from: Alsharhan H, Ahmed AA, Ali NM, Alahmad A, Albash B, Elshafie RM, Alkanderi S, Elkazzaz UM, Cyril PX, Abdelrahman RM, Elmonairy AA, Ibrahim SM, Elfeky YME, Sadik DI, Al-Enezi SD, Salloum AM, Girish Y, Al-Ali M, Ramadan DG, Alsafi R, Al-Rushood M, Bastaki L. Early Diagnosis of Classic Homocystinuria in Kuwait through Newborn Screening: A 6-Year Experience. Int J Neonatal Screen. 2021 Aug 17;7(3):56. doi: 10.3390/ijns7030056. PMID: 34449519; PMCID: PMC8395821.

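# Print the results so they are also visible when run as a script, not only in a notebook.
print(NER_pipeline(sample))
print(NER_pipeline(sample2))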
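With `aggregation_strategy='simple'`, the Transformers NER pipeline returns a list of dictionaries with the keys `entity_group`, `score`, `word`, `start`, and `end`. The sketch below shows one way to group that output by entity type. The function `group_entities` and its `min_score` parameter are hypothetical helpers, and the input list is hand-written for illustration; it is not real model output.

```python
from collections import defaultdict


def group_entities(results, min_score=0.0):
    """Collect entity text by label from aggregated NER pipeline output."""
    grouped = defaultdict(list)
    for item in results:
        # Skip predictions below the (hypothetical) confidence threshold.
        if item["score"] >= min_score:
            grouped[item["entity_group"]].append(item["word"])
    return dict(grouped)


# Illustrative input in the pipeline's output format for the sentence
# "The prevalence in Estonia was 4.05 per 100,000 live births."
# Scores are invented; these are NOT real predictions.
example_output = [
    {"entity_group": "EPI", "score": 0.99, "word": "prevalence", "start": 4, "end": 14},
    {"entity_group": "LOC", "score": 0.98, "word": "Estonia", "start": 18, "end": 25},
    {"entity_group": "STAT", "score": 0.60, "word": "4.05 per 100,000 live births", "start": 30, "end": 58},
]

print(group_entities(example_output))
print(group_entities(example_output, min_score=0.9))
```

In practice the same function could be applied directly to `NER_pipeline(sample)`; a threshold is optional and its value would need to be chosen by inspecting real predictions.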

Alternatively, if you download classify_abs.py, extract_abs.py, and [gard-id-name-synonyms.json](https://github.com/ncats/epi4GARD/blob/master/EpiExtract4GARD/gard-id-name-synonyms.json) from GitHub, you can test with this additional code:

import pandas as pd
import extract_abs
import classify_abs

# Show full cell contents instead of truncating long text.
pd.set_option('display.max_colwidth', None)

# Load the NER pipeline, the GARD disease dictionary, and the classification models once.
NER_pipeline = extract_abs.init_NER_pipeline()
GARD_dict, max_length = extract_abs.load_GARD_diseases()
nlp, nlpSci, nlpSci2, classify_model, classify_tokenizer = classify_abs.init_classify_model()


def search(term, num_results=50):
    # Thin wrapper so the loaded objects do not have to be passed on every call.
    return extract_abs.search_term_extraction(
        term, num_results, NER_pipeline, GARD_dict, max_length,
        nlp, nlpSci, nlpSci2, classify_model, classify_tokenizer)


# The examples pass an integer, disease names, and a 'GARD:' identifier string.
a = search(7058)
a

b = search('Santos Mateus Leal syndrome')
b

c = search('Fellman syndrome')
c

d = search('GARD:0009941')
d

e = search('Homocystinuria')
e

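✨ Key features

Fine-tuned: EpiExtract4GARD-v2 is fine-tuned from the [BioBERT-base-cased](https://huggingface.co/dmis-lab/biobert-base-cased-v1.1) model and recognizes the named entities location (LOC), epidemiologic type (EPI), and epidemiologic rate (STAT).
Domain-specific: the model is fine-tuned on the EpiSet4NER-v2 dataset and focuses on extracting epidemiological information from rare disease abstracts.

📦 Installation

The source documentation gives no installation steps, so none are provided here.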

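📚 Documentation

Model description

EpiExtract4GARD-v2 is a fine-tuned [BioBERT-base-cased](https://huggingface.co/dmis-lab/biobert-base-cased-v1.1) model for named entity recognition of locations (LOC), epidemiologic types (EPI), and epidemiologic rates (STAT). It was fine-tuned on the EpiSet4NER-v2 dataset to extract epidemiological information from rare disease abstracts. See the dataset documentation for details on the weakly supervised teaching method and on dataset biases and limitations. See EpiExtract4GARD on GitHub for details on the entire pipeline.

Training data

The model was trained on the EpiSet4NER dataset. See the dataset documentation for details on the weakly supervised teaching method and on dataset biases and limitations. The training dataset distinguishes between the beginning and the continuation of an entity, so that when entities of the same type appear back to back, the model can output where the second entity begins. As in the dataset, each token is classified as one of the following classes: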

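Abbreviation	Description
O	Outside of a named entity
B-LOC	Beginning of a location
I-LOC	Inside a location
B-EPI	Beginning of an epidemiologic type (e.g. "incidence", "prevalence", "occurrence")
I-EPI	Epidemiologic type that is not the beginning token
B-STAT	Beginning of an epidemiologic rate
I-STAT	Inside an epidemiologic rate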
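The beginning/inside distinction in the tag set above is what allows two adjacent entities of the same type to be separated. The sketch below decodes token-level tags into entity spans to illustrate this. The function `bio_to_spans` is a hypothetical helper written for this explanation, not code from EpiExtract4GARD, and the tokens and tags are hand-written examples rather than model output.

```python
def bio_to_spans(tokens, tags):
    """Merge token-level B-/I-/O tags into (label, text) entity spans."""
    spans = []
    current_label, current_tokens = None, []

    def flush():
        # Close the span being built, if there is one.
        if current_tokens:
            spans.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")  # "B-LOC" -> ("B", "-", "LOC")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A B- tag always starts a new entity. An I- tag whose label does
            # not match the open span is treated as a new entity as well.
            flush()
            current_label, current_tokens = label, [token]
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O": the token is outside any entity.
            flush()
            current_label, current_tokens = None, []
    flush()
    return spans


# Two locations back to back: the second B-LOC marks where "Kuwait" begins.
tokens = ["Saudi", "Arabia", "Kuwait", "incidence"]
tags = ["B-LOC", "I-LOC", "B-LOC", "B-EPI"]
print(bio_to_spans(tokens, tags))
```

Without the B- prefix on "Kuwait", the three location tokens would be indistinguishable from a single three-token entity.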