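🚀 EpiExtract4GARD-v2
EpiExtract4GARD-v2 is a fine-tuned model for named entity recognition of locations (LOC), epidemiologic types (EPI), and epidemiologic rates (STAT). It extracts epidemiological information from rare disease abstracts to support related research and analysis.
🚀 Quick start
You can try the model with the hosted inference API on the right, using this test sentence: "Since 1947, 27 patients in Iceland have been diagnosed with phenylketonuria (PKU). The incidence from 1972 to 2008 was 1 per 8,400 live births."
The following example runs named entity recognition (NER) with a Transformers pipeline: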
```python
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Model ID as given in the source; check the Hugging Face Hub for the exact ID of the v2 checkpoint.
model = AutoModelForTokenClassification.from_pretrained("ncats/EpiExtract4GARD")
tokenizer = AutoTokenizer.from_pretrained("ncats/EpiExtract4GARD")
# aggregation_strategy='simple' groups token-level predictions into entity spans.
NER_pipeline = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy='simple')

sample = "The live-birth prevalence of mucopolysaccharidoses in Estonia. Previous studies on the prevalence of mucopolysaccharidoses (MPS) in different populations have shown considerable variations. There are, however, few data with regard to the prevalence of MPSs in Fenno-Ugric populations or in north-eastern Europe, except for a report about Scandinavian countries. A retrospective epidemiological study of MPSs in Estonia was undertaken, and live-birth prevalence of MPS patients born between 1985 and 2006 was estimated. The live-birth prevalence for all MPS subtypes was found to be 4.05 per 100,000 live births, which is consistent with most other European studies. MPS II had the highest calculated incidence, with 2.16 per 100,000 live births (4.2 per 100,000 male live births), forming 53% of all diagnosed MPS cases, and was twice as high as in other studied European populations. The second most common subtype was MPS IIIA, with a live-birth prevalence of 1.62 in 100,000 live births. With 0.27 out of 100,000 live births, MPS VI had the third-highest live-birth prevalence. No cases of MPS I were diagnosed in Estonia, making the prevalence of MPS I in Estonia much lower than in other European populations. MPSs are the third most frequent inborn error of metabolism in Estonia after phenylketonuria and galactosemia."
sample2 = "Early Diagnosis of Classic Homocystinuria in Kuwait through Newborn Screening: A 6-Year Experience. Kuwait is a small Arabian Gulf country with a high rate of consanguinity and where a national newborn screening program was expanded in October 2014 to include a wide range of endocrine and metabolic disorders. A retrospective study conducted between January 2015 and December 2020 revealed a total of 304,086 newborns have been screened in Kuwait. Six newborns were diagnosed with classic homocystinuria with an incidence of 1:50,000, which is not as high as in Qatar but higher than the global incidence. Molecular testing for five of them has revealed three previously reported pathogenic variants in the <i>CBS</i> gene, c.969G>A, p.(Trp323Ter); c.982G>A, p.(Asp328Asn); and the Qatari founder variant c.1006C>T, p.(Arg336Cys). This is the first study to review the screening of newborns in Kuwait for classic homocystinuria, starting with the detection of elevated blood methionine and providing a follow-up strategy for positive results, including plasma total homocysteine and amino acid analyses. Further, we have demonstrated an increase in the specificity of the current newborn screening test for classic homocystinuria by including the methionine to phenylalanine ratio along with the elevated methionine blood levels in first-tier testing. Here, we provide evidence that the newborn screening in Kuwait has led to the early detection of classic homocystinuria cases and enabled the affected individuals to lead active and productive lives."

# Sample 1 is from: Krabbi K, Joost K, Zordania R, Talvik I, Rein R, Huijmans JG, Verheijen FV, Õunap K. The live-birth prevalence of mucopolysaccharidoses in Estonia. Genet Test Mol Biomarkers. 2012 Aug;16(8):846-9. doi: 10.1089/gtmb.2011.0307. Epub 2012 Apr 5. PMID: 22480138; PMCID: PMC3422553.
# Sample 2 is from: Alsharhan H, Ahmed AA, Ali NM, Alahmad A, Albash B, Elshafie RM, Alkanderi S, Elkazzaz UM, Cyril PX, Abdelrahman RM, Elmonairy AA, Ibrahim SM, Elfeky YME, Sadik DI, Al-Enezi SD, Salloum AM, Girish Y, Al-Ali M, Ramadan DG, Alsafi R, Al-Rushood M, Bastaki L. Early Diagnosis of Classic Homocystinuria in Kuwait through Newborn Screening: A 6-Year Experience. Int J Neonatal Screen. 2021 Aug 17;7(3):56. doi: 10.3390/ijns7030056. PMID: 34449519; PMCID: PMC8395821.

# Each call returns a list of entity dictionaries; print them so the script shows output outside a notebook.
print(NER_pipeline(sample))
print(NER_pipeline(sample2))
```
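The pipeline returns a flat list of entity dictionaries, so a small helper can make the result easier to read. The following is a minimal sketch, assuming the usual output shape of an aggregated Transformers token-classification pipeline (keys `entity_group`, `word`, `score`, `start`, `end`). The helper name `group_entities` and the `mock_output` list are hypothetical; the mock values are hand-written for illustration and are not real model predictions.

```python
from collections import defaultdict


def group_entities(ner_output, min_score=0.0):
    """Group aggregated NER results by entity label, dropping low-confidence spans."""
    grouped = defaultdict(list)
    for ent in ner_output:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Hand-written stand-in for pipeline output (illustrative values, NOT real predictions).
mock_output = [
    {"entity_group": "EPI", "word": "live-birth prevalence", "score": 0.99, "start": 4, "end": 25},
    {"entity_group": "LOC", "word": "Estonia", "score": 0.98, "start": 54, "end": 61},
    {"entity_group": "STAT", "word": "4.05 per 100,000 live births", "score": 0.91, "start": 600, "end": 628},
    {"entity_group": "LOC", "word": "Europe", "score": 0.40, "start": 300, "end": 306},
]

print(group_entities(mock_output, min_score=0.5))
```

With a loaded model, the same helper could be applied as `group_entities(NER_pipeline(sample))`. The score threshold is an illustrative choice, not a value recommended by the source.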
Alternatively, if you have downloaded classify_abs.py, extract_abs.py, and [gard-id-name-synonyms.json](https://github.com/ncats/epi4GARD/blob/master/EpiExtract4GARD/gard-id-name-synonyms.json) from GitHub, you can test with this additional code:
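```python
import pandas as pd
import extract_abs
import classify_abs

# Do not truncate wide text columns when pandas displays results.
pd.set_option('display.max_colwidth', None)

# Initialize the NER pipeline, the GARD disease dictionary, and the abstract classifier.
NER_pipeline = extract_abs.init_NER_pipeline()
GARD_dict, max_length = extract_abs.load_GARD_diseases()
nlp, nlpSci, nlpSci2, classify_model, classify_tokenizer = classify_abs.init_classify_model()

def search(term, num_results=50):
    return extract_abs.search_term_extraction(term, num_results, NER_pipeline, GARD_dict, max_length, nlp, nlpSci, nlpSci2, classify_model, classify_tokenizer)

# Search terms as used in the source: a numeric ID, disease names, and a 'GARD:'-prefixed ID.
# In a notebook, a bare variable name on the last line of a cell displays the result.
a = search(7058)
a
b = search('Santos Mateus Leal syndrome')
b
c = search('Fellman syndrome')
c
d = search('GARD:0009941')
d
e = search('Homocystinuria')
e
```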
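✨ Key features
- Fine-tuned from BioBERT: EpiExtract4GARD-v2 is fine-tuned from [BioBERT-base-cased](https://huggingface.co/dmis-lab/biobert-base-cased-v1.1) to recognize locations (LOC), epidemiologic types (EPI), and epidemiologic rates (STAT) as named entities.
- Domain-specific: the model is fine-tuned on the EpiSet4NER-v2 dataset and focuses on extracting epidemiological information from rare disease abstracts.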
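📚 Documentation
Model description
EpiExtract4GARD-v2 is a fine-tuned [BioBERT-base-cased](https://huggingface.co/dmis-lab/biobert-base-cased-v1.1) model for named entity recognition of locations (LOC), epidemiologic types (EPI), and epidemiologic rates (STAT). It was fine-tuned on the EpiSet4NER-v2 dataset to extract epidemiological information from rare disease abstracts. See the dataset card for details on the weakly supervised teaching method and on dataset biases and limitations. See EpiExtract4GARD on GitHub for details on the entire pipeline.
Training data
The model was trained on the EpiSet4NER dataset. See the dataset card for details on the weakly supervised teaching method and on dataset biases and limitations. The training dataset distinguishes between the beginning and the continuation of an entity, so that when two entities of the same type appear back to back, the model can output where the second entity begins. As in the dataset, each token is classified as one of the following classes: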
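Abbreviation | Description |
---|---|
O | Outside of a named entity |
B-LOC | Beginning of a location |
I-LOC | Inside of a location |
B-EPI | Beginning of an epidemiologic type (e.g. "incidence", "prevalence", "occurrence") |
I-EPI | Epidemiologic type token that is not the beginning token |
B-STAT | Beginning of an epidemiologic rate |
I-STAT | Inside of an epidemiologic rate |
+More | Description pending |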
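To make the role of the B- and I- prefixes concrete, here is a minimal sketch of collapsing token-level tags into entity spans. The function name `bio_to_spans` and the example tokens and tags are hypothetical and hand-written for illustration; this is not the post-processing code used by the project. Note how the two adjacent `B-LOC` tags produce two separate locations rather than one merged span.

```python
def bio_to_spans(tokens, tags):
    """Collapse token-level BIO tags into (label, text) entity spans."""
    spans = []
    current_label, current_tokens = None, []

    def flush():
        # Close the entity that is currently open, if any.
        nonlocal current_label, current_tokens
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))
        current_label, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, even right after one of the same type.
            flush()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity: treat as outside.
            flush()
    flush()
    return spans


# Hand-written example tags (not model output).
tokens = ["Prevalence", "in", "Estonia", "Latvia", "was", "4.05", "per", "100,000"]
tags = ["B-EPI", "O", "B-LOC", "B-LOC", "O", "B-STAT", "I-STAT", "I-STAT"]

print(bio_to_spans(tokens, tags))
```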
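EpiSet statistics
Beyond any limitations inherited from the EpiSet4NER dataset, this model is limited in numeracy because BERT-based models use subword embeddings. Numeracy is crucial for identifying epidemiologic rates, so this limits the entity-level results. Recent techniques in numeracy could improve the model's performance without improving the underlying dataset.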
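The numeracy limitation can be illustrated with a toy subword tokenizer. The sketch below implements greedy longest-match-first splitting in the style of WordPiece; the function name `wordpiece` and the vocabulary `toy_vocab` are hypothetical, and the vocabulary is not BioBERT's actual one. The point is only that numerically close values can be split into different sequences of pieces that carry no explicit notion of magnitude.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, in the style of WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary for illustration only (an assumption, not the real model vocabulary).
toy_vocab = {"1", "8", "84", "##0", "##00", "##4", "##5", "100", "prevalence"}

for number in ["8400", "8405"]:
    print(number, "->", wordpiece(number, toy_vocab))
```

Under this toy vocabulary, "8400" and "8405" share only their first piece, even though the values differ by less than 0.1%.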
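Training procedure
The model was trained on an [AWS EC2 p3.2xlarge](https://aws.amazon.com/ec2/instance-types/) instance, which uses a single Tesla V100 GPU, with the following hyperparameters:
- 4 epochs of training (AdamW weight decay = 0.05) with a batch size of 16.
- Maximum sequence length = 192.
- The model was fed one sentence at a time.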
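Because training used one sentence per input, it may be worth running inference sentence by sentence as well and then mapping the spans back to positions in the full abstract. The following is a minimal sketch under that assumption. The helpers `split_sentences` and `ner_by_sentence` are hypothetical, the regex splitter is deliberately naive (it will mis-split after abbreviations followed by a capitalized word), and `fake_ner` is a stand-in for a real pipeline so the example runs without downloading a model.

```python
import re

# A sentence ends at ., ! or ? when followed by whitespace and a capital letter,
# so decimals such as "4.05" do not trigger a split.
_SENTENCE = re.compile(r"\S.*?(?:[.!?](?=\s+[A-Z])|\Z)", flags=re.S)


def split_sentences(text):
    """Naive splitter returning (sentence, start_offset) pairs."""
    return [(m.group(), m.start()) for m in _SENTENCE.finditer(text)]


def ner_by_sentence(text, ner_fn):
    """Run ner_fn on each sentence and shift spans back to offsets in the full text."""
    results = []
    for sentence, offset in split_sentences(text):
        for ent in ner_fn(sentence):
            shifted = dict(ent)
            shifted["start"] = ent["start"] + offset
            shifted["end"] = ent["end"] + offset
            results.append(shifted)
    return results


def fake_ner(sentence):
    """Stand-in for a real pipeline: tags the literal word 'Estonia' as LOC."""
    i = sentence.find("Estonia")
    if i == -1:
        return []
    return [{"entity_group": "LOC", "word": "Estonia", "start": i, "end": i + len("Estonia")}]


abstract = "MPS is rare. The prevalence in Estonia was 4.05 per 100,000 live births."
entities = ner_by_sentence(abstract, fake_ner)
print(entities)
```

A Transformers NER pipeline is callable on a single string, so it could be passed in place of `fake_ner`. For production use, a dedicated sentence segmenter would be a safer choice than this regex.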
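🔧 Technical details
Model limitations
- Numeracy: because BERT-based models use subword embeddings, the model is limited in handling numbers. Numeracy is crucial for identifying epidemiologic rates, so this limits the entity-level results.
- Dataset: the model's performance is bounded by the EpiSet4NER dataset, including its weakly supervised teaching method and its dataset biases.
Suggested improvements
Recent techniques in numeracy could improve the model's performance without improving the underlying dataset.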
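📄 License
This project is released under a license categorized as "other". See the project documentation for the specific license terms.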








