BioBERT Disease NER Model
A disease NER model fine-tuned from BioBERT on the NCBI Disease dataset, delivering high-performance disease extraction (89.04% F1).
🚀 Quick Start
This disease NER model is fine-tuned from BioBERT on the NCBI Disease dataset. It reaches 98.64% accuracy and an 89.04% F1-score, and is optimized for identifying diseases, symptoms, and medical conditions in clinical and biomedical text.
✨ Features
- High Performance: 86.80% precision, 91.39% recall, 89.04% F1-score, and 98.64% accuracy.
- Fine-Tuned: Trained for 5 epochs on 6,800+ annotated examples, with consistently high validation scores.
- Intended Use: Extracts disease mentions from clinical and biomedical documents and supports healthcare AI systems and medical research automation.
📦 Installation
Install the Hugging Face Transformers library and a backend such as PyTorch:
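```bash
pip install transformers torch
```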
💻 Usage Examples
Basic Usage
You can use this model with the Hugging Face Transformers library:
Note: LABEL_0 corresponds to "O" (Outside), LABEL_1 to "B-Disease", and LABEL_2 to "I-Disease", following the BIO tagging format.
```python
from transformers import pipeline

# Load the fine-tuned BioBERT disease NER model.
nlp = pipeline(
    "ner",
    model="Ishan0612/biobert-ner-disease-ncbi",
    tokenizer="Ishan0612/biobert-ner-disease-ncbi",
    aggregation_strategy="simple",  # merge consecutive sub-word tokens with the same label
)

text = "The patient has signs of diabetes mellitus and chronic obstructive pulmonary disease."
results = nlp(text)

print("Extracted Medical Entities:")
for entity in results:
    print(f"{entity['word']} - ({entity['entity_group']})")
```
This should output:
```text
Extracted Medical Entities:
the patient has signs of - (LABEL_0)
diabetes - (LABEL_1)
mellitus - (LABEL_2)
and - (LABEL_0)
chronic - (LABEL_1)
obstructive pulmonary disease - (LABEL_2)
. - (LABEL_0)
```
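Because the checkpoint ships with generic LABEL_* names, downstream code often remaps them to readable BIO tags. A minimal post-processing sketch (the label_map dict is illustrative and reuses the results from the example above):

```python
# Rename the generic LABEL_* groups to the BIO tags documented above
# and drop non-entity spans.
label_map = {"LABEL_0": "O", "LABEL_1": "B-Disease", "LABEL_2": "I-Disease"}

for entity in results:
    tag = label_map.get(entity["entity_group"], entity["entity_group"])
    if tag != "O":  # keep only disease spans
        print(f"{entity['word']} - ({tag})")
```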
📚 Documentation
Model Performance
- Precision: 86.80%
- Recall: 91.39%
- F1-Score: 89.04%
- Accuracy: 98.64%
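As a consistency check, the F1-score is the harmonic mean of precision and recall, which matches the figures above: F1 = 2 · P · R / (P + R) = 2 · 0.8680 · 0.9139 / (0.8680 + 0.9139) ≈ 0.8904.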
Intended Use
- Extract disease mentions from clinical and biomedical documents.
- Support healthcare AI systems and medical research automation.
Training Data
This model was trained on the NCBI Disease dataset, which consists of 793 PubMed abstracts with 6,892 disease mentions.
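For inspection, the corpus is also mirrored on the Hugging Face Hub. A minimal sketch, assuming the Hub dataset id ncbi_disease and the 🤗 Datasets library:

```python
# Load the NCBI Disease corpus from the Hugging Face Hub (dataset id assumed).
# Each example carries word-level tokens and integer BIO tags
# (0 = O, 1 = B-Disease, 2 = I-Disease).
from datasets import load_dataset

ds = load_dataset("ncbi_disease")
print(ds)  # train / validation / test splits
example = ds["train"][0]
print(example["tokens"][:10], example["ner_tags"][:10])
```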
Technical Details
The model is fine-tuned from the BioBERT base model (dmis-lab/biobert-base-cased-v1.1) for 5 epochs. It uses the BIO tagging format, where LABEL_0 corresponds to "O" (Outside), LABEL_1 to "B-Disease", and LABEL_2 to "I-Disease".
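If you prefer to skip the pipeline helper, the checkpoint can also be driven directly through the token-classification classes; a minimal sketch, using the same model id as in the usage example:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "Ishan0612/biobert-ner-disease-ncbi"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

text = "The patient has signs of diabetes mellitus."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# One predicted label id per sub-word token, following the BIO scheme above.
pred_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, pred_ids):
    print(token, model.config.id2label[label_id.item()])
```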
📄 License
This model is licensed under the Apache 2.0 License, the same as the original BioBERT (dmis-lab/biobert-base-cased-v1.1).
📚 Citation
```bibtex
@article{lee2020biobert,
  title={BioBERT: a pre-trained biomedical language representation model for biomedical text mining},
  author={Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and Kim, Donghyeon and So, Chan Ho and Kang, Jaewoo},
  journal={Bioinformatics},
  volume={36},
  number={4},
  pages={1234--1240},
  year={2020},
  publisher={Oxford University Press}
}
```
| Property | Details |
|----------|---------|
| Model Type | Token Classification |
| Training Data | NCBI Disease dataset (793 PubMed abstracts with 6,892 disease mentions) |
| Base Model | dmis-lab/biobert-base-cased-v1.1 |
| License | Apache 2.0 |
| Metrics | F1, Precision, Recall, Accuracy |