🚀 BIOMed_NER: Named Entity Recognition for Biomedical Entities
BIOMed_NER is a Named Entity Recognition (NER) model that uses DeBERTaV3 to identify biomedical entities. It's highly useful for extracting structured information from clinical text, such as diseases, procedures, medications, and anatomical terms.
🚀 Quick Start
You can use the Hugging Face pipeline for easy inference:
```python
from transformers import pipeline

model_path = "Helios9/BIOMed_NER"

pipe = pipeline(
    task="token-classification",
    model=model_path,
    tokenizer=model_path,
    aggregation_strategy="simple"
)

text = ("A 48-year-old female presented with vaginal bleeding and abnormal Pap smears. "
        "Upon diagnosis of invasive non-keratinizing SCC of the cervix, she underwent a radical "
        "hysterectomy with salpingo-oophorectomy which demonstrated positive spread to the pelvic "
        "lymph nodes and the parametrium.")

result = pipe(text)
print(result)
```
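The `aggregation_strategy="simple"` setting tells the pipeline to group sub-word tokens back into entity spans rather than returning one prediction per token. To further merge adjacent spans of the same type into longer mentions, see the Advanced Usage example below.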
✨ Features
Why DeBERTa for Biomedical NER?
DeBERTa (Decoding-enhanced BERT with Disentangled Attention) represents a significant leap forward in NLP model architecture, particularly for nuanced tasks like Named Entity Recognition (NER) in complex domains such as biomedical texts. Here’s why DeBERTa was the ideal choice for BIOMed_NER:
- Advanced Disentangled Attention Mechanism: DeBERTa goes beyond traditional transformers by using a disentangled attention mechanism that encodes word content and word position separately. This lets it capture the contextual meaning of biomedical terms and understand complex sentence structures, which is essential for accurately tagging biomedical entities whose names often overlap or are highly specific. (A simplified sketch of this mechanism follows the list.)
- Enhanced Embedding for Richer Contextual Understanding: Biomedical text often contains long sentences, specialized terminology, and hierarchical relationships between entities (e.g., "diabetes" vs. "Type 1 diabetes"). DeBERTa's improved embedding layer captures these nuanced relationships better than earlier transformer models, making it especially effective on context-rich medical documents.
- Superior Performance on Downstream NLP Tasks: DeBERTa consistently ranks among the top models on NLP benchmarks such as GLUE and SQuAD, a testament to its ability to generalize across tasks. This matters for BIOMed_NER, where recognizing subtle differences between biomedical entities directly affects the quality of the structured data extracted from unstructured clinical notes.
- Pre-trained for Optimal Transfer Learning: The "base" DeBERTaV3 variant is pre-trained on vast amounts of text, providing an excellent foundation for fine-tuning on domain-specific biomedical data. This pre-training, combined with fine-tuning on the dataset, lets BIOMed_NER accurately distinguish biomedical entities, from diseases and medications to clinical events and anatomical structures.
- Efficient Fine-Tuning for Large Biomedical Datasets: DeBERTa is optimized for both accuracy and efficiency, making it practical to train on large, complex datasets without excessive computational resources. This means faster iterations during model development and a more accessible deployment pipeline.
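To make the disentangled-attention idea concrete, here is a minimal single-head sketch of the score computation described in the DeBERTa paper: attention is the sum of content-to-content, content-to-position, and position-to-content terms. All tensor names and sizes are illustrative; the real implementation adds multiple heads, masking, and further optimizations.

```python
import torch

seq_len, d = 6, 16
H = torch.randn(seq_len, d)       # content embeddings, one row per token
P = torch.randn(2 * seq_len, d)   # relative-position embeddings

# Separate projections for content and for position (single head, illustrative)
Wq_c, Wk_c = torch.randn(d, d), torch.randn(d, d)
Wq_r, Wk_r = torch.randn(d, d), torch.randn(d, d)

Qc, Kc = H @ Wq_c, H @ Wk_c       # content queries / keys
Qr, Kr = P @ Wq_r, P @ Wk_r       # position queries / keys

# Relative distance delta(i, j), shifted into the index range [0, 2*seq_len)
idx = torch.arange(seq_len)
rel = (idx[:, None] - idx[None, :]).clamp(-seq_len, seq_len - 1) + seq_len

c2c = Qc @ Kc.T                           # content-to-content scores
c2p = torch.gather(Qc @ Kr.T, 1, rel)     # content-to-position scores
p2c = torch.gather(Kc @ Qr.T, 1, rel).T   # position-to-content scores

scores = (c2c + c2p + p2c) / (3 * d) ** 0.5
attn = torch.softmax(scores, dim=-1)      # (seq_len, seq_len) attention map
```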
By selecting DeBERTa for BIOMed_NER, we've built a model that excels in understanding the intricate language of medicine, providing high accuracy and contextual depth essential for healthcare applications. Whether for researchers analyzing clinical data or applications structuring patient records, DeBERTa enables BIOMed_NER to extract, tag, and organize critical medical information effectively.
💻 Usage Examples
Basic Usage
Basic usage is identical to the Quick Start example above: construct the token-classification pipeline with `aggregation_strategy="simple"` and call it on your text.
Advanced Usage
The pipeline can split a single mention into several adjacent spans; the helper below merges consecutive spans of the same entity type back into one span.

```python
from transformers import pipeline


def merge_consecutive_entities(entities, text):
    """Merge overlapping or touching spans of the same entity type into one span."""
    entities = sorted(entities, key=lambda x: x["start"])
    merged_entities = []
    current_entity = None
    for entity in entities:
        if current_entity is None:
            current_entity = entity
        elif (
            entity["entity_group"] == current_entity["entity_group"]
            and entity["start"] <= current_entity["end"]
        ):
            # Extend the current span and re-slice its surface form from the text
            current_entity["end"] = max(current_entity["end"], entity["end"])
            current_entity["word"] = text[current_entity["start"]:current_entity["end"]]
            current_entity["score"] = (current_entity["score"] + entity["score"]) / 2
        else:
            merged_entities.append(current_entity)
            current_entity = entity
    if current_entity:
        merged_entities.append(current_entity)
    return merged_entities


model_path = "Helios9/BIOMed_NER"

pipe = pipeline(
    task="token-classification",
    model=model_path,
    tokenizer=model_path,
    aggregation_strategy="simple"
)

text = ("A 48-year-old female presented with vaginal bleeding and abnormal Pap smears. "
        "Upon diagnosis of invasive non-keratinizing SCC of the cervix, she underwent a radical "
        "hysterectomy with salpingo-oophorectomy which demonstrated positive spread to the pelvic "
        "lymph nodes and the parametrium.")

result = pipe(text)
final_result = merge_consecutive_entities(result, text)
print(final_result)
```
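Note two design choices in this helper: spans are merged only when they overlap or touch (`entity["start"] <= current_entity["end"]`), and the merged score is a running pairwise average, so later fragments weigh more heavily than earlier ones. If you need an exact mean, accumulate the individual scores and divide once at the end.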
📚 Documentation
Hyperparameters
| Property | Details |
|----------|---------|
| Base Model | microsoft/deberta-v3-base |
| Learning Rate | 3e-5 |
| Batch Size | 8 |
| Gradient Accumulation Steps | 2 |
| Scheduler | Cosine schedule with warmup |
| Epochs | 30 |
| Optimizer | AdamW with betas (0.9, 0.999) and epsilon 1e-8 |
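As a rough guide, these settings map to Hugging Face `TrainingArguments` as sketched below. This is for orientation only, not the training script actually used for this model; `output_dir` and `warmup_ratio` in particular are placeholders, since the original values are not documented.

```python
from transformers import TrainingArguments

# Sketch of the table above expressed through the Trainer API.
training_args = TrainingArguments(
    output_dir="biomed_ner",     # placeholder
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    lr_scheduler_type="cosine",  # cosine schedule with warmup
    warmup_ratio=0.1,            # placeholder warmup fraction
    num_train_epochs=30,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```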
Output Example
The output will be a list of recognized entities with their entity type, score, and start/end positions in the text. Here’s a sample output format:
```json
[
  {
    "entity_group": "Disease_disorder",
    "score": 0.98,
    "word": "SCC of the cervix",
    "start": 63,
    "end": 80
  },
  ...
]
```
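If you need the entities organized by type rather than as a flat list, a minimal post-processing helper might look like this (the `group_entities` name is illustrative, not part of the model's API):

```python
from collections import defaultdict

def group_entities(entities):
    """Collect recognized surface forms under their entity type."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)

# e.g. {"Disease_disorder": ["SCC of the cervix", ...], ...}
```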
Use Cases
- Extracting clinical information from unstructured text in medical records.
- Structuring data for downstream biomedical research or applications.
- Assisting healthcare professionals by highlighting relevant biomedical entities.
This model is publicly available on Hugging Face and can be easily integrated into applications for medical text analysis.
🔧 Technical Details
The BIOMed_NER model uses the DeBERTaV3 architecture. DeBERTa's disentangled attention mechanism encodes word content and word position separately, enabling it to capture the contextual meaning of biomedical terms and to parse complex sentence structures. Its enhanced embedding layer helps with long sentences, specialized terminology, and hierarchical relationships in biomedical text. The pre-trained "base" DeBERTaV3 variant provides a solid foundation for fine-tuning on biomedical data, and its efficiency allows training on large datasets without excessive computational resources.
📄 License
License information is unknown.