🚀 BIOMed_NER: Named Entity Recognition for Biomedical Entities
BIOMed_NER is a Named Entity Recognition (NER) model that uses DeBERTaV3 to identify biomedical entities. It's highly useful for extracting structured information from clinical text, such as diseases, procedures, medications, and anatomical terms.
🚀 Quick Start
You can use the Hugging Face pipeline for easy inference:
```python
from transformers import pipeline

model_path = "Helios9/BIOMed_NER"

pipe = pipeline(
    task="token-classification",
    model=model_path,
    tokenizer=model_path,
    aggregation_strategy="simple"
)

text = ("A 48-year-old female presented with vaginal bleeding and abnormal Pap smears. "
        "Upon diagnosis of invasive non-keratinizing SCC of the cervix, she underwent a radical "
        "hysterectomy with salpingo-oophorectomy which demonstrated positive spread to the pelvic "
        "lymph nodes and the parametrium.")

result = pipe(text)
print(result)
```
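The `aggregation_strategy="simple"` setting tells the pipeline to group sub-word tokens back into entity spans rather than returning one prediction per token. To further merge adjacent spans of the same type into longer mentions, see the Advanced Usage example below.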
✨ Features
Why DeBERTa for Biomedical NER?
DeBERTa (Decoding-enhanced BERT with Disentangled Attention) represents a significant leap forward in NLP model architecture, particularly for nuanced tasks like Named Entity Recognition (NER) in complex domains such as biomedical texts. Here’s why DeBERTa was the ideal choice for BIOMed_NER:
- Advanced Disentangled Attention Mechanism: DeBERTa goes beyond traditional transformers by using a disentangled attention mechanism that encodes word content and word position separately. This lets it capture the contextual meaning of biomedical terms and understand complex sentence structures, which is essential for accurately tagging biomedical entities whose names often overlap or are highly specific. (A simplified sketch of this mechanism follows the list.)
- Enhanced Embedding for Richer Contextual Understanding: Biomedical text often contains long sentences, specialized terminology, and hierarchical relationships between entities (e.g., "diabetes" vs. "Type 1 diabetes"). DeBERTa's improved embedding layer captures these nuanced relationships better than earlier transformer models, making it especially effective on context-rich medical documents.
- Superior Performance on Downstream NLP Tasks: DeBERTa consistently ranks among the top models on NLP benchmarks such as GLUE and SQuAD, a testament to its ability to generalize across tasks. This matters for BIOMed_NER, where recognizing subtle differences between biomedical entities directly affects the quality of the structured data extracted from unstructured clinical notes.
- Pre-trained for Optimal Transfer Learning: The "base" DeBERTaV3 variant is pre-trained on vast amounts of text, providing an excellent foundation for fine-tuning on domain-specific biomedical data. This pre-training, combined with fine-tuning on the dataset, lets BIOMed_NER accurately distinguish biomedical entities, from diseases and medications to clinical events and anatomical structures.
- Efficient Fine-Tuning for Large Biomedical Datasets: DeBERTa is optimized for both accuracy and efficiency, making it practical to train on large, complex datasets without excessive computational resources. This means faster iterations during model development and a more accessible deployment pipeline.
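To make the disentangled-attention idea concrete, here is a minimal single-head sketch of the score computation described in the DeBERTa paper: attention is the sum of content-to-content, content-to-position, and position-to-content terms. All tensor names and sizes are illustrative; the real implementation adds multiple heads, masking, and further optimizations.

```python
import torch

seq_len, d = 6, 16
H = torch.randn(seq_len, d)       # content embeddings, one row per token
P = torch.randn(2 * seq_len, d)   # relative-position embeddings

# Separate projections for content and for position (single head, illustrative)
Wq_c, Wk_c = torch.randn(d, d), torch.randn(d, d)
Wq_r, Wk_r = torch.randn(d, d), torch.randn(d, d)

Qc, Kc = H @ Wq_c, H @ Wk_c       # content queries / keys
Qr, Kr = P @ Wq_r, P @ Wk_r       # position queries / keys

# Relative distance delta(i, j), shifted into the index range [0, 2*seq_len)
idx = torch.arange(seq_len)
rel = (idx[:, None] - idx[None, :]).clamp(-seq_len, seq_len - 1) + seq_len

c2c = Qc @ Kc.T                           # content-to-content scores
c2p = torch.gather(Qc @ Kr.T, 1, rel)     # content-to-position scores
p2c = torch.gather(Kc @ Qr.T, 1, rel).T   # position-to-content scores

scores = (c2c + c2p + p2c) / (3 * d) ** 0.5
attn = torch.softmax(scores, dim=-1)      # (seq_len, seq_len) attention map
```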
By selecting DeBERTa for BIOMed_NER, we've built a model that excels in understanding the intricate language of medicine, providing high accuracy and contextual depth essential for healthcare applications. Whether for researchers analyzing clinical data or applications structuring patient records, DeBERTa enables BIOMed_NER to extract, tag, and organize critical medical information effectively.
💻 Usage Examples
Basic Usage
Basic usage is identical to the Quick Start example above: construct the token-classification pipeline with `aggregation_strategy="simple"` and call it on your text.
Advanced Usage
The pipeline can split a single mention into several adjacent spans; the helper below merges consecutive spans of the same entity type back into one span.

```python
from transformers import pipeline


def merge_consecutive_entities(entities, text):
    """Merge overlapping or touching spans of the same entity type into one span."""
    entities = sorted(entities, key=lambda x: x["start"])
    merged_entities = []
    current_entity = None
    for entity in entities:
        if current_entity is None:
            current_entity = entity
        elif (
            entity["entity_group"] == current_entity["entity_group"]
            and entity["start"] <= current_entity["end"]
        ):
            # Extend the current span and re-slice its surface form from the text
            current_entity["end"] = max(current_entity["end"], entity["end"])
            current_entity["word"] = text[current_entity["start"]:current_entity["end"]]
            current_entity["score"] = (current_entity["score"] + entity["score"]) / 2
        else:
            merged_entities.append(current_entity)
            current_entity = entity
    if current_entity:
        merged_entities.append(current_entity)
    return merged_entities


model_path = "Helios9/BIOMed_NER"

pipe = pipeline(
    task="token-classification",
    model=model_path,
    tokenizer=model_path,
    aggregation_strategy="simple"
)

text = ("A 48-year-old female presented with vaginal bleeding and abnormal Pap smears. "
        "Upon diagnosis of invasive non-keratinizing SCC of the cervix, she underwent a radical "
        "hysterectomy with salpingo-oophorectomy which demonstrated positive spread to the pelvic "
        "lymph nodes and the parametrium.")

result = pipe(text)
final_result = merge_consecutive_entities(result, text)
print(final_result)
```
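Note two design choices in this helper: spans are merged only when they overlap or touch (`entity["start"] <= current_entity["end"]`), and the merged score is a running pairwise average, so later fragments weigh more heavily than earlier ones. If you need an exact mean, accumulate the individual scores and divide once at the end.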
📚 Documentation
Hyperparameters
| Property | Details |
|----------|---------|
| Base Model | microsoft/deberta-v3-base |
| Learning Rate | 3e-5 |
| Batch Size | 8 |
| Gradient Accumulation Steps | 2 |
| Scheduler | Cosine schedule with warmup |
| Epochs | 30 |
| Optimizer | AdamW with betas (0.9, 0.999) and epsilon 1e-8 |
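As a rough guide, these settings map to Hugging Face `TrainingArguments` as sketched below. This is for orientation only, not the training script actually used for this model; `output_dir` and `warmup_ratio` in particular are placeholders, since the original values are not documented.

```python
from transformers import TrainingArguments

# Sketch of the table above expressed through the Trainer API.
training_args = TrainingArguments(
    output_dir="biomed_ner",     # placeholder
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    lr_scheduler_type="cosine",  # cosine schedule with warmup
    warmup_ratio=0.1,            # placeholder warmup fraction
    num_train_epochs=30,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```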
Output Example
The output will be a list of recognized entities with their entity type, score, and start/end positions in the text. Here’s a sample output format:
```json
[
  {
    "entity_group": "Disease_disorder",
    "score": 0.98,
    "word": "SCC of the cervix",
    "start": 63,
    "end": 80
  },
  ...
]
```
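If you need the entities organized by type rather than as a flat list, a minimal post-processing helper might look like this (the `group_entities` name is illustrative, not part of the model's API):

```python
from collections import defaultdict

def group_entities(entities):
    """Collect recognized surface forms under their entity type."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)

# e.g. {"Disease_disorder": ["SCC of the cervix", ...], ...}
```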
Use Cases
- Extracting clinical information from unstructured text in medical records.
- Structuring data for downstream biomedical research or applications.
- Assisting healthcare professionals by highlighting relevant biomedical entities.
This model is publicly available on Hugging Face and can be easily integrated into applications for medical text analysis.
🔧 Technical Details
The BIOMed_NER model uses the DeBERTaV3 architecture. DeBERTa's disentangled attention mechanism encodes word content and word position separately, enabling it to capture the contextual meaning of biomedical terms and to parse complex sentence structures. Its enhanced embedding layer helps with long sentences, specialized terminology, and hierarchical relationships in biomedical text. The pre-trained "base" DeBERTaV3 variant provides a solid foundation for fine-tuning on biomedical data, and its efficiency allows training on large datasets without excessive computational resources.
📄 License
License information is unknown.