SD-NER Open-Source Named Entity Recognition Model - Free Deployment to Aid Information Extraction from English Texts in Life Sciences

Developed by EMBO

A named entity recognition model fine-tuned on English scientific texts in the life sciences domain, based on the RoBERTa base model

Sequence Labeling #Biological Entity Recognition #Life Science Texts #RoBERTa Fine-tuning

Release Time: 3/2/2022

Model Overview

This model identifies the biological entities used in the SourceData annotation system. It covers 7 entity types, including small molecules, gene products, and subcellular components.

Model Features

Specialized for Biomedical Domain

Optimized for life science literature, capable of accurately identifying biomedical entities

Multi-category Entity Recognition

Can identify 7 types of biomedical entities, including gene products and small molecules

Optimized Based on RoBERTa

Starts from the RoBERTa base model and is further trained on life sciences text before fine-tuning

Model Capabilities

Biomedical Entity Recognition

Scientific Text Analysis

Multi-category Classification

Use Cases

Biomedical Literature Analysis

Research Paper Entity Extraction

Extract key biological entities from papers in the life sciences domain

F1 score reaches 0.74 (micro-average)

Experimental Data Annotation

Automatically annotate key information such as experimental methods and cell types

Gene product recognition F1 score reaches 0.82

🚀 sd-ner

This model is designed for Named Entity Recognition (NER) of bioentities. It leverages a pre-trained RoBERTa base model and fine-tunes it on specific datasets to accurately identify various biological entities.

🚀 Quick Start

How to use

This model is intended for Named Entity Recognition of the biological entities used in SourceData annotations (https://sourcedata.embo.org): small molecules, gene products (genes and proteins), subcellular components, cell lines and cell types, organs and tissues, species, and experimental methods.

For a quick check of the model:

from transformers import pipeline, RobertaTokenizerFast, RobertaForTokenClassification

# Example figure legend, wrapped in RoBERTa's <s> ... </s> markers.
example = """<s> F. Western blot of input and eluates of Upf1 domains purification in a Nmd4-HA strain. The band with the # might corresponds to a dimer of Upf1-CH, bands marked with a star correspond to residual signal with the anti-HA antibodies (Nmd4). Fragments in the eluate have a smaller size because the protein A part of the tag was removed by digestion with the TEV protease. G6PDH served as a loading control in the input samples </s>"""
# The model must be paired with the roberta-base tokenizer.
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', max_len=512)
model = RobertaForTokenClassification.from_pretrained('EMBO/sd-ner')
ner = pipeline('ner', model, tokenizer=tokenizer)
res = ner(example)
# Print each predicted token together with its entity label.
for r in res:
    print(r['word'], r['entity'])
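
The loop above prints one line per sub-word token, so an entity such as a protein name can be spread over several lines. The following is a minimal sketch of merging those token-level predictions into entity spans. It assumes each prediction dict carries `entity`, `start` and `end` keys (character offsets, as the token-classification pipeline returns them with a fast tokenizer) and that labels follow the B-/I- scheme listed under the training features. The helper `group_entities` and the sample predictions are hypothetical and hand-written for illustration; they are not model output.

```python
from typing import Dict, List


def group_entities(text: str, predictions: List[Dict]) -> List[Dict]:
    """Merge token-level B-/I- predictions into entity spans via character offsets."""
    spans: List[Dict] = []
    current = None
    for pred in predictions:
        label = pred["entity"]
        if label == "O":
            current = None
            continue
        prefix, _, entity_type = label.partition("-")
        # Extend the open span only for an I- tag of the same type that is
        # adjacent to it (touching, or separated by a single character such
        # as a space). Anything else starts a new span.
        if (
            current is not None
            and prefix == "I"
            and current["type"] == entity_type
            and pred["start"] - current["end"] <= 1
        ):
            current["end"] = pred["end"]
        else:
            current = {"type": entity_type, "start": pred["start"], "end": pred["end"]}
            spans.append(current)
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


# Hand-written predictions (assumption: not produced by the model).
sample_text = "G6PDH and TEV protease"
sample_predictions = [
    {"entity": "B-GENEPROD", "start": 0, "end": 2},
    {"entity": "I-GENEPROD", "start": 2, "end": 5},
    {"entity": "B-GENEPROD", "start": 10, "end": 13},
    {"entity": "I-GENEPROD", "start": 14, "end": 22},
]
entities = group_entities(sample_text, sample_predictions)
for entity in entities:
    print(entity["type"], entity["text"])
```

With real model output, `group_entities(example, res)` would be the corresponding call. The adjacency check matters because the pipeline may omit tokens labelled `O`, which would otherwise let two separate mentions of the same type run together.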

✨ Features

Model description

This model is a [RoBERTa base model](https://huggingface.co/roberta-base) that was further trained using a masked language modeling task on a compendium of English scientific textual examples from the life sciences using the BioLang dataset. It was then fine-tuned for token classification on the SourceData [sd-nlp](https://huggingface.co/datasets/EMBO/sd-nlp) dataset with the NER configuration to perform Named Entity Recognition of bioentities.

Limitations and bias

The model must be used with the roberta-base tokenizer.

📚 Documentation

Training data

The model was trained for token classification using the [EMBO/sd-nlp dataset](https://huggingface.co/datasets/EMBO/sd-nlp), which includes manually annotated examples.

Training procedure

The training was run on an NVIDIA DGX Station with 4x Tesla V100 GPUs.

Training code is available at https://github.com/source-data/soda-roberta

Property	Details
Model fine-tuned	EMBO/bio-lm
Tokenizer vocab size	50265
Training data	EMBO/sd-nlp
Dataset configuration	NER
Training examples	48771
Evaluating examples	13801
Training features	O, I-SMALL_MOLECULE, B-SMALL_MOLECULE, I-GENEPROD, B-GENEPROD, I-SUBCELLULAR, B-SUBCELLULAR, I-CELL, B-CELL, I-TISSUE, B-TISSUE, I-ORGANISM, B-ORGANISM, I-EXP_ASSAY, B-EXP_ASSAY
Epochs	0.6
`per_device_train_batch_size`	16
`per_device_eval_batch_size`	16
`learning_rate`	0.0001
`weight_decay`	0.0
`adam_beta1`	0.9
`adam_beta2`	0.999
`adam_epsilon`	1e-08
`max_grad_norm`	1.0
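
The 15 training features in the table are the outside tag `O` plus an I-/B- pair for each of the 7 entity types. As a rough sketch, the label list and the id maps that a token-classification head needs could be rebuilt as below. The id order here simply follows the order of the table row, which is an assumption; the mapping actually stored with the model should be read from `model.config.id2label`. The function name `build_labels` is hypothetical.

```python
# Entity types in the order they appear in the "Training features" row.
ENTITY_TYPES = [
    "SMALL_MOLECULE",
    "GENEPROD",
    "SUBCELLULAR",
    "CELL",
    "TISSUE",
    "ORGANISM",
    "EXP_ASSAY",
]


def build_labels(entity_types):
    """Return ["O", "I-X", "B-X", ...] for the given entity types."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.extend([f"I-{entity_type}", f"B-{entity_type}"])
    return labels


LABELS = build_labels(ENTITY_TYPES)
# Assumed id order (table order); verify against model.config.id2label.
label2id = {label: index for index, label in enumerate(LABELS)}
id2label = {index: label for label, index in label2id.items()}

print(len(LABELS), "labels")
```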

Eval results

Evaluated on the 7178 examples of the test set with sklearn.metrics:

                precision    recall  f1-score   support

          CELL       0.69      0.81      0.74      5245
     EXP_ASSAY       0.56      0.57      0.56     10067
      GENEPROD       0.77      0.89      0.82     23587
      ORGANISM       0.72      0.82      0.77      3623
SMALL_MOLECULE       0.70      0.80      0.75      6187
   SUBCELLULAR       0.65      0.72      0.69      3700
        TISSUE       0.62      0.73      0.67      3207

     micro avg       0.70      0.79      0.74     55616
     macro avg       0.67      0.77      0.72     55616
  weighted avg       0.70      0.79      0.74     55616

{'test_loss': 0.1830928772687912, 'test_accuracy_score': 0.9334821000160841, 'test_precision': 0.6987463009514112, 'test_recall': 0.789682825086306, 'test_f1': 0.7414366506288511, 'test_runtime': 61.0547, 'test_samples_per_second': 117.567, 'test_steps_per_second': 1.851}
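
The figures above can be cross-checked with a few lines of arithmetic. This sketch recomputes the overall F1 as the harmonic mean of `test_precision` and `test_recall`, and rebuilds the macro and weighted averages from the per-class rows. Because the rows are rounded to two decimals, the recomputed averages are only expected to land within about 0.01 of the reported ones. The throughput check assumes evaluation batches of 16 per device spread across 4 GPUs; that is inferred from the hardware and batch size listed under the training procedure and is not stated in the source.

```python
import math

# Per-class rows copied from the report: (precision, recall, f1, support).
REPORT = {
    "CELL": (0.69, 0.81, 0.74, 5245),
    "EXP_ASSAY": (0.56, 0.57, 0.56, 10067),
    "GENEPROD": (0.77, 0.89, 0.82, 23587),
    "ORGANISM": (0.72, 0.82, 0.77, 3623),
    "SMALL_MOLECULE": (0.70, 0.80, 0.75, 6187),
    "SUBCELLULAR": (0.65, 0.72, 0.69, 3700),
    "TISSUE": (0.62, 0.73, 0.67, 3207),
}


def harmonic_f1(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total_support = sum(row[3] for row in REPORT.values())
macro_f1 = sum(row[2] for row in REPORT.values()) / len(REPORT)
weighted_f1 = sum(row[2] * row[3] for row in REPORT.values()) / total_support
overall_f1 = harmonic_f1(0.6987463009514112, 0.789682825086306)

# Assumption: 16 examples per device on 4 GPUs, i.e. 64 examples per step.
eval_steps = math.ceil(7178 / (16 * 4))
steps_per_second = eval_steps / 61.0547

print(f"support={total_support} macro={macro_f1:.3f} weighted={weighted_f1:.3f}")
print(f"overall_f1={overall_f1:.4f} steps={eval_steps} steps/s={steps_per_second:.3f}")
```

Under that batch-size assumption the recomputed step rate agrees with the reported `test_steps_per_second`, which suggests the evaluation did run data-parallel on all four GPUs.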

📄 License

The license of this model is agpl-3.0.
