# bio-lm
This is a language model specifically designed for the life sciences. It starts from a pre-trained RoBERTa model and continues training on a large-scale scientific dataset, providing strong capabilities for downstream tasks such as token classification.
## Quick Start

### Basic Usage

To quickly test this model in a fill-mask task, you can use the following code:
```python
from transformers import pipeline, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', max_len=512)
text = "Let us try this model to see if it <mask>."

fill_mask = pipeline(
    "fill-mask",
    model='EMBO/bio-lm',
    tokenizer=tokenizer
)

fill_mask(text)
```
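Calling `fill_mask(text)` returns a ranked list of candidate fills for the `<mask>` position, each with the predicted token, the completed sequence, and a confidence score.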
## Features

- Domain-Specific Training: This model is further trained on the BioLang dataset, which contains a large number of English scientific texts from the life sciences, making it more suitable for life-science-related tasks.
- Adaptable for Downstream Tasks: It is intended to be fine-tuned for downstream tasks, especially token classification.
## Usage Examples

### Basic Usage

The code in the Quick Start section above shows how to use the model for a fill-mask task.
### Advanced Usage

The main intended use of this model is fine-tuning for downstream tasks. For example, to fine-tune it for token classification, prepare your own labeled dataset and follow the standard Hugging Face fine-tuning workflow, as sketched below.
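As a concrete illustration, here is a minimal sketch of fine-tuning the checkpoint for token classification with the `Trainer` API. The label scheme, toy dataset, and training settings below are placeholders for illustration only, not the configuration used by the model authors.

```python
from datasets import Dataset
from transformers import (
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Hypothetical label scheme; replace with the labels of your own task.
labels = ["O", "B-ENTITY", "I-ENTITY"]

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base", max_len=512)
model = AutoModelForTokenClassification.from_pretrained(
    "EMBO/bio-lm", num_labels=len(labels)
)

# Toy dataset standing in for a real token-classification corpus.
raw = Dataset.from_dict({
    "tokens": [["GFP", "was", "expressed", "in", "HeLa", "cells"]],
    "ner_tags": [[1, 0, 0, 0, 1, 2]],
})

def tokenize_and_align(example):
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    # Label only the first sub-token of each word; ignore the rest (-100).
    aligned, previous = [], None
    for wid in enc.word_ids():
        if wid is None or wid == previous:
            aligned.append(-100)
        else:
            aligned.append(example["ner_tags"][wid])
        previous = wid
    enc["labels"] = aligned
    return enc

train_dataset = raw.map(tokenize_and_align, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bio-lm-tokcls", num_train_epochs=1),
    train_dataset=train_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

In practice you would swap in your own annotated corpus and label list, and add evaluation and checkpointing arguments as needed.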
## Documentation

### Model description

This model is a [RoBERTa base pre-trained model](https://huggingface.co/roberta-base) that was further trained with a masked language modeling objective on the BioLang dataset, a compendium of English scientific text examples from the life sciences.
### Intended uses & limitations

#### How to use

The intended use of this model is to be fine-tuned for downstream tasks, token classification in particular.

#### Limitations and bias

This model should be fine-tuned on a specific task like token classification. The model must be used with the `roberta-base` tokenizer.
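For reference, a minimal way to pair the checkpoint with the required `roberta-base` tokenizer is shown below; the masked language modeling head is used here, and a task-specific head such as `AutoModelForTokenClassification` would replace it when fine-tuning.

```python
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

# Load the roberta-base tokenizer explicitly, as required above.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base", max_len=512)
model = RobertaForMaskedLM.from_pretrained("EMBO/bio-lm")
```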
### Training data

The model was trained with a masked language modeling task on the BioLang dataset, which includes about 12 million examples from abstracts and figure legends extracted from papers published in the life sciences.
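If you want to inspect the training corpus yourself, a minimal sketch for loading it with the `datasets` library is shown below; the `"MLM"` configuration name is taken from the training command in the next section and may differ from the configurations currently published on the dataset page.

```python
from datasets import load_dataset

# "MLM" mirrors the configuration name used in the training command below;
# check the EMBO/biolang dataset card for the configurations actually available.
biolang = load_dataset("EMBO/biolang", "MLM")
print(biolang)
```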
### Training procedure

The training was run on an NVIDIA DGX Station with 4 × Tesla V100 GPUs.

Training code is available at https://github.com/source-data/soda-roberta.
- Command: `python -m lm.train /data/json/oapmc_abstracts_figs/ MLM`
- Tokenizer vocab size: 50,265
- Training data: EMBO/biolang MLM
- Training set: 12,005,390 examples
- Evaluation set: 36,713 examples
- Epochs: 3.0
- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- TensorBoard run: lm-MLM-2021-01-27T15-17-43.113766

End of training:
- Training set: loss 0.8653350830078125
- Validation set: eval_loss 0.8192330598831177, eval_recall 0.8154601116513597
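For readers who want to reproduce a comparable setup, the hyperparameters listed above map onto `transformers` `TrainingArguments` roughly as follows. This is a sketch, not the authors' exact training script (which lives in the soda-roberta repository), and the output directory is a placeholder.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./bio-lm-mlm",          # placeholder path
    num_train_epochs=3.0,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=5e-05,
    weight_decay=0.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    max_grad_norm=1.0,
)
```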
### Eval results

Evaluation on the test set:
- recall: 0.814471959728645
## Technical Details

- Hardware: NVIDIA DGX Station with 4 × Tesla V100 GPUs.
- Training task: masked language modeling.
- Dataset: BioLang, with about 12 million examples from life-science papers.
- Training parameters: epochs, batch size, learning rate, and the other hyperparameters are listed in the training procedure section above.