# BioMedLM 2.7B
BioMedLM 2.7B is a language model trained on biomedical text. It can achieve strong results in various biomedical NLP tasks and is suitable for research in the biomedical field.
## 🚀 Quick Start
This model is licensed under the BigScience Open RAIL-M license used for [BLOOM](https://huggingface.co/bigscience/bloom-1b1). It can be used to generate text for experimentation. However, note that the license prohibits using the model to "provide medical advice and medical results interpretation".
## ✨ Features
- High performance in biomedical NLP: achieves a new state-of-the-art result of 50.3% accuracy on the MedQA biomedical question-answering task.
- Natural language generation: can generate biomedical text for research and experimentation purposes.
## 📦 Installation
The original model card provides no installation instructions. The usage sketch in the next section assumes that PyTorch and the Hugging Face transformers library are installed (e.g. `pip install torch transformers`).
## 💻 Usage Examples
The original model card includes no code examples.
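Since no official snippet is given, here is a minimal text-generation sketch for orientation. It assumes the checkpoint and its tokenizer are available on the Hugging Face Hub under the identifier `stanford-crfm/BioMedLM` (an assumption, not stated in this card) and that `transformers` and `torch` are installed.

```python
# Minimal text-generation sketch (experimentation only; the license forbids
# using the model to provide medical advice or interpret medical results).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "stanford-crfm/BioMedLM"  # assumed Hub identifier, not stated in this card

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

prompt = "Photosynthesis is the process by which"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

A 2.7B-parameter model is most comfortably run on a GPU with sufficient memory; on CPU the same code should still work, only slowly.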
## 📚 Documentation

### Model Details

| Property | Details |
|----------|---------|
| Model Type | Language model |
| Training Data | PubMed Abstracts and Full Text from The Pile |

This model was previously known as PubMedGPT 2.7B, but the name was changed at the request of the NIH, which holds the trademark for "PubMed". It is a GPT-style model trained exclusively on biomedical abstracts and papers from The Pile.
### Uses
#### Direct Use
You can use this model to generate text, which is useful for experimenting with it and understanding its capabilities. It should not be used directly in production or for work that may directly impact people.
#### Downstream Use
The main way to use this model is to finetune it for downstream question-answering tasks, as illustrated in the sketch below.
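As a rough illustration of that route (not the recipe used for the reported MedQA result), the sketch below frames a single question-and-candidate-answer pair as binary sequence classification with `transformers`; the Hub identifier and the toy example are assumptions.

```python
# Illustrative finetuning setup: score one (question, candidate answer) pair as
# correct/incorrect via sequence classification. Not the authors' recipe; the
# Hub identifier and the toy example below are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "stanford-crfm/BioMedLM"  # assumed Hub identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token        # GPT-style checkpoints ship no pad token

model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

text = "Question: Deficiency of which vitamin causes scurvy? Answer: Vitamin C"
batch = tokenizer(text, return_tensors="pt")
labels = torch.tensor([1])                       # 1 = the candidate answer is correct

outputs = model(**batch, labels=labels)          # outputs.loss is what a finetuning loop would minimise
print(float(outputs.loss))
```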
#### Out-of-Scope Use
We do not recommend using this model for natural language generation in a production environment, whether finetuned or not.
### Bias, Risks, and Limitations
Predictions generated by the model may include disturbing and harmful stereotypes. Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf)).
## ⚠️ Important Note
This license forbids use of the model (or derivatives thereof) "To provide medical advice and medical results interpretation." If you are concerned that your use case would fall under the "letter" of this restriction, but not the "spirit," you can contact us to discuss.
## 💡 Usage Tip
We strongly recommend against using this model in production for natural language generation.
### Training Details
#### Training Data
This model was trained on the PubMed Abstracts and Full Text from The Pile.
#### Training Procedure
The model was trained on MosaicML Cloud using the Composer training library and PyTorch FSDP. It was trained across 128 A100-40GB GPUs for ~6.25 days, with batch size = 1024 and sequence length = 1024, for 300B tokens, using Decoupled AdamW with the following settings:

| Parameter | Value |
|-----------|-------|
| lr | 1.6e-4 |
| eps | 1e-8 |
| betas | [0.9, 0.95] |
| weight decay | 1.6e-5 |

The training process was smooth, and there were steady perplexity improvements on the validation and training sets throughout the training. Preliminary experiments showed improved downstream task performance as the model was trained to 300B tokens.
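For concreteness, the optimizer settings above can be instantiated with Composer's decoupled AdamW roughly as follows; this is a sketch, not the actual training script, and the tiny placeholder module stands in for the 2.7B model.

```python
# Sketch of the reported optimizer settings using Composer's DecoupledAdamW.
# Not the actual training configuration; the module below is a stand-in.
import torch
from composer.optim import DecoupledAdamW

model = torch.nn.Linear(8, 8)  # placeholder for the 2.7B-parameter model

optimizer = DecoupledAdamW(
    model.parameters(),
    lr=1.6e-4,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=1.6e-5,
)
```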
#### Preprocessing
The model uses a custom tokenizer trained on the PubMed Abstracts. Using a tokenizer trained on in-domain text can maximize performance on downstream tasks. For example, common biomedical terms are represented as entire tokens by the biomedical tokenizer, while they are split into multiple tokens by the standard GPT-2 tokenizer.
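That difference is easy to inspect directly; the comparison below assumes the custom tokenizer is published alongside the model under the (assumed) Hub identifier `stanford-crfm/BioMedLM`.

```python
# Compare how the in-domain tokenizer and the stock GPT-2 tokenizer split a
# common biomedical term. The BioMedLM Hub identifier is an assumption.
from transformers import AutoTokenizer

biomed_tok = AutoTokenizer.from_pretrained("stanford-crfm/BioMedLM")  # assumed identifier
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

term = "chromatography"
print(biomed_tok.tokenize(term))  # expected: a single (or very few) in-domain token(s)
print(gpt2_tok.tokenize(term))    # typically several sub-word pieces
```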
### Technical Specifications
#### Model Architecture and Objective
BioMedLM 2.7B is a standard GPT-2 implementation (trained with Flash Attention) with the following hyperparameters:

| Parameter | Value |
|-----------|-------|
| hidden size | 2560 |
| heads | 20 |
| layers | 32 |
| vocab size | 28896 |
| sequence length | 1024 |

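In `transformers`' GPT-2 terms, a configuration with this shape can be written as below; this is a shape-only sketch (randomly initialized, without the Flash Attention wiring), not the released configuration file.

```python
# GPT-2 style configuration matching the listed hyperparameters; a shape-only
# sketch (randomly initialized), not the released configuration file.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=28896,
    n_positions=1024,  # sequence length
    n_embd=2560,       # hidden size
    n_layer=32,        # layers
    n_head=20,         # attention heads
)
model = GPT2LMHeadModel(config)  # allocates ~2.7B fp32 parameters, so needs ample RAM
print(f"{model.num_parameters() / 1e9:.2f}B parameters")  # roughly 2.7B
```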
#### Compute Infrastructure
The model was trained on MosaicML Cloud using the Composer training library and PyTorch FSDP across 128 A100-40GB GPUs in ~6.25 days.
### Paper
BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text
## 🔧 Technical Details
The model uses a custom tokenizer trained on the PubMed Abstracts, which helps to maximize performance on downstream tasks by representing common biomedical terms as entire tokens. The training was conducted on MosaicML Cloud with the help of Composer and PyTorch FSDP, enabling multi-node training across 128 A100-40GB GPUs.
## 📄 License
This model is licensed under the bigscience-bloom-rail-1.0 license.