# BioMedLM 2.7B
BioMedLM 2.7B is a language model trained on biomedical text. It can achieve strong results in various biomedical NLP tasks and is suitable for research in the biomedical field.
## 🚀 Quick Start
This model is licensed under the BigScience Open RAIL-M license used for [BLOOM](https://huggingface.co/bigscience/bloom-1b1). It can be used to generate text for experimentation. However, note that the license prohibits using the model to "provide medical advice and medical results interpretation".
## ✨ Features
- High performance in biomedical NLP: achieves a new state-of-the-art result of 50.3% accuracy on the MedQA biomedical question-answering task.
- Natural language generation: can generate biomedical text for research and experimentation purposes.
## 📦 Installation
The original model card provides no installation instructions. The usage sketch in the next section assumes that PyTorch and the Hugging Face transformers library are installed (e.g. `pip install torch transformers`).
## 💻 Usage Examples
The original model card includes no code examples.
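Since no official snippet is given, here is a minimal text-generation sketch for orientation. It assumes the checkpoint and its tokenizer are available on the Hugging Face Hub under the identifier `stanford-crfm/BioMedLM` (an assumption, not stated in this card) and that `transformers` and `torch` are installed.

```python
# Minimal text-generation sketch (experimentation only; the license forbids
# using the model to provide medical advice or interpret medical results).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "stanford-crfm/BioMedLM"  # assumed Hub identifier, not stated in this card

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

prompt = "Photosynthesis is the process by which"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

A 2.7B-parameter model is most comfortably run on a GPU with sufficient memory; on CPU the same code should still work, only slowly.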
## 📚 Documentation

### Model Details

| Property | Details |
|----------|---------|
| Model Type | Language model |
| Training Data | PubMed Abstracts and Full Text from The Pile |

This model was previously known as PubMedGPT 2.7B, but the name was changed at the request of the NIH, which holds the trademark for "PubMed". It is a GPT-style model trained exclusively on biomedical abstracts and papers from The Pile.
### Uses
#### Direct Use
You can use this model to generate text, which is useful for experimenting with it and understanding its capabilities. It should not be used directly in production or for work that may directly impact people.
#### Downstream Use
The main way to use this model is to finetune it for downstream question-answering tasks, as illustrated in the sketch below.
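As a rough illustration of that route (not the recipe used for the reported MedQA result), the sketch below frames a single question-and-candidate-answer pair as binary sequence classification with `transformers`; the Hub identifier and the toy example are assumptions.

```python
# Illustrative finetuning setup: score one (question, candidate answer) pair as
# correct/incorrect via sequence classification. Not the authors' recipe; the
# Hub identifier and the toy example below are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "stanford-crfm/BioMedLM"  # assumed Hub identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token        # GPT-style checkpoints ship no pad token

model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

text = "Question: Deficiency of which vitamin causes scurvy? Answer: Vitamin C"
batch = tokenizer(text, return_tensors="pt")
labels = torch.tensor([1])                       # 1 = the candidate answer is correct

outputs = model(**batch, labels=labels)          # outputs.loss is what a finetuning loop would minimise
print(float(outputs.loss))
```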
#### Out-of-Scope Use
We do not recommend using this model for natural language generation in a production environment, whether finetuned or not.
### Bias, Risks, and Limitations
Predictions generated by the model may include disturbing and harmful stereotypes. Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf)).
## ⚠️ Important Note
This license forbids use of the model (or derivatives thereof) "To provide medical advice and medical results interpretation." If you are concerned that your use case would fall under the "letter" of this restriction, but not the "spirit," you can contact us to discuss.
## 💡 Usage Tip
We strongly recommend against using this model in production for natural language generation.
### Training Details
#### Training Data
This model was trained on the PubMed Abstracts and Full Text from The Pile.
#### Training Procedure
The model was trained on MosaicML Cloud using the Composer training library and PyTorch FSDP. It was trained across 128 A100-40GB GPUs for ~6.25 days, with batch size = 1024 and sequence length = 1024, for 300B tokens, using Decoupled AdamW with the following settings:

| Parameter | Value |
|-----------|-------|
| lr | 1.6e-4 |
| eps | 1e-8 |
| betas | [0.9, 0.95] |
| weight decay | 1.6e-5 |

The training process was smooth, and there were steady perplexity improvements on the validation and training sets throughout the training. Preliminary experiments showed improved downstream task performance as the model was trained to 300B tokens.
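For concreteness, the optimizer settings above can be instantiated with Composer's decoupled AdamW roughly as follows; this is a sketch, not the actual training script, and the tiny placeholder module stands in for the 2.7B model.

```python
# Sketch of the reported optimizer settings using Composer's DecoupledAdamW.
# Not the actual training configuration; the module below is a stand-in.
import torch
from composer.optim import DecoupledAdamW

model = torch.nn.Linear(8, 8)  # placeholder for the 2.7B-parameter model

optimizer = DecoupledAdamW(
    model.parameters(),
    lr=1.6e-4,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=1.6e-5,
)
```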
#### Preprocessing
The model uses a custom tokenizer trained on the PubMed Abstracts. Using a tokenizer trained on in-domain text can maximize performance on downstream tasks. For example, common biomedical terms are represented as entire tokens by the biomedical tokenizer, while they are split into multiple tokens by the standard GPT-2 tokenizer.
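That difference is easy to inspect directly; the comparison below assumes the custom tokenizer is published alongside the model under the (assumed) Hub identifier `stanford-crfm/BioMedLM`.

```python
# Compare how the in-domain tokenizer and the stock GPT-2 tokenizer split a
# common biomedical term. The BioMedLM Hub identifier is an assumption.
from transformers import AutoTokenizer

biomed_tok = AutoTokenizer.from_pretrained("stanford-crfm/BioMedLM")  # assumed identifier
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

term = "chromatography"
print(biomed_tok.tokenize(term))  # expected: a single (or very few) in-domain token(s)
print(gpt2_tok.tokenize(term))    # typically several sub-word pieces
```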
### Technical Specifications
#### Model Architecture and Objective
BioMedLM 2.7B is a standard GPT-2 implementation (trained with Flash Attention) with the following hyperparameters:

| Parameter | Value |
|-----------|-------|
| hidden size | 2560 |
| heads | 20 |
| layers | 32 |
| vocab size | 28896 |
| sequence length | 1024 |

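In `transformers`' GPT-2 terms, a configuration with this shape can be written as below; this is a shape-only sketch (randomly initialized, without the Flash Attention wiring), not the released configuration file.

```python
# GPT-2 style configuration matching the listed hyperparameters; a shape-only
# sketch (randomly initialized), not the released configuration file.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=28896,
    n_positions=1024,  # sequence length
    n_embd=2560,       # hidden size
    n_layer=32,        # layers
    n_head=20,         # attention heads
)
model = GPT2LMHeadModel(config)  # allocates ~2.7B fp32 parameters, so needs ample RAM
print(f"{model.num_parameters() / 1e9:.2f}B parameters")  # roughly 2.7B
```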
#### Compute Infrastructure
The model was trained on MosaicML Cloud using the Composer training library and PyTorch FSDP across 128 A100-40GB GPUs in ~6.25 days.
### Paper
BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text
## 🔧 Technical Details
The model uses a custom tokenizer trained on the PubMed Abstracts, which helps to maximize performance on downstream tasks by representing common biomedical terms as entire tokens. The training was conducted on MosaicML Cloud with the help of Composer and PyTorch FSDP, enabling multi-node training across 128 A100-40GB GPUs.
## 📄 License
This model is licensed under the bigscience-bloom-rail-1.0 license.