# bio-lm
This is a language model specifically designed for the life sciences. It starts from a pre-trained RoBERTa model and continues training on a large-scale scientific dataset, providing strong capabilities for downstream tasks such as token classification.
## Quick Start

### Basic Usage

To quickly test this model in a fill-mask task, you can use the following code:
```python
from transformers import pipeline, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', max_len=512)
text = "Let us try this model to see if it <mask>."

fill_mask = pipeline(
    "fill-mask",
    model='EMBO/bio-lm',
    tokenizer=tokenizer
)

fill_mask(text)
```
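Calling `fill_mask(text)` returns a ranked list of candidate fills for the `<mask>` position, each with the predicted token, the completed sequence, and a confidence score.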
## Features

- Domain-Specific Training: This model is further trained on the BioLang dataset, which contains a large number of English scientific texts from the life sciences, making it more suitable for life-science-related tasks.
- Adaptable for Downstream Tasks: It is intended to be fine-tuned for downstream tasks, especially token classification.
## Usage Examples

### Basic Usage

The code in the Quick Start section above shows how to use the model for a fill-mask task.
### Advanced Usage

The main intended use of this model is fine-tuning for downstream tasks. For example, to fine-tune it for token classification, prepare your own labeled dataset and follow the standard Hugging Face fine-tuning workflow, as sketched below.
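As a concrete illustration, here is a minimal sketch of fine-tuning the checkpoint for token classification with the `Trainer` API. The label scheme, toy dataset, and training settings below are placeholders for illustration only, not the configuration used by the model authors.

```python
from datasets import Dataset
from transformers import (
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Hypothetical label scheme; replace with the labels of your own task.
labels = ["O", "B-ENTITY", "I-ENTITY"]

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base", max_len=512)
model = AutoModelForTokenClassification.from_pretrained(
    "EMBO/bio-lm", num_labels=len(labels)
)

# Toy dataset standing in for a real token-classification corpus.
raw = Dataset.from_dict({
    "tokens": [["GFP", "was", "expressed", "in", "HeLa", "cells"]],
    "ner_tags": [[1, 0, 0, 0, 1, 2]],
})

def tokenize_and_align(example):
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    # Label only the first sub-token of each word; ignore the rest (-100).
    aligned, previous = [], None
    for wid in enc.word_ids():
        if wid is None or wid == previous:
            aligned.append(-100)
        else:
            aligned.append(example["ner_tags"][wid])
        previous = wid
    enc["labels"] = aligned
    return enc

train_dataset = raw.map(tokenize_and_align, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bio-lm-tokcls", num_train_epochs=1),
    train_dataset=train_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

In practice you would swap in your own annotated corpus and label list, and add evaluation and checkpointing arguments as needed.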
## Documentation

### Model description

This model is a [RoBERTa base pre-trained model](https://huggingface.co/roberta-base) that was further trained with a masked language modeling objective on the BioLang dataset, a compendium of English scientific text examples from the life sciences.
### Intended uses & limitations

#### How to use

The intended use of this model is to be fine-tuned for downstream tasks, token classification in particular.

#### Limitations and bias

This model should be fine-tuned on a specific task like token classification. The model must be used with the `roberta-base` tokenizer.
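For reference, a minimal way to pair the checkpoint with the required `roberta-base` tokenizer is shown below; the masked language modeling head is used here, and a task-specific head such as `AutoModelForTokenClassification` would replace it when fine-tuning.

```python
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

# Load the roberta-base tokenizer explicitly, as required above.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base", max_len=512)
model = RobertaForMaskedLM.from_pretrained("EMBO/bio-lm")
```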
### Training data

The model was trained with a masked language modeling task on the BioLang dataset, which includes about 12 million examples from abstracts and figure legends extracted from papers published in the life sciences.
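If you want to inspect the training corpus yourself, a minimal sketch for loading it with the `datasets` library is shown below; the `"MLM"` configuration name is taken from the training command in the next section and may differ from the configurations currently published on the dataset page.

```python
from datasets import load_dataset

# "MLM" mirrors the configuration name used in the training command below;
# check the EMBO/biolang dataset card for the configurations actually available.
biolang = load_dataset("EMBO/biolang", "MLM")
print(biolang)
```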
### Training procedure

The training was run on an NVIDIA DGX Station with 4 × Tesla V100 GPUs.

Training code is available at https://github.com/source-data/soda-roberta.
- Command: `python -m lm.train /data/json/oapmc_abstracts_figs/ MLM`
- Tokenizer vocab size: 50,265
- Training data: EMBO/biolang MLM
- Training set: 12,005,390 examples
- Evaluation set: 36,713 examples
- Epochs: 3.0
- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- TensorBoard run: lm-MLM-2021-01-27T15-17-43.113766

End of training:
- Training set: loss 0.8653350830078125
- Validation set: eval_loss 0.8192330598831177, eval_recall 0.8154601116513597
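For readers who want to reproduce a comparable setup, the hyperparameters listed above map onto `transformers` `TrainingArguments` roughly as follows. This is a sketch, not the authors' exact training script (which lives in the soda-roberta repository), and the output directory is a placeholder.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./bio-lm-mlm",          # placeholder path
    num_train_epochs=3.0,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=5e-05,
    weight_decay=0.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    max_grad_norm=1.0,
)
```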
### Eval results

Evaluation on the test set:
- recall: 0.814471959728645
## Technical Details

- Hardware: NVIDIA DGX Station with 4 × Tesla V100 GPUs.
- Training task: masked language modeling.
- Dataset: BioLang, with about 12 million examples from life-science papers.
- Training parameters: epochs, batch size, learning rate, and the other hyperparameters are listed in the training procedure section above.