DrBERT 7GB
DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains
In recent years, pre-trained language models have shown excellent performance in natural language processing. DrBERT is a specialized pre-trained model in French for the biomedical and clinical fields, offering high-quality solutions for related tasks.
Quick Start
In recent years, pre-trained language models (PLMs) have achieved remarkable performance across a wide range of natural language processing (NLP) tasks. Initially, models were trained on general-domain data, but specialized ones have emerged to handle specific domains more effectively.
This paper presents an original study of PLMs in the medical domain for the French language. For the first time, it compares the performance of PLMs trained on both public web data and private data from healthcare establishments. Different learning strategies are also evaluated on a set of biomedical tasks.
Finally, the first specialized PLMs for the biomedical field in French, named DrBERT, are released, along with the largest corpus of medical data under a free license on which these models are trained.
Features
DrBERT models
DrBERT is a French RoBERTa trained on an open-source corpus of French medical crawled textual data called NACHOS. Models with different amounts of data from various public and private sources are trained using the CNRS (French National Centre for Scientific Research) [Jean Zay](http://www.idris.fr/jean-zay/) French supercomputer. To prevent any personal information leak and comply with European GDPR laws, only the weights of the models trained using exclusively open-source data are publicly released.
| Property | Details |
|---|---|
| Model Type | French RoBERTa |
| Training Data | NACHOS (open-source corpus of French medical crawled textual data) |
| Model name | Corpus | Number of layers | Attention Heads | Embedding Dimension | Sequence Length | Model URL |
|---|---|---|---|---|---|---|
| DrBERT-7-GB-cased-Large | NACHOS 7 GB | 24 | 16 | 1024 | 512 | HuggingFace |
| DrBERT-7-GB-cased | NACHOS 7 GB | 12 | 12 | 768 | 512 | HuggingFace |
| DrBERT-4-GB-cased | NACHOS 4 GB | 12 | 12 | 768 | 512 | HuggingFace |
| DrBERT-4-GB-cased-CP-CamemBERT | NACHOS 4 GB | 12 | 12 | 768 | 512 | HuggingFace |
| DrBERT-4-GB-cased-CP-PubMedBERT | NACHOS 4 GB | 12 | 12 | 768 | 512 | HuggingFace |
Usage Examples
Basic Usage
Loading the model and tokenizer:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Dr-BERT/DrBERT-7GB")
model = AutoModel.from_pretrained("Dr-BERT/DrBERT-7GB")
```
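Building on the snippet above, here is a minimal sketch (the French sentence is only an illustrative example, not part of the original card) showing how to obtain contextual embeddings from the loaded model:

```python
import torch

# Assumes `tokenizer` and `model` from the previous snippet are already loaded.
inputs = tokenizer("La patiente souffre d'hypertension.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per token for the DrBERT-7GB base model.
print(outputs.last_hidden_state.shape)
```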
Performing the mask-filling task:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="Dr-BERT/DrBERT-7GB", tokenizer="Dr-BERT/DrBERT-7GB")
results = fill_mask("La patiente est atteinte d'une <mask>")
```
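The pipeline returns a list of candidate completions; as a small illustrative follow-up, each entry can be inspected like this:

```python
# Each prediction carries the completed sequence, the predicted token and a score.
for prediction in results:
    print(prediction["token_str"], prediction["score"])
```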
Installation
Install the dependencies:

```text
accelerate @ git+https://github.com/huggingface/accelerate@66edfe103a0de9607f9b9fdcf6a8e2132486d99b
datasets==2.6.1
sentencepiece==0.1.97
protobuf==3.20.1
evaluate==0.2.2
tensorboard==2.11.0
torch >= 1.3
```
Download the NACHOS dataset text file

Download the full NACHOS dataset from Zenodo and place it in the `from_scratch` or `continued_pretraining` directory.
Build your own tokenizer from scratch based on NACHOS

Note: This step is required only for from-scratch pre-training. If you want to do continued pre-training, you just need to download the model and the tokenizer corresponding to the model you want to continue training from. In this case, go to the HuggingFace Hub, select a model (e.g., [RoBERTa-base](https://huggingface.co/roberta-base)), download the entire model/tokenizer repository by clicking on the `Use In Transformers` button, and get the Git link: `git clone https://huggingface.co/roberta-base`.

Build the tokenizer from scratch on your data in the file `./corpus.txt` using `./build_tokenizer.sh`.
Preprocessing and tokenization of the dataset

First, replace the field `tokenizer_path` in the shell script with the path of your tokenizer directory, either downloaded using HuggingFace Git or built yourself.

Run `./preprocessing_dataset.sh` to generate the tokenized dataset using the given tokenizer.
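For illustration, here is a hedged Python sketch of what this preprocessing step amounts to with the pinned `datasets` library; the paths are placeholders, the 512-token limit comes from the tables above, and the real logic is in `preprocessing_dataset.sh`:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Illustrative sketch: tokenize the raw corpus and save it to disk for training.
tokenizer = AutoTokenizer.from_pretrained("./my_tokenizer")  # placeholder path
dataset = load_dataset("text", data_files={"train": "./corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized.save_to_disk("./tokenized_dataset")
```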
Model training

First, change the number of GPUs `--ntasks=128` in the shell script `run_training.sh` to match your computational capabilities. In this case, 128 V100 32 GB GPUs from 32 nodes of 4 GPUs (`--ntasks-per-node=4` and `--gres=gpu:4`) were used for 20 hours (`--time=20:00:00`).

If you are using Jean Zay, also change the `-A` flag to match one of your `@gpu` profiles capable of running the job. Move ALL of your datasets, tokenizer, script, and outputs to the `$SCRATCH` disk space to prevent other users from experiencing IO issues.
Pre-training from scratch

Once the SLURM parameters are updated, change the name of the model architecture in the flag `--model_type="camembert"` and update `--config_overrides=` according to the specifications of the architecture you are training. In this case, RoBERTa had a sequence length of `514`, a vocabulary of `32005` tokens (32K tokens of the tokenizer and 5 of the model architecture), and the identifiers of the beginning-of-sentence (BOS) and end-of-sentence (EOS) tokens are `5` and `6`, respectively.
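For reference, the same architecture specifications can be expressed as a `transformers` configuration object; this is only an illustrative Python sketch of the values that `--config_overrides=` carries, not the training script itself:

```python
from transformers import RobertaConfig

# Illustrative sketch of the architecture specifications described above.
config = RobertaConfig(
    vocab_size=32005,             # 32K tokenizer tokens + 5 architecture tokens
    max_position_embeddings=514,  # RoBERTa sequence length
    bos_token_id=5,               # beginning-of-sentence token id
    eos_token_id=6,               # end-of-sentence token id
)
print(config)
```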
Then, go to the `./from_scratch/` directory.

Run `sbatch ./run_training.sh` to send the training job to the SLURM queue.
Continue pre-training

Once the SLURM parameters are updated, change the path of the model/tokenizer you want to start from (`--model_name_or_path=` / `--tokenizer_name=`) to the path of the model downloaded from HuggingFace's Git in the tokenizer step above.

Then, go to the `./continued_pretraining/` directory.

Run `sbatch ./run_training.sh` to send the training job to the SLURM queue.
Documentation
Fine-tuning on a downstream task

You just need to change the name of the model to `Dr-BERT/DrBERT-7GB` in any of the examples provided by HuggingFace's team here.
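As an illustrative sketch only (the sequence-classification head and `num_labels=2` are placeholders, not part of the original examples), loading DrBERT for fine-tuning looks like this; the rest follows any standard HuggingFace fine-tuning script:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder task head: replace the model class and num_labels with
# whatever your downstream task requires.
tokenizer = AutoTokenizer.from_pretrained("Dr-BERT/DrBERT-7GB")
model = AutoModelForSequenceClassification.from_pretrained(
    "Dr-BERT/DrBERT-7GB", num_labels=2
)
```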
License

The project is licensed under the Apache-2.0 license.
Citation
@inproceedings{labrak2023drbert,
title = {{DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains}},
author = {Labrak, Yanis and Bazoge, Adrien and Dufour, Richard and Rouvier, Mickael and Morin, Emmanuel and Daille, Béatrice and Gourraud, Pierre-Antoine},
booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL'23), Long Paper},
month = july,
year = 2023,
address = {Toronto, Canada},
publisher = {Association for Computational Linguistics}
}

