DrBERT 7GB
DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains
In recent years, pre-trained language models have shown excellent performance in natural language processing. DrBERT is a specialized pre-trained model in French for the biomedical and clinical fields, offering high-quality solutions for related tasks.
Quick Start
In recent years, pre-trained language models (PLMs) have achieved remarkable performance across a wide range of natural language processing (NLP) tasks. Initially, models were trained on general-domain data, but specialized ones have emerged to handle specific domains more effectively.
This paper presents an original study of PLMs in the medical domain for the French language. For the first time, it compares the performance of PLMs trained on both public web data and private data from healthcare establishments. Different learning strategies are also evaluated on a set of biomedical tasks.
Finally, the first specialized PLMs for the biomedical field in French, named DrBERT, are released, along with the largest corpus of medical data under a free license on which these models are trained.
Features
DrBERT models
DrBERT is a French RoBERTa trained on an open-source corpus of French medical crawled textual data called NACHOS. Models with different amounts of data from various public and private sources are trained using the CNRS (French National Centre for Scientific Research) [Jean Zay](http://www.idris.fr/jean-zay/) French supercomputer. To prevent any personal information leak and comply with European GDPR laws, only the weights of the models trained using exclusively open-source data are publicly released.
| Property | Details |
|---|---|
| Model Type | French RoBERTa |
| Training Data | NACHOS (open-source corpus of French medical crawled textual data) |
| Model name | Corpus | Number of layers | Attention Heads | Embedding Dimension | Sequence Length | Model URL |
|---|---|---|---|---|---|---|
| DrBERT-7-GB-cased-Large | NACHOS 7 GB | 24 | 16 | 1024 | 512 | HuggingFace |
| DrBERT-7-GB-cased | NACHOS 7 GB | 12 | 12 | 768 | 512 | HuggingFace |
| DrBERT-4-GB-cased | NACHOS 4 GB | 12 | 12 | 768 | 512 | HuggingFace |
| DrBERT-4-GB-cased-CP-CamemBERT | NACHOS 4 GB | 12 | 12 | 768 | 512 | HuggingFace |
| DrBERT-4-GB-cased-CP-PubMedBERT | NACHOS 4 GB | 12 | 12 | 768 | 512 | HuggingFace |
Usage Examples
Basic Usage
Loading the model and tokenizer:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Dr-BERT/DrBERT-7GB")
model = AutoModel.from_pretrained("Dr-BERT/DrBERT-7GB")
```
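Building on the snippet above, here is a minimal sketch (the French sentence is only an illustrative example, not part of the original card) showing how to obtain contextual embeddings from the loaded model:

```python
import torch

# Assumes `tokenizer` and `model` from the previous snippet are already loaded.
inputs = tokenizer("La patiente souffre d'hypertension.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per token for the DrBERT-7GB base model.
print(outputs.last_hidden_state.shape)
```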
Performing the mask-filling task:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="Dr-BERT/DrBERT-7GB", tokenizer="Dr-BERT/DrBERT-7GB")
results = fill_mask("La patiente est atteinte d'une <mask>")
```
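The pipeline returns a list of candidate completions; as a small illustrative follow-up, each entry can be inspected like this:

```python
# Each prediction carries the completed sequence, the predicted token and a score.
for prediction in results:
    print(prediction["token_str"], prediction["score"])
```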
Installation
Install the dependencies:

```text
accelerate @ git+https://github.com/huggingface/accelerate@66edfe103a0de9607f9b9fdcf6a8e2132486d99b
datasets==2.6.1
sentencepiece==0.1.97
protobuf==3.20.1
evaluate==0.2.2
tensorboard==2.11.0
torch >= 1.3
```
Download the NACHOS dataset text file

Download the full NACHOS dataset from Zenodo and place it in the `from_scratch` or `continued_pretraining` directory.
Build your own tokenizer from scratch based on NACHOS

Note: This step is required only for from-scratch pre-training. If you want to do continued pre-training, you just need to download the model and the tokenizer corresponding to the model you want to continue training from. In this case, go to the HuggingFace Hub, select a model (e.g., [RoBERTa-base](https://huggingface.co/roberta-base)), download the entire model/tokenizer repository by clicking on the `Use In Transformers` button, and get the Git link: `git clone https://huggingface.co/roberta-base`.

Build the tokenizer from scratch on your data in the file `./corpus.txt` using `./build_tokenizer.sh`.
Preprocessing and tokenization of the dataset

First, replace the field `tokenizer_path` in the shell script with the path of your tokenizer directory, either downloaded using HuggingFace Git or built yourself.

Run `./preprocessing_dataset.sh` to generate the tokenized dataset using the given tokenizer.
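For illustration, here is a hedged Python sketch of what this preprocessing step amounts to with the pinned `datasets` library; the paths are placeholders, the 512-token limit comes from the tables above, and the real logic is in `preprocessing_dataset.sh`:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Illustrative sketch: tokenize the raw corpus and save it to disk for training.
tokenizer = AutoTokenizer.from_pretrained("./my_tokenizer")  # placeholder path
dataset = load_dataset("text", data_files={"train": "./corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized.save_to_disk("./tokenized_dataset")
```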
Model training

First, change the number of GPUs `--ntasks=128` in the shell script `run_training.sh` to match your computational capabilities. In this case, 128 V100 32 GB GPUs from 32 nodes of 4 GPUs (`--ntasks-per-node=4` and `--gres=gpu:4`) were used for 20 hours (`--time=20:00:00`).

If you are using Jean Zay, also change the `-A` flag to match one of your `@gpu` profiles capable of running the job. Move ALL of your datasets, tokenizer, script, and outputs to the `$SCRATCH` disk space to prevent other users from experiencing IO issues.
Pre-training from scratch

Once the SLURM parameters are updated, change the name of the model architecture in the flag `--model_type="camembert"` and update `--config_overrides=` according to the specifications of the architecture you are training. In this case, RoBERTa had a sequence length of `514`, a vocabulary of `32005` tokens (32K tokens of the tokenizer and 5 of the model architecture), and the identifiers of the beginning-of-sentence (BOS) and end-of-sentence (EOS) tokens are `5` and `6`, respectively.
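For reference, the same architecture specifications can be expressed as a `transformers` configuration object; this is only an illustrative Python sketch of the values that `--config_overrides=` carries, not the training script itself:

```python
from transformers import RobertaConfig

# Illustrative sketch of the architecture specifications described above.
config = RobertaConfig(
    vocab_size=32005,             # 32K tokenizer tokens + 5 architecture tokens
    max_position_embeddings=514,  # RoBERTa sequence length
    bos_token_id=5,               # beginning-of-sentence token id
    eos_token_id=6,               # end-of-sentence token id
)
print(config)
```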
Then, go to the `./from_scratch/` directory.

Run `sbatch ./run_training.sh` to send the training job to the SLURM queue.
Continue pre-training

Once the SLURM parameters are updated, change the path of the model/tokenizer you want to start from (`--model_name_or_path=` / `--tokenizer_name=`) to the path of the model downloaded from HuggingFace's Git in the tokenizer step above.

Then, go to the `./continued_pretraining/` directory.

Run `sbatch ./run_training.sh` to send the training job to the SLURM queue.
Documentation
Fine-tuning on a downstream task

You just need to change the name of the model to `Dr-BERT/DrBERT-7GB` in any of the examples provided by HuggingFace's team here.
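As an illustrative sketch only (the sequence-classification head and `num_labels=2` are placeholders, not part of the original examples), loading DrBERT for fine-tuning looks like this; the rest follows any standard HuggingFace fine-tuning script:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder task head: replace the model class and num_labels with
# whatever your downstream task requires.
tokenizer = AutoTokenizer.from_pretrained("Dr-BERT/DrBERT-7GB")
model = AutoModelForSequenceClassification.from_pretrained(
    "Dr-BERT/DrBERT-7GB", num_labels=2
)
```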
License

The project is licensed under the Apache-2.0 license.
Citation
@inproceedings{labrak2023drbert,
title = {{DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains}},
author = {Labrak, Yanis and Bazoge, Adrien and Dufour, Richard and Rouvier, Mickael and Morin, Emmanuel and Daille, Béatrice and Gourraud, Pierre-Antoine},
booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL'23), Long Paper},
month = july,
year = 2023,
address = {Toronto, Canada},
publisher = {Association for Computational Linguistics}
}

