Camembert-bio-base Open-source Language Model - Free Support for French Biomedical Named Entity Recognition

Camembert Bio Base

Developed by almanach

CamemBERT-bio is a language model optimized for the French biomedical domain. It is built from camembert-base by continual pre-training and outperforms that base model on biomedical named entity recognition tasks.

Large Language Model

Transformers

Language: French
License: MIT (open source)
Tags: French biomedical NER, Clinical text analysis, Drug label processing

Downloads 6,029

Release date: 2/23/2023

Model Overview

CamemBERT-bio is a French biomedical language model. Continual pre-training on a large-scale French biomedical corpus improves its biomedical named entity recognition performance over camembert-base by 2.54 F1 points on average.

Model Features

Optimized for professional fields

Designed specifically for the French biomedical domain, it outperforms the base model on every biomedical named entity recognition dataset evaluated.

Trained with rich corpus

Trained on a large-scale (413M-word) French biomedical corpus covering scientific literature, drug leaflets, and clinical cases.

Efficient training

Continual pre-training from an existing checkpoint costs less compute than training a model from scratch.

Model Capabilities

French biomedical text understanding

Biomedical named entity recognition

Clinical document information extraction

Use Cases

Clinical research

Medical report information extraction

Extract information from unstructured documents in the hospital's clinical data warehouse to support clinical research

The F1 score increased by 2.22 to 2.64 points on the three clinical datasets (CAS1, CAS2, E3C); the 2.54-point figure is the average over all five datasets

Drug information processing

Drug label analysis

Extract key information from drug labels

The F1 score reached 76.71 on the EMEA dataset

Scientific literature processing

Biomedical literature analysis

Process and analyze French biomedical scientific literature

The F1 score reached 68.47 on the MEDLINE dataset

🚀 CamemBERT-bio : a Tasty French Language Model Better for your Health

CamemBERT-bio is a state-of-the-art French biomedical language model, built by continual pre-training from camembert-base. Trained on a public French biomedical corpus of 413M words containing scientific documents, drug leaflets, and clinical cases, it outperforms camembert-base by 2.54 F1 points on average across 5 different biomedical named entity recognition tasks.
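As a quick way to try the model, the sketch below queries it for masked-token predictions through the Hugging Face `transformers` fill-mask pipeline. The Hub identifier `almanach/camembert-bio-base` and the example sentence are assumptions made for illustration, not taken from this page; the helper name `top_predictions` is hypothetical.

```python
# Assumption: the checkpoint is published on the Hugging Face Hub under this ID.
MODEL_ID = "almanach/camembert-bio-base"

# Hypothetical example sentence; CamemBERT tokenizers use "<mask>" as mask token.
EXAMPLE = "Le patient est traité par <mask> pour une hypertension artérielle."


def top_predictions(text: str = EXAMPLE, k: int = 5) -> list[tuple[str, float]]:
    """Return the k most likely fillers for the masked token with their scores."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    # With a single mask, the pipeline returns a list of prediction dicts.
    return [(p["token_str"], p["score"]) for p in fill_mask(text, top_k=k)]
```

Calling `top_predictions()` downloads the weights on first use. For named entity recognition the checkpoint still has to be fine-tuned on labeled data, as described in the evaluation section.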

✨ Features

Developed by Rian Touchent and Eric Villemonte de La Clergerie.
Logo designed by Alix Chagué.
Licensed under the MIT license.

📚 Documentation

Abstract

Clinical data in hospitals are increasingly accessible for research through clinical data warehouses. However, these documents are unstructured, so information must be extracted from medical reports before clinical studies can use it. Transfer learning with BERT-like models such as CamemBERT has enabled significant progress, especially in named entity recognition. However, these models are trained on general-domain language and are less effective on biomedical data. We therefore propose a new public French biomedical dataset on which we continue the pre-training of CamemBERT. The result is the first version of CamemBERT-bio, a specialized public model for the French biomedical domain, which shows an average improvement of 2.54 points in F1 score across different biomedical named entity recognition tasks.

🔧 Technical Details

Training Details

Training Data

Corpus	Details
ISTEX	Diverse scientific literature indexed on ISTEX
CLEAR	Drug leaflets
E3C	Various documents from journals, drug leaflets, and clinical cases
Total	413M words

Training Procedure

We used continual pre-training from camembert-base. The model was trained with the Masked Language Modeling (MLM) objective and Whole Word Masking for 50k steps, which took 39 hours on two Tesla V100 GPUs.
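To illustrate what Whole Word Masking changes relative to masking individual sub-word tokens, here is a minimal, simplified sketch. It assumes SentencePiece-style tokens in which a leading "▁" marks the start of a word, a 15% masking rate, and plain replacement by the mask token; none of these details are given in this section, and this is not the authors' training code.

```python
import random

MASK = "<mask>"
WORD_START = "\u2581"  # SentencePiece word-boundary marker "▁" (assumed convention)


def group_into_words(tokens: list[str]) -> list[list[int]]:
    """Group token indices so that each group covers one whole word."""
    words: list[list[int]] = []
    for i, tok in enumerate(tokens):
        if tok.startswith(WORD_START) or not words:
            words.append([i])      # a new word starts here
        else:
            words[-1].append(i)    # continuation piece of the current word
    return words


def whole_word_mask(tokens: list[str], mask_prob: float = 0.15, seed: int = 0):
    """Mask every sub-word piece of each randomly selected word.

    Returns (masked_tokens, labels). labels holds the original token at
    masked positions and None elsewhere. mask_prob=0.15 is an assumption.
    """
    words = group_into_words(tokens)
    masked = list(tokens)
    labels: list = [None] * len(tokens)
    if not words:
        return masked, labels
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(words) * mask_prob))
    for word in rng.sample(words, n_to_mask):
        for i in word:
            labels[i] = tokens[i]
            masked[i] = MASK
    return masked, labels
```

With tokens such as `["▁hyper", "tension"]`, both pieces are always masked together, so the model cannot recover a word from its remaining fragments.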

Evaluation

Fine-tuning

For fine-tuning, we used Optuna to select hyperparameters. The learning rate was set to 5e-5, with a warmup ratio of 0.224 and a batch size of 16. The fine-tuning process lasted for 2000 steps. A simple linear layer was added on top of the model for prediction, and none of the CamemBERT layers were frozen during fine-tuning.

Scoring

We used the seqeval tool in strict mode with the IOB2 scheme to evaluate the model's performance. For each evaluation, the best fine-tuned model on the validation set was selected to calculate the final score on the test set. To ensure reliability, we averaged the results over 10 evaluations with different seeds.
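To make the scoring protocol concrete, the following sketch re-implements entity-level scoring under the IOB2 scheme in plain Python and then summarizes several runs. It is a simplified stand-in for seqeval, not seqeval itself, and may differ from it on edge cases. The entity labels `DISO` and `CHEM` used in the usage note are hypothetical, and the use of micro-averaging and of the sample standard deviation are assumptions.

```python
from statistics import mean, stdev


def extract_entities(tags: list[str]) -> set[tuple[str, int, int]]:
    """Return (type, start, end) spans from one IOB2 sequence; end is exclusive.

    An entity must open with B-. An I- tag that does not continue an open
    entity of the same type is ignored (strict reading of IOB2).
    """
    entities = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        prefix, _, label = tag.partition("-")
        if prefix == "I" and etype is not None and label == etype:
            continue  # still inside the current entity
        if etype is not None:
            entities.add((etype, start, i))
            start, etype = None, None
        if prefix == "B":
            start, etype = i, label
    return entities


def strict_scores(gold_sents, pred_sents) -> tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over exact span matches."""
    tp = n_pred = n_gold = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


def summarize(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across runs with different seeds."""
    return mean(scores), stdev(scores)
```

For example, with gold tags `["B-DISO", "I-DISO", "O", "B-CHEM"]` and predicted tags `["B-DISO", "I-DISO", "O", "O"]`, one of two gold entities is found exactly, so precision is 1.0 and recall is 0.5. A prediction that covers only part of an entity earns no credit.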

Results

Style	Dataset	Metric	CamemBERT	CamemBERT-bio
Clinical	CAS1	F1	70.50 ± 1.75	73.03 ± 1.29
Clinical	CAS1	P	70.12 ± 1.93	71.71 ± 1.61
Clinical	CAS1	R	70.89 ± 1.78	74.42 ± 1.49
Clinical	CAS2	F1	79.02 ± 0.92	81.66 ± 0.59
Clinical	CAS2	P	77.30 ± 1.36	80.96 ± 0.91
Clinical	CAS2	R	80.83 ± 0.96	82.37 ± 0.69
Clinical	E3C	F1	67.63 ± 1.45	69.85 ± 1.58
Clinical	E3C	P	78.19 ± 0.72	79.11 ± 0.42
Clinical	E3C	R	59.61 ± 2.25	62.56 ± 2.50
Drug leaflets	EMEA	F1	74.14 ± 1.95	76.71 ± 1.50
Drug leaflets	EMEA	P	74.62 ± 1.97	76.92 ± 1.96
Drug leaflets	EMEA	R	73.68 ± 2.22	76.52 ± 1.62
Scientific	MEDLINE	F1	65.73 ± 0.40	68.47 ± 0.54
Scientific	MEDLINE	P	64.94 ± 0.82	67.77 ± 0.88
Scientific	MEDLINE	R	66.56 ± 0.56	69.21 ± 1.32
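The headline figure of 2.54 points can be checked directly from the F1 rows of the table. The short calculation below copies the mean F1 values for both models and averages the per-dataset gains; variable names are illustrative.

```python
from statistics import mean

# Mean F1 per dataset as (CamemBERT, CamemBERT-bio), copied from the table.
F1 = {
    "CAS1": (70.50, 73.03),
    "CAS2": (79.02, 81.66),
    "E3C": (67.63, 69.85),
    "EMEA": (74.14, 76.71),
    "MEDLINE": (65.73, 68.47),
}

# Per-dataset gain of CamemBERT-bio over the base model, in F1 points.
gains = {name: round(bio - base, 2) for name, (base, bio) in F1.items()}

# Unweighted average over the five datasets.
average_gain = round(mean(gains.values()), 2)
```

The gains are 2.53, 2.64, 2.22, 2.57, and 2.74 points, which average to 2.54, matching the reported improvement.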

Environmental Impact estimation

Hardware Type: 2 x Tesla V100
Hours used: 39 hours
Provider: INRIA clusters
Compute Region: Paris, France
Carbon Emitted: 0.84 kg CO2 eq.
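For comparison with other training runs, the figures above can be normalized. This sketch assumes the 39 hours are wall-clock time on both GPUs at once (so 78 GPU-hours in total) and that the 0.84 kg estimate covers the whole run; the page does not state how the estimate was produced, and the function names are hypothetical.

```python
GPUS = 2                 # Tesla V100 cards
WALL_CLOCK_HOURS = 39    # assumed to be wall-clock time for the full run
TRAINING_STEPS = 50_000  # from the training procedure
CARBON_KG = 0.84         # reported estimate, kg CO2 eq.


def gpu_hours(gpus: int = GPUS, hours: float = WALL_CLOCK_HOURS) -> float:
    """Total accelerator time consumed by the run."""
    return gpus * hours


def grams_co2_per_gpu_hour(carbon_kg: float = CARBON_KG) -> float:
    """Reported emissions spread evenly over all GPU-hours, in grams."""
    return 1000 * carbon_kg / gpu_hours()


def steps_per_hour(steps: int = TRAINING_STEPS,
                   hours: float = WALL_CLOCK_HOURS) -> float:
    """Average optimizer steps completed per wall-clock hour."""
    return steps / hours
```

Under these assumptions the run used 78 GPU-hours, at roughly 10.8 g CO2 eq. per GPU-hour and about 1,282 steps per hour.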

📄 License

This project is licensed under the MIT license.

📖 Citation information

@inproceedings{touchent-de-la-clergerie-2024-camembert-bio,
    title = "{C}amem{BERT}-bio: Leveraging Continual Pre-training for Cost-Effective Models on {F}rench Biomedical Data",
    author = "Touchent, Rian  and
      de la Clergerie, {\'E}ric",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.241",
    pages = "2692--2701",
    abstract = "Clinical data in hospitals are increasingly accessible for research through clinical data warehouses. However these documents are unstructured and it is therefore necessary to extract information from medical reports to conduct clinical studies. Transfer learning with BERT-like models such as CamemBERT has allowed major advances for French, especially for named entity recognition. However, these models are trained for plain language and are less efficient on biomedical data. Addressing this gap, we introduce CamemBERT-bio, a dedicated French biomedical model derived from a new public French biomedical dataset. Through continual pre-training of the original CamemBERT, CamemBERT-bio achieves an improvement of 2.54 points of F1-score on average across various biomedical named entity recognition tasks, reinforcing the potential of continual pre-training as an equally proficient yet less computationally intensive alternative to training from scratch. Additionally, we highlight the importance of using a standard evaluation protocol that provides a clear view of the current state-of-the-art for French biomedical models.",
}

@inproceedings{touchent:hal-04130187,
  TITLE = {{CamemBERT-bio : Un mod{\`e}le de langue fran{\c c}ais savoureux et meilleur pour la sant{\'e}}},
  AUTHOR = {Touchent, Rian and Romary, Laurent and De La Clergerie, Eric},
  URL = {https://hal.science/hal-04130187},
  BOOKTITLE = {{18e  Conf{\'e}rence en Recherche d'Information et Applications \\ 16e Rencontres Jeunes Chercheurs en RI \\ 30e Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles \\ 25e Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues}},
  ADDRESS = {Paris, France},
  EDITOR = {Servan, Christophe and Vilnat, Anne},
  PUBLISHER = {{ATALA}},
  PAGES = {323--334},
  YEAR = {2023},
  KEYWORDS = {comptes rendus m{\'e}dicaux ; TAL clinique ; CamemBERT ; extraction d'information ; biom{\'e}dical ; reconnaissance d'entit{\'e}s nomm{\'e}es},
  HAL_ID = {hal-04130187},
  HAL_VERSION = {v1},
}
