FlauBERT: Unsupervised Language Model Pre-training for French
FlauBERT is a French BERT trained on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer.
Along with FlauBERT comes FLUE, an evaluation setup for French NLP systems similar to the popular GLUE benchmark. The goal is to enable further reproducible experiments in the future and to share models and progress on the French language. For more details, please refer to the official website.
Features
- FlauBERT is a pre-trained French language model, trained on a large and diverse French corpus.
- It comes with the FLUE evaluation setup, facilitating reproducible experiments in French NLP.
Installation
FlauBERT is used through the Hugging Face `transformers` library together with PyTorch; the dependencies used in the examples below can be installed with `pip install transformers torch`.
Usage Examples
Basic Usage
```python
import torch
from transformers import FlaubertModel, FlaubertTokenizer

# Choose the model; the available sizes are listed under "Advanced Usage" below
modelname = 'flaubert/flaubert_base_cased'

# Load the pre-trained model (output_loading_info=True also returns loading details)
flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)
# do_lowercase=False keeps the original casing (use True for uncased models)
flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)

sentence = "Le chat mange une pomme."
token_ids = torch.tensor([flaubert_tokenizer.encode(sentence)])

# The first output is the hidden states of the last layer
last_layer = flaubert(token_ids)[0]
print(last_layer.shape)  # torch.Size([1, sequence_length, embedding_dimension])

# The first token's embedding can serve as a sentence representation
cls_embedding = last_layer[:, 0, :]
```
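For flaubert/flaubert_base_cased, last_layer has shape (1, T, 768), where T is the number of tokens produced by the tokenizer and 768 is the embedding dimension of the base models (see the table below); cls_embedding is therefore a (1, 768) tensor.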
Advanced Usage
The modelname variable above can be set to any of the four FlauBERT variants: flaubert-small-cased, flaubert-base-uncased, flaubert-base-cased, flaubert-large-cased (see the sketch below).
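A minimal sketch of switching between variants; it assumes the corresponding Hugging Face Hub identifiers use the flaubert/ prefix with underscores, as flaubert/flaubert_base_cased does in the basic example above:

```python
from transformers import FlaubertModel, FlaubertTokenizer

# Assumed Hub identifiers for the four variants (verify on the Hugging Face Hub)
variants = [
    'flaubert/flaubert_small_cased',
    'flaubert/flaubert_base_uncased',
    'flaubert/flaubert_base_cased',
    'flaubert/flaubert_large_cased',
]

modelname = variants[0]  # e.g. the small model, handy for quick debugging
flaubert = FlaubertModel.from_pretrained(modelname)
# Uncased models should be paired with do_lowercase=True
flaubert_tokenizer = FlaubertTokenizer.from_pretrained(
    modelname, do_lowercase='uncased' in modelname
)
```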
Documentation
FlauBERT models
| Model name | Number of layers | Attention heads | Embedding dimension | Total parameters |
| --- | --- | --- | --- | --- |
| flaubert-small-cased | 6 | 8 | 512 | 54 M |
| flaubert-base-uncased | 12 | 12 | 768 | 137 M |
| flaubert-base-cased | 12 | 12 | 768 | 138 M |
| flaubert-large-cased | 24 | 16 | 1024 | 373 M |
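The total parameter count of a loaded model can be checked against this table; a minimal sketch, assuming flaubert has been loaded as in the basic usage example above:

```python
# Count the parameters of the loaded FlauBERT model
n_params = sum(p.numel() for p in flaubert.parameters())
print(f"{n_params / 1e6:.0f} M parameters")  # roughly 138 M for flaubert-base-cased
```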
⚠️ Important Note
flaubert-small-cased is partially trained, so performance is not guaranteed. Consider using it for debugging purposes only.
License
This project is licensed under the MIT license.
References
If you use FlauBERT or the FLUE Benchmark for your scientific publication, or if you find the resources in this repository useful, please cite one of the following papers:
LREC paper
@InProceedings{le2020flaubert,
author = {Le, Hang and Vial, Lo\"{i}c and Frej, Jibril and Segonne, Vincent and Coavoux, Maximin and Lecouteux, Benjamin and Allauzen, Alexandre and Crabb\'{e}, Beno\^{i}t and Besacier, Laurent and Schwab, Didier},
title = {FlauBERT: Unsupervised Language Model Pre-training for French},
booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
month = {May},
year = {2020},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {2479--2490},
url = {https://www.aclweb.org/anthology/2020.lrec-1.302}
}
TALN paper
@inproceedings{le2020flaubert,
title = {FlauBERT: des mod{\`e}les de langue contextualis{\'e}s pr{\'e}-entra{\^\i}n{\'e}s pour le fran{\c{c}}ais},
author = {Le, Hang and Vial, Lo{\"\i}c and Frej, Jibril and Segonne, Vincent and Coavoux, Maximin and Lecouteux, Benjamin and Allauzen, Alexandre and Crabb{\'e}, Beno{\^\i}t and Besacier, Laurent and Schwab, Didier},
booktitle = {Actes de la 6e conf{\'e}rence conjointe Journ{\'e}es d'{\'E}tudes sur la Parole (JEP, 31e {\'e}dition), Traitement Automatique des Langues Naturelles (TALN, 27e {\'e}dition), Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (R{\'E}CITAL, 22e {\'e}dition). Volume 2: Traitement Automatique des Langues Naturelles},
pages = {268--278},
year = {2020},
organization = {ATALA}
}