FlauBERT: Unsupervised Language Model Pre-training for French
FlauBERT is a French BERT trained on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer.
Along with FlauBERT comes FLUE, an evaluation setup for French NLP systems similar to the popular GLUE benchmark. The goal is to enable further reproducible experiments in the future and to share models and progress on the French language. For more details, please refer to the official website.
Features
- FlauBERT is a pre-trained French language model, trained on a large and diverse French corpus.
- It comes with the FLUE evaluation setup, facilitating reproducible experiments in French NLP.
Installation
FlauBERT is used through the Hugging Face `transformers` library together with PyTorch; the dependencies used in the examples below can be installed with `pip install transformers torch`.
Usage Examples
Basic Usage
```python
import torch
from transformers import FlaubertModel, FlaubertTokenizer

# Choose the model; the available sizes are listed under "Advanced Usage" below
modelname = 'flaubert/flaubert_base_cased'

# Load the pre-trained model (output_loading_info=True also returns loading details)
flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)
# do_lowercase=False keeps the original casing (use True for uncased models)
flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)

sentence = "Le chat mange une pomme."
token_ids = torch.tensor([flaubert_tokenizer.encode(sentence)])

# The first output is the hidden states of the last layer
last_layer = flaubert(token_ids)[0]
print(last_layer.shape)  # torch.Size([1, sequence_length, embedding_dimension])

# The first token's embedding can serve as a sentence representation
cls_embedding = last_layer[:, 0, :]
```
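For flaubert/flaubert_base_cased, last_layer has shape (1, T, 768), where T is the number of tokens produced by the tokenizer and 768 is the embedding dimension of the base models (see the table below); cls_embedding is therefore a (1, 768) tensor.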
Advanced Usage
The modelname variable above can be set to any of the four FlauBERT variants: flaubert-small-cased, flaubert-base-uncased, flaubert-base-cased, flaubert-large-cased (see the sketch below).
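A minimal sketch of switching between variants; it assumes the corresponding Hugging Face Hub identifiers use the flaubert/ prefix with underscores, as flaubert/flaubert_base_cased does in the basic example above:

```python
from transformers import FlaubertModel, FlaubertTokenizer

# Assumed Hub identifiers for the four variants (verify on the Hugging Face Hub)
variants = [
    'flaubert/flaubert_small_cased',
    'flaubert/flaubert_base_uncased',
    'flaubert/flaubert_base_cased',
    'flaubert/flaubert_large_cased',
]

modelname = variants[0]  # e.g. the small model, handy for quick debugging
flaubert = FlaubertModel.from_pretrained(modelname)
# Uncased models should be paired with do_lowercase=True
flaubert_tokenizer = FlaubertTokenizer.from_pretrained(
    modelname, do_lowercase='uncased' in modelname
)
```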
Documentation
FlauBERT models
| Model name | Number of layers | Attention heads | Embedding dimension | Total parameters |
| --- | --- | --- | --- | --- |
| flaubert-small-cased | 6 | 8 | 512 | 54 M |
| flaubert-base-uncased | 12 | 12 | 768 | 137 M |
| flaubert-base-cased | 12 | 12 | 768 | 138 M |
| flaubert-large-cased | 24 | 16 | 1024 | 373 M |
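The total parameter count of a loaded model can be checked against this table; a minimal sketch, assuming flaubert has been loaded as in the basic usage example above:

```python
# Count the parameters of the loaded FlauBERT model
n_params = sum(p.numel() for p in flaubert.parameters())
print(f"{n_params / 1e6:.0f} M parameters")  # roughly 138 M for flaubert-base-cased
```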
⚠️ Important Note
flaubert-small-cased is partially trained, so performance is not guaranteed. Consider using it for debugging purposes only.
License
This project is licensed under the MIT license.
References
If you use FlauBERT or the FLUE Benchmark for your scientific publication, or if you find the resources in this repository useful, please cite one of the following papers:
LREC paper
@InProceedings{le2020flaubert,
author = {Le, Hang and Vial, Lo\"{i}c and Frej, Jibril and Segonne, Vincent and Coavoux, Maximin and Lecouteux, Benjamin and Allauzen, Alexandre and Crabb\'{e}, Beno\^{i}t and Besacier, Laurent and Schwab, Didier},
title = {FlauBERT: Unsupervised Language Model Pre-training for French},
booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
month = {May},
year = {2020},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {2479--2490},
url = {https://www.aclweb.org/anthology/2020.lrec-1.302}
}
TALN paper
@inproceedings{le2020flaubert,
title = {FlauBERT: des mod{\`e}les de langue contextualis{\'e}s pr{\'e}-entra{\^\i}n{\'e}s pour le fran{\c{c}}ais},
author = {Le, Hang and Vial, Lo{\"\i}c and Frej, Jibril and Segonne, Vincent and Coavoux, Maximin and Lecouteux, Benjamin and Allauzen, Alexandre and Crabb{\'e}, Beno{\^\i}t and Besacier, Laurent and Schwab, Didier},
booktitle = {Actes de la 6e conf{\'e}rence conjointe Journ{\'e}es d'{\'E}tudes sur la Parole (JEP, 31e {\'e}dition), Traitement Automatique des Langues Naturelles (TALN, 27e {\'e}dition), Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (R{\'E}CITAL, 22e {\'e}dition). Volume 2: Traitement Automatique des Langues Naturelles},
pages = {268--278},
year = {2020},
organization = {ATALA}
}