🚀 GPT-fr: A French GPT Model
GPT-fr is a GPT model tailored for the French language, developed by Quantmetry and the Laboratoire de Linguistique Formelle (LLF). This model is trained on an extensive and diverse French corpus. The weights are released for the following configurations:
✨ Features
Model Configurations
Model name |
Number of layers |
Attention Heads |
Embedding Dimension |
Total Parameters |
gpt-fr-cased-small |
12 |
12 |
768 |
124 M |
gpt-fr-cased-base |
24 |
14 |
1,792 |
1,017 B |
Intended Uses
The model can be applied to language generation tasks. Moreover, many tasks can be formatted to directly generate outputs in natural language, such as automatic summarization or question answering. It is suitable for both academic and industrial applications.
📦 Installation
This model can be used through the Transformers
library. You can install the necessary libraries using the following command:
pip install transformers
💻 Usage Examples
Basic Usage
The model can be used via the Transformers
library as follows:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained("asi/gpt-fr-cased-small")
tokenizer = GPT2Tokenizer.from_pretrained("asi/gpt-fr-cased-small")
model.eval()
input_sentence = "Longtemps je me suis couché de bonne heure."
input_ids = tokenizer.encode(input_sentence, return_tensors='pt')
beam_outputs = model.generate(
input_ids,
max_length=100,
do_sample=True,
top_k=50,
top_p=0.95,
num_return_sequences=1
)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_outputs[0], skip_special_tokens=True))
📚 Documentation
Model Index
Model Name |
Task Type |
Dataset |
Metrics |
Value |
asi/gpt-fr-cased-base |
Text Generation |
Wikitext-fr |
Perplexity |
109.2 |
asi/gpt-fr-cased-base |
Text Classification |
CLS-Books |
Accuracy |
88.3 |
asi/gpt-fr-cased-base |
Text Classification |
CLS-Dvd |
Accuracy |
86.9 |
asi/gpt-fr-cased-base |
Text Classification |
CLS-Music |
Accuracy |
89.3 |
asi/gpt-fr-cased-base |
Text Classification |
PAWS-X |
Accuracy |
83.3 |
asi/gpt-fr-cased-base |
Text Classification |
XNLI |
Accuracy |
75.6 |
asi/gpt-fr-cased-base |
Summarization |
OrangeSum-Abstract |
ROUGE-1 |
17.5 |
asi/gpt-fr-cased-base |
Summarization |
OrangeSum-Abstract |
ROUGE-2 |
3.1 |
asi/gpt-fr-cased-base |
Summarization |
OrangeSum-Abstract |
ROUGE-L |
12.1 |
asi/gpt-fr-cased-base |
Summarization |
OrangeSum-Title |
ROUGE-1 |
13.9 |
asi/gpt-fr-cased-base |
Summarization |
OrangeSum-Title |
ROUGE-2 |
2.3 |
asi/gpt-fr-cased-base |
Summarization |
OrangeSum-Title |
ROUGE-L |
9.7 |
Limitations and Bias
Large language models often replicate biases present in pre-training datasets, such as gender discrimination or the generation of offensive content.
To reduce exposure to explicit material, we carefully select data sources in advance. This process, detailed in our paper, aims to limit the model's generation of offensive content without manual and arbitrary filtering.
However, some societal biases in the data may still be reflected in the model. For example, when generating the sentence sequence "Ma femme/Mon mari vient d'obtenir un nouveau poste. A partir de demain elle/il sera _______", the model generates different positions based on gender. We used a top-k random sampling strategy with k = 50 and stopped at the first punctuation mark. The position generated for the wife is 'femme de ménage de la maison', while that for the husband is 'à la tête de la police'. We welcome your feedback to better assess these effects qualitatively and quantitatively.
Training Data
We created a dedicated corpus for training this generative model. The model uses a fixed-length context size of 1,024 and requires long documents for training. We aggregated existing corpora, including Wikipedia, OpenSubtitle (Tiedemann, 2012), and Gutenberg. The corpora are filtered and split into sentences, and consecutive sentences are concatenated within the limit of 1,024 tokens per document.
Training Procedure
We pre-trained the model on a TPU v2 - 8 using the Google Colab inter - server.
Eval Results
We evaluated GPT-fr using a dedicated language model evaluation benchmark. Similar to the English WikiText benchmark, we collected over 70 million tokens from verified good and featured articles on French Wikipedia. The model achieves a zero - shot perplexity of 109.2 on the test set.
📄 License
This project is licensed under the apache-2.0
license.
BibTeX Entry and Citation Info
Along with the model hosted by the HuggingFace transformers library, we maintain a git repository. If you use GPT-fr in your scientific publications or industrial applications, please cite the following paper:
@inproceedings{simoulin:hal-03265900,
TITLE = {{Un mod{\`e}le Transformer G{\'e}n{\'e}ratif Pr{\'e}-entrain{\'e} pour le \_\_\_\_\_\_ fran{\c c}ais}},
AUTHOR = {Simoulin, Antoine and Crabb{\'e}, Benoit},
URL = {https://hal.archives-ouvertes.fr/hal-03265900},
BOOKTITLE = {{Traitement Automatique des Langues Naturelles}},
ADDRESS = {Lille, France},
EDITOR = {Denis, Pascal and Grabar, Natalia and Fraisse, Amel and Cardon, R{\'e}mi and Jacquemin, Bernard and Kergosien, Eric and Balvet, Antonio},
PUBLISHER = {{ATALA}},
PAGES = {246-255},
YEAR = {2021},
KEYWORDS = {fran{\c c}ais. ; GPT ; G{\'e}n{\'e}ratif ; Transformer ; Pr{\'e}-entra{\^i}n{\'e}},
PDF = {https://hal.archives-ouvertes.fr/hal-03265900/file/7.pdf},
HAL_ID = {hal-03265900},
HAL_VERSION = {v1},
}
References
Jörg Tiedemann: Parallel Data, Tools and Interfaces in OPUS. LREC 2012: 2214 - 2218