gpt-fr-cased-small Open-Source French GPT Model - A Practical Language Tool Based on a Rich Corpus

Gpt Fr Cased Small

Developed by asi

GPT-fr is a French GPT model developed by Quantmetry and Laboratoire de Linguistique Formelle (LLF), trained on a large and diverse corpus of French texts.

Large Language Model FrenchOpen Source License:Apache-2.0 #French Text Generation #Multi-task Classification #Low Perplexity

Downloads 4,314

Release Time : 3/2/2022

Model Overview

GPT-fr is a French generative pre-trained Transformer model that can be used for various natural language processing tasks such as text generation, classification, and summarization.

Model Features

French Language Optimization

Specially trained and optimized for the characteristics of the French language.

Diverse Task Support

Supports various natural language processing tasks such as text generation, classification, and summarization.

Pre-trained Model

Pre-trained on a large-scale French corpus and can be directly used for downstream tasks.

Model Capabilities

French Text Generation

Text Classification

Summarization

Question Answering

Language Understanding

Use Cases

Content Generation

Article Continuation

Continue writing an article based on a given beginning.

Can generate coherent French text.

Text Classification

Product Review Classification

Classify reviews for books, DVDs, and music.

Accuracy of 86.9%-89.3%.

Summarization

News Summarization

Generate concise summaries of French news articles.

ROUGE-1 score of 17.5.

🚀 GPT-fr: A French GPT Model

GPT-fr is a GPT model tailored for the French language, developed by Quantmetry and the Laboratoire de Linguistique Formelle (LLF). This model is trained on an extensive and diverse French corpus. The weights are released for the following configurations:

✨ Features

Model Configurations

Model name	Number of layers	Attention Heads	Embedding Dimension	Total Parameters
`gpt-fr-cased-small`	12	12	768	124 M
`gpt-fr-cased-base`	24	14	1,792	1,017 B

Intended Uses

The model can be applied to language generation tasks. Moreover, many tasks can be formatted to directly generate outputs in natural language, such as automatic summarization or question answering. It is suitable for both academic and industrial applications.

📦 Installation

This model can be used through the Transformers library. You can install the necessary libraries using the following command:

pip install transformers

💻 Usage Examples

Basic Usage

The model can be used via the Transformers library as follows:

from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pretrained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("asi/gpt-fr-cased-small")
tokenizer = GPT2Tokenizer.from_pretrained("asi/gpt-fr-cased-small")

# Generate a sample of text
model.eval()
input_sentence = "Longtemps je me suis couché de bonne heure."
input_ids = tokenizer.encode(input_sentence, return_tensors='pt')

beam_outputs = model.generate(
    input_ids, 
    max_length=100, 
    do_sample=True,   
    top_k=50, 
    top_p=0.95, 
    num_return_sequences=1
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_outputs[0], skip_special_tokens=True))

📚 Documentation

Model Index

Model Name	Task Type	Dataset	Metrics	Value
`asi/gpt-fr-cased-base`	Text Generation	Wikitext-fr	Perplexity	109.2
`asi/gpt-fr-cased-base`	Text Classification	CLS-Books	Accuracy	88.3
`asi/gpt-fr-cased-base`	Text Classification	CLS-Dvd	Accuracy	86.9
`asi/gpt-fr-cased-base`	Text Classification	CLS-Music	Accuracy	89.3
`asi/gpt-fr-cased-base`	Text Classification	PAWS-X	Accuracy	83.3
`asi/gpt-fr-cased-base`	Text Classification	XNLI	Accuracy	75.6
`asi/gpt-fr-cased-base`	Summarization	OrangeSum-Abstract	ROUGE-1	17.5
`asi/gpt-fr-cased-base`	Summarization	OrangeSum-Abstract	ROUGE-2	3.1
`asi/gpt-fr-cased-base`	Summarization	OrangeSum-Abstract	ROUGE-L	12.1
`asi/gpt-fr-cased-base`	Summarization	OrangeSum-Title	ROUGE-1	13.9
`asi/gpt-fr-cased-base`	Summarization	OrangeSum-Title	ROUGE-2	2.3
`asi/gpt-fr-cased-base`	Summarization	OrangeSum-Title	ROUGE-L	9.7

Limitations and Bias

Large language models often replicate biases present in pre-training datasets, such as gender discrimination or the generation of offensive content.

To reduce exposure to explicit material, we carefully select data sources in advance. This process, detailed in our paper, aims to limit the model's generation of offensive content without manual and arbitrary filtering.

However, some societal biases in the data may still be reflected in the model. For example, when generating the sentence sequence "Ma femme/Mon mari vient d'obtenir un nouveau poste. A partir de demain elle/il sera _______", the model generates different positions based on gender. We used a top-k random sampling strategy with k = 50 and stopped at the first punctuation mark. The position generated for the wife is 'femme de ménage de la maison', while that for the husband is 'à la tête de la police'. We welcome your feedback to better assess these effects qualitatively and quantitatively.

Training Data

We created a dedicated corpus for training this generative model. The model uses a fixed-length context size of 1,024 and requires long documents for training. We aggregated existing corpora, including Wikipedia, OpenSubtitle (Tiedemann, 2012), and Gutenberg. The corpora are filtered and split into sentences, and consecutive sentences are concatenated within the limit of 1,024 tokens per document.

Training Procedure

We pre-trained the model on a TPU v2 - 8 using the Google Colab inter - server.

Eval Results

We evaluated GPT-fr using a dedicated language model evaluation benchmark. Similar to the English WikiText benchmark, we collected over 70 million tokens from verified good and featured articles on French Wikipedia. The model achieves a zero - shot perplexity of 109.2 on the test set.

📄 License

This project is licensed under the apache-2.0 license.

BibTeX Entry and Citation Info

Along with the model hosted by the HuggingFace transformers library, we maintain a git repository. If you use GPT-fr in your scientific publications or industrial applications, please cite the following paper:

@inproceedings{simoulin:hal-03265900,
  TITLE = {{Un mod{\`e}le Transformer G{\'e}n{\'e}ratif Pr{\'e}-entrain{\'e} pour le \_\_\_\_\_\_ fran{\c c}ais}},
  AUTHOR = {Simoulin, Antoine and Crabb{\'e}, Benoit},
  URL = {https://hal.archives-ouvertes.fr/hal-03265900},
  BOOKTITLE = {{Traitement Automatique des Langues Naturelles}},
  ADDRESS = {Lille, France},
  EDITOR = {Denis, Pascal and Grabar, Natalia and Fraisse, Amel and Cardon, R{\'e}mi and Jacquemin, Bernard and Kergosien, Eric and Balvet, Antonio},
  PUBLISHER = {{ATALA}},
  PAGES = {246-255},
  YEAR = {2021},
  KEYWORDS = {fran{\c c}ais. ; GPT ; G{\'e}n{\'e}ratif ; Transformer ; Pr{\'e}-entra{\^i}n{\'e}},
  PDF = {https://hal.archives-ouvertes.fr/hal-03265900/file/7.pdf},
  HAL_ID = {hal-03265900},
  HAL_VERSION = {v1},
}

References

Jörg Tiedemann: Parallel Data, Tools and Interfaces in OPUS. LREC 2012: 2214 - 2218

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご