🚀 FrALBERT Base Cased
A model pretrained on French with a masked language modeling (MLM) objective. It learns a bidirectional representation of sentences and can be used for downstream tasks.
🚀 Quick Start
FrALBERT Base Cased is a model pretrained on French Wikipedia using masked language modeling and sentence ordering prediction. You can use it for masked language modeling or next sentence prediction, or fine-tune it for downstream tasks.
✨ Features
- Bidirectional Representation: Through masked language modeling, it can learn a bidirectional understanding of sentences.
- Shared Layers: It shares layers across its Transformer, resulting in a small memory footprint.
- SOP Objective: Uses Sentence Ordering Prediction to enhance the understanding of text ordering.
📦 Installation
This README does not provide specific installation steps; refer to the official Hugging Face documentation for installing the related libraries, such as transformers.
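A typical setup, assuming a standard Python environment, installs the transformers library plus the backend you prefer (PyTorch or TensorFlow):
pip install transformers
pip install torch  # or: pip install tensorflow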
💻 Usage Examples
Basic Usage
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='cservan/french-albert-base-cased')
>>> unmasker("Paris est la capitale de la [MASK] .")
[
{
"sequence": "paris est la capitale de la france.",
"score": 0.6231236457824707,
"token": 3043,
"token_str": "france"
},
{
"sequence": "paris est la capitale de la region.",
"score": 0.2993471622467041,
"token": 10531,
"token_str": "region"
},
{
"sequence": "paris est la capitale de la societe.",
"score": 0.02028230018913746,
"token": 24622,
"token_str": "societe"
},
{
"sequence": "paris est la capitale de la bretagne.",
"score": 0.012089950032532215,
"token": 24987,
"token_str": "bretagne"
},
{
"sequence": "paris est la capitale de la chine.",
"score": 0.010002839379012585,
"token": 14860,
"token_str": "chine"
}
]
Advanced Usage
Get features in PyTorch
from transformers import AlbertTokenizer, AlbertModel
tokenizer = AlbertTokenizer.from_pretrained('cservan/french-albert-base-cased')
model = AlbertModel.from_pretrained("cservan/french-albert-base-cased")
text = "Remplacez-moi par le texte en français que vous souhaitez."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
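The token-level features can then be read from the returned output object; a small follow-up to the snippet above, assuming the standard AlbertModel output fields:
features = output.last_hidden_state  # tensor of shape (batch_size, sequence_length, 768)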
Get features in TensorFlow
from transformers import AlbertTokenizer, TFAlbertModel
tokenizer = AlbertTokenizer.from_pretrained('cservan/french-albert-base-cased')
model = TFAlbertModel.from_pretrained("cservan/french-albert-base-cased")
text = "Remplacez-moi par le texte en français que vous souhaitez."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
📚 Documentation
Model description
FrALBERT is a transformers model pretrained on 16 GB of French Wikipedia in a self-supervised fashion. It was pretrained with two objectives: masked language modeling (MLM) and sentence ordering prediction (SOP). This way, it learns an inner representation of the French language that can be used for downstream tasks.
This model is distinctive in that it shares its layers across its Transformer, resulting in a small memory footprint. It is the second version of the base model, with the following configuration:
- 12 repeating layers
- 128 embedding dimension
- 768 hidden dimension
- 12 attention heads
- 11M parameters
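As a quick sanity check, these values can be read back from the published configuration; a minimal sketch using the standard AlbertConfig fields from transformers:
from transformers import AlbertConfig
config = AlbertConfig.from_pretrained('cservan/french-albert-base-cased')
# Expect 12 repeating layers, 128-dim embeddings, 768-dim hidden states and 12 attention heads
print(config.num_hidden_layers, config.embedding_size, config.hidden_size, config.num_attention_heads)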
Intended uses & limitations
You can use the raw model for masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on downstream tasks. It is mainly suitable for tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For text generation tasks, you should look at models like GPT-2.
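As an illustration of the fine-tuning route, the sketch below loads the pretrained encoder with a fresh sequence-classification head; the num_labels value and the example sentence are placeholders, not part of the original card:
from transformers import AlbertTokenizer, AlbertForSequenceClassification
tokenizer = AlbertTokenizer.from_pretrained('cservan/french-albert-base-cased')
# num_labels=2 stands in for a binary classification task
model = AlbertForSequenceClassification.from_pretrained('cservan/french-albert-base-cased', num_labels=2)
inputs = tokenizer("Ce film était excellent.", return_tensors='pt')
outputs = model(**inputs)  # outputs.logits feeds the task-specific loss during fine-tuning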
Training data
The FrALBERT model was pretrained on 4 GB of French Wikipedia (excluding lists, tables, and headers).
Training procedure
Preprocessing
The texts are lowercased and tokenized using SentencePiece with a vocabulary size of 32,000. The inputs of the model are of the form [CLS] Sentence A [SEP] Sentence B [SEP].
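To see this layout, you can pass a sentence pair to the tokenizer; a small sketch with illustrative sentences:
from transformers import AlbertTokenizer
tokenizer = AlbertTokenizer.from_pretrained('cservan/french-albert-base-cased')
encoded = tokenizer("Paris est une grande ville.", "Elle est la capitale de la France.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# -> ['[CLS]', ..., '[SEP]', ..., '[SEP]']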
Training
The FrALBERT procedure follows the BERT setup. For each sentence, 15% of the tokens are masked. In 80% of cases, the masked tokens are replaced by [MASK]; in 10% of cases, they are replaced by a random token; and in the remaining 10% of cases, they are left as is.
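The sketch below illustrates this 80/10/10 rule; it mirrors the usual BERT-style data collation and is not the exact FrALBERT training code:
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_probability=0.15):
    # Select 15% of the tokens for prediction
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_probability)).bool()
    labels[~masked] = -100  # the loss is only computed on masked positions
    # 80% of the selected tokens are replaced by [MASK]
    replaced = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replaced] = mask_token_id
    # 10% are replaced by a random token
    randomized = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replaced
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]
    # the remaining 10% are left unchanged
    return input_ids, labels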
Evaluation results
When fine-tuned on downstream tasks, the FrALBERT models achieve the following results in slot-filling:
| Model | MEDIA |
| ------------------- | ------------ |
| FrALBERT-base | 81.76 (0.59) |
| FrALBERT-base-cased | 85.09 (0.14) |
BibTeX entry and citation info
@inproceedings{cattan2021fralbert,
author = {Oralie Cattan and
Christophe Servan and
Sophie Rosset},
booktitle = {Recent Advances in Natural Language Processing, RANLP 2021},
title = {{On the Usability of Transformers-based models for a French Question-Answering task}},
year = {2021},
address = {Online},
month = sep,
}
Link to the paper: [PDF](https://hal.archives-ouvertes.fr/hal-03336060)
📄 License
This model is licensed under the Apache-2.0 license.