🚀 FrALBERT Base Cased
A model pretrained on French with a masked language modeling (MLM) objective. It learns a bidirectional representation of sentences and can be used for downstream tasks.
🚀 Quick Start
FrALBERT Base Cased is a model pretrained on French Wikipedia using masked language modeling and sentence ordering prediction. You can use it for masked language modeling or next sentence prediction, or fine-tune it for downstream tasks.
✨ Features
- Bidirectional Representation: Through masked language modeling, it can learn a bidirectional understanding of sentences.
- Shared Layers: It shares layers across its Transformer, resulting in a small memory footprint.
- SOP Objective: Uses Sentence Ordering Prediction to enhance the understanding of text ordering.
📦 Installation
This README does not provide specific installation steps; refer to the official Hugging Face documentation for installing the related libraries, such as transformers.
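A typical setup, assuming a standard Python environment, installs the transformers library plus the backend you prefer (PyTorch or TensorFlow):
pip install transformers
pip install torch  # or: pip install tensorflow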
💻 Usage Examples
Basic Usage
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='cservan/french-albert-base-cased')
>>> unmasker("Paris est la capitale de la [MASK] .")
[
{
"sequence": "paris est la capitale de la france.",
"score": 0.6231236457824707,
"token": 3043,
"token_str": "france"
},
{
"sequence": "paris est la capitale de la region.",
"score": 0.2993471622467041,
"token": 10531,
"token_str": "region"
},
{
"sequence": "paris est la capitale de la societe.",
"score": 0.02028230018913746,
"token": 24622,
"token_str": "societe"
},
{
"sequence": "paris est la capitale de la bretagne.",
"score": 0.012089950032532215,
"token": 24987,
"token_str": "bretagne"
},
{
"sequence": "paris est la capitale de la chine.",
"score": 0.010002839379012585,
"token": 14860,
"token_str": "chine"
}
]
Advanced Usage
Get features in PyTorch
from transformers import AlbertTokenizer, AlbertModel
tokenizer = AlbertTokenizer.from_pretrained('cservan/french-albert-base-cased')
model = AlbertModel.from_pretrained("cservan/french-albert-base-cased")
text = "Remplacez-moi par le texte en français que vous souhaitez."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
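The token-level features can then be read from the returned output object; a small follow-up to the snippet above, assuming the standard AlbertModel output fields:
features = output.last_hidden_state  # tensor of shape (batch_size, sequence_length, 768)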
Get features in TensorFlow
from transformers import AlbertTokenizer, TFAlbertModel
tokenizer = AlbertTokenizer.from_pretrained('cservan/french-albert-base-cased')
model = TFAlbertModel.from_pretrained("cservan/french-albert-base-cased")
text = "Remplacez-moi par le texte en français que vous souhaitez."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
📚 Documentation
Model description
FrALBERT is a transformers model pretrained on 16 GB of French Wikipedia in a self-supervised fashion. It was pretrained with two objectives: masked language modeling (MLM) and sentence ordering prediction (SOP). This way, it learns an inner representation of the French language that can be used for downstream tasks.
This model is distinctive in that it shares its layers across its Transformer, resulting in a small memory footprint. It is the second version of the base model, with the following configuration:
- 12 repeating layers
- 128 embedding dimension
- 768 hidden dimension
- 12 attention heads
- 11M parameters
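As a quick sanity check, these values can be read back from the published configuration; a minimal sketch using the standard AlbertConfig fields from transformers:
from transformers import AlbertConfig
config = AlbertConfig.from_pretrained('cservan/french-albert-base-cased')
# Expect 12 repeating layers, 128-dim embeddings, 768-dim hidden states and 12 attention heads
print(config.num_hidden_layers, config.embedding_size, config.hidden_size, config.num_attention_heads)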
Intended uses & limitations
You can use the raw model for masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on downstream tasks. It is mainly suitable for tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For text generation tasks, you should look at models like GPT-2.
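As an illustration of the fine-tuning route, the sketch below loads the pretrained encoder with a fresh sequence-classification head; the num_labels value and the example sentence are placeholders, not part of the original card:
from transformers import AlbertTokenizer, AlbertForSequenceClassification
tokenizer = AlbertTokenizer.from_pretrained('cservan/french-albert-base-cased')
# num_labels=2 stands in for a binary classification task
model = AlbertForSequenceClassification.from_pretrained('cservan/french-albert-base-cased', num_labels=2)
inputs = tokenizer("Ce film était excellent.", return_tensors='pt')
outputs = model(**inputs)  # outputs.logits feeds the task-specific loss during fine-tuning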
Training data
The FrALBERT model was pretrained on 4 GB of French Wikipedia (excluding lists, tables, and headers).
Training procedure
Preprocessing
The texts are lowercased and tokenized using SentencePiece with a vocabulary size of 32,000. The inputs of the model are of the form [CLS] Sentence A [SEP] Sentence B [SEP].
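To see this layout, you can pass a sentence pair to the tokenizer; a small sketch with illustrative sentences:
from transformers import AlbertTokenizer
tokenizer = AlbertTokenizer.from_pretrained('cservan/french-albert-base-cased')
encoded = tokenizer("Paris est une grande ville.", "Elle est la capitale de la France.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# -> ['[CLS]', ..., '[SEP]', ..., '[SEP]']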
Training
The FrALBERT procedure follows the BERT setup. For each sentence, 15% of the tokens are masked. In 80% of cases, the masked tokens are replaced by [MASK]; in 10% of cases, they are replaced by a random token; and in the remaining 10% of cases, they are left as is.
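The sketch below illustrates this 80/10/10 rule; it mirrors the usual BERT-style data collation and is not the exact FrALBERT training code:
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_probability=0.15):
    # Select 15% of the tokens for prediction
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_probability)).bool()
    labels[~masked] = -100  # the loss is only computed on masked positions
    # 80% of the selected tokens are replaced by [MASK]
    replaced = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replaced] = mask_token_id
    # 10% are replaced by a random token
    randomized = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replaced
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]
    # the remaining 10% are left unchanged
    return input_ids, labels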
Evaluation results
When fine-tuned on downstream tasks, the FrALBERT models achieve the following results in slot-filling:
| Model | MEDIA |
| ------------------- | ------------ |
| FrALBERT-base | 81.76 (0.59) |
| FrALBERT-base-cased | 85.09 (0.14) |
BibTeX entry and citation info
@inproceedings{cattan2021fralbert,
author = {Oralie Cattan and
Christophe Servan and
Sophie Rosset},
booktitle = {Recent Advances in Natural Language Processing, RANLP 2021},
title = {{On the Usability of Transformers-based models for a French Question-Answering task}},
year = {2021},
address = {Online},
month = sep,
}
Link to the paper: [PDF](https://hal.archives-ouvertes.fr/hal-03336060)
📄 License
This model is licensed under the Apache-2.0 license.