🚀 mALBERT Base Cased 128k
A pretrained multilingual language model using a masked language modeling (MLM) objective. It can distinguish between different cases of words, like 'french' and 'French'.
Supported Languages
- fr, en, de, es, ru, it, zh, sv, pt, pl, ar, nl, ca, vi, ja, hu, he, id, no, fa, ko, tr, fi, ro, el, hy, da, eu, ms, sl, az, bn, cy, hi, ta, ur, th, ka, te, af, sq, lv, ml, kn, tl, is, sw, jv, my, mn, km, am
License
The model is licensed under the Apache-2.0 license.
Datasets
The model is pretrained on the Wikipedia dataset.
🚀 Quick Start
This is a pretrained multilingual language model using a masked language modeling (MLM) objective; see the Model description section below for more details. Unlike other ALBERT models, this model is cased, meaning it can distinguish between different cases of words, such as 'french' and 'French'.
✨ Features
- Multilingual Support: Supports 52 languages, including French, English, and German (see the full list above).
- Bidirectional Representation: Learns a bidirectional representation of sentences through masked language modeling (MLM).
- Sentence Ordering Prediction: Uses Sentence Ordering Prediction (SOP) as a pretraining loss.
- Shared Layers: Shares layers across its Transformer, resulting in a small memory footprint.
📚 Documentation
Model description
mALBERT is a transformers model pretrained on 13GB of multilingual Wikipedia in a self-supervised fashion. It was pretrained on raw texts without any human labeling, using an automatic process to generate inputs and labels from the texts. It was pretrained with two objectives:
- Masked language modeling (MLM): The model randomly masks 15% of the words in a sentence, runs the masked sentence through the model, and predicts the masked words. This allows it to learn a bidirectional representation of the sentence, different from traditional RNNs or autoregressive models like GPT.
- Sentence Ordering Prediction (SOP): mALBERT uses a pretraining loss based on predicting the ordering of two consecutive segments of text.
The model learns an inner representation of languages that can be used to extract features for downstream tasks. For example, if you have a dataset of labeled sentences, you can train a standard classifier using the features produced by the mALBERT model as inputs.
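As an illustration of this feature-extraction use, here is a minimal sketch that feeds mALBERT's pooled sentence representations to a scikit-learn classifier; the `sentences` and `labels` below are hypothetical placeholders, not data from the original card:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AlbertModel, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("cservan/multilingual-albert-base-cased-128k")
model = AlbertModel.from_pretrained("cservan/multilingual-albert-base-cased-128k")
model.eval()

# Hypothetical labeled sentences (placeholders for your own dataset)
sentences = ["Ce film est excellent.", "This movie is terrible."]
labels = [1, 0]

# Use the pooled [CLS] representation of each sentence as a fixed-size feature vector
with torch.no_grad():
    encoded = tokenizer(sentences, padding=True, return_tensors="pt")
    features = model(**encoded).pooler_output.numpy()

# Any standard classifier can be trained on top of the frozen features
clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.predict(features))
```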
mALBERT is unique in that it shares its layers across its Transformer, so all layers have the same weights. Using repeating layers results in a small memory footprint, but the computational cost is similar to a BERT-like architecture with the same number of hidden layers.
This is the second version of the base model, with the following configuration (see the inspection sketch after the list):
- 12 repeating layers
- 128 embedding dimension
- 768 hidden dimension
- 12 attention heads
- 11M parameters
- 128k vocabulary size
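The configuration above can be checked against the published checkpoint with the standard `transformers` `AlbertConfig` API; a minimal sketch:

```python
from transformers import AlbertConfig

config = AlbertConfig.from_pretrained("cservan/multilingual-albert-base-cased-128k")

# Architecture hyper-parameters reported in the list above
print(config.num_hidden_layers)    # repeating layers
print(config.embedding_size)       # embedding dimension
print(config.hidden_size)          # hidden dimension
print(config.num_attention_heads)  # attention heads
print(config.vocab_size)           # vocabulary size
```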
Intended uses & limitations
You can use the raw model for either masked language modeling or sentence ordering prediction, but it is mainly intended to be fine-tuned on a downstream task. Check the [model hub](https://huggingface.co/models?filter=malbert-base-cased-128k) for fine-tuned versions on tasks that interest you.
Note that this model is primarily aimed at tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For tasks like text generation, you should consider models like GPT2.
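For masked language modeling, the raw checkpoint can be queried with the standard `fill-mask` pipeline from `transformers` (an illustrative sketch, not taken from the original card; the example sentence is arbitrary and the actual predictions depend on the checkpoint):

```python
from transformers import pipeline

# Load the fill-mask pipeline with the mALBERT checkpoint
unmasker = pipeline("fill-mask", model="cservan/multilingual-albert-base-cased-128k")

# The tokenizer's mask token is [MASK]; the model ranks candidate fillers
print(unmasker("Paris est la [MASK] de la France."))
```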
💻 Usage Examples
Basic Usage
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import AlbertTokenizer, AlbertModel

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AlbertTokenizer.from_pretrained("cservan/multilingual-albert-base-cased-128k")
model = AlbertModel.from_pretrained("cservan/multilingual-albert-base-cased-128k")

text = "Remplacez-moi par le texte en français que vous souhaitez."
# Tokenize and run a forward pass; output.last_hidden_state holds the token-level features
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)
```
In TensorFlow:
```python
from transformers import AlbertTokenizer, TFAlbertModel

# Load the tokenizer and the TensorFlow model from the Hugging Face Hub
tokenizer = AlbertTokenizer.from_pretrained("cservan/multilingual-albert-base-cased-128k")
model = TFAlbertModel.from_pretrained("cservan/multilingual-albert-base-cased-128k")

text = "Remplacez-moi par le texte en français que vous souhaitez."
# Tokenize and run a forward pass; output.last_hidden_state holds the token-level features
encoded_input = tokenizer(text, return_tensors="tf")
output = model(encoded_input)
```
🔧 Technical Details
Training data
The mALBERT model was pretrained on 13GB of multilingual Wikipedia (excluding lists, tables, and headers).
Training procedure
Preprocessing
The texts are tokenized using SentencePiece with a vocabulary size of 128,000; since the model is cased, the texts are not lowercased. The inputs of the model are in the form:
[CLS] Sentence A [SEP] Sentence B [SEP]
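As a quick, illustrative check (not part of the original card), tokenizing a hypothetical sentence pair shows this layout:

```python
from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("cservan/multilingual-albert-base-cased-128k")

# Encode a sentence pair and inspect the special-token layout
encoded = tokenizer("Sentence A", "Sentence B")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Expected layout: [CLS] <tokens of A> [SEP] <tokens of B> [SEP]
```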
Training
The mALBERT procedure follows the BERT setup.
The details of the masking procedure for each sentence are as follows (a code sketch of this scheme is given after the list):
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token, different from the one they replace.
- In the remaining 10% of the cases, the masked tokens are left as they are.
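Here is a minimal sketch of this 15% / 80-10-10 masking scheme, closely following the logic of Hugging Face's `DataCollatorForLanguageModeling`; it is illustrative only and not the exact script used to pretrain mALBERT:

```python
import torch
from transformers import AlbertTokenizer

def mask_tokens(input_ids, tokenizer, mlm_probability=0.15):
    """Apply the 15% / 80-10-10 masking scheme to a batch of token ids."""
    labels = input_ids.clone()

    # Pick 15% of the (non-special) positions as prediction targets
    probability_matrix = torch.full(labels.shape, mlm_probability)
    special_tokens_mask = torch.tensor(
        [tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True) for ids in labels.tolist()],
        dtype=torch.bool,
    )
    probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # the MLM loss is only computed on masked positions

    # 80% of the selected tokens are replaced by [MASK]
    replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    input_ids[replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)

    # 10% are replaced by a random token (half of the remaining 20%)
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~replaced
    input_ids[randomized] = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)[randomized]

    # The remaining 10% are left unchanged
    return input_ids, labels

tokenizer = AlbertTokenizer.from_pretrained("cservan/multilingual-albert-base-cased-128k")
batch = tokenizer(["Paris est la capitale de la France."], return_tensors="pt")
masked_ids, mlm_labels = mask_tokens(batch["input_ids"], tokenizer)
```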
Tools
The tools used to pretrain the model are available [here](https://gitlab.lisn.upsaclay.fr/nlp/deep-learning/UER-py).
Evaluation results
When fine-tuned on downstream tasks, the ALBERT models achieve the following results:
Slot-filling

| Models ⧹ Tasks | MMNLU | MultiATIS++ | CoNLL2003 | MultiCoNER | SNIPS | MEDIA |
|---|---|---|---|---|---|---|
| EnALBERT | N/A | N/A | 89.67 (0.34) | 42.36 (0.22) | 95.95 (0.13) | N/A |
| FrALBERT | N/A | N/A | N/A | N/A | N/A | 81.76 (0.59) |
| mALBERT-128k | 65.81 (0.11) | 89.14 (0.15) | 88.27 (0.24) | 46.01 (0.18) | 91.60 (0.31) | 83.15 (0.38) |
| mALBERT-64k | 65.29 (0.14) | 88.88 (0.14) | 86.44 (0.37) | 44.70 (0.27) | 90.84 (0.47) | 82.30 (0.19) |
| mALBERT-32k | 64.83 (0.22) | 88.60 (0.27) | 84.96 (0.41) | 44.13 (0.39) | 89.89 (0.68) | 82.04 (0.28) |
Classification task
| Models ⧹ Tasks | MMNLU | MultiATIS++ | SNIPS | SST2 |
|---|---|---|---|---|
| mALBERT-128k | 72.35 (0.09) | 90.58 (0.98) | 96.84 (0.49) | 34.66 (1.46) |
| mALBERT-64k | 71.26 (0.11) | 90.97 (0.70) | 96.53 (0.44) | 34.64 (1.02) |
| mALBERT-32k | 70.76 (0.11) | 90.55 (0.98) | 96.49 (0.45) | 34.18 (1.64) |
BibTeX entry and citation info
```bibtex
@inproceedings{servan2024mALBERT,
  author    = {Christophe Servan and
               Sahar Ghannay and
               Sophie Rosset},
  booktitle = {the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  title     = {{mALBERT: Is a Compact Multilingual BERT Model Still Worth It?}},
  year      = {2024},
  address   = {Torino, Italy},
  month     = may,
}
```
Link to the paper: [PDF](https://hal.science/hal-04520797)
📄 License
The model is licensed under the Apache-2.0 license.