🚀 mALBERT Base Cased 128k
A pretrained multilingual language model using a masked language modeling (MLM) objective. It can distinguish between different cases of words, like 'french' and 'French'.
Supported Languages
- fr, en, de, es, ru, it, zh, sv, pt, pl, ar, nl, ca, vi, ja, hu, he, id, no, fa, ko, tr, fi, ro, el, hy, da, eu, ms, sl, az, bn, cy, hi, ta, ur, th, ka, te, af, sq, lv, ml, kn, tl, is, sw, jv, my, mn, km, am
License
The model is licensed under the Apache-2.0 license.
Datasets
The model is pretrained on the Wikipedia dataset.
🚀 Quick Start
This is a pretrained multilingual language model using a masked language modeling (MLM) objective; see the Model description section below for more details. Unlike other ALBERT models, this model is cased, meaning it can distinguish between different cases of words, such as 'french' and 'French'.
✨ Features
- Multilingual Support: Supports 52 languages, including French, English, and German (see the full list above).
- Bidirectional Representation: Learns a bidirectional representation of sentences through masked language modeling (MLM).
- Sentence Ordering Prediction: Uses Sentence Ordering Prediction (SOP) as a pretraining loss.
- Shared Layers: Shares layers across its Transformer, resulting in a small memory footprint.
📚 Documentation
Model description
mALBERT is a transformers model pretrained on 13GB of multilingual Wikipedia in a self-supervised fashion. It was pretrained on raw texts without any human labeling, using an automatic process to generate inputs and labels from the texts. It was pretrained with two objectives:
- Masked language modeling (MLM): The model randomly masks 15% of the words in a sentence, runs the masked sentence through the model, and predicts the masked words. This allows it to learn a bidirectional representation of the sentence, different from traditional RNNs or autoregressive models like GPT.
- Sentence Ordering Prediction (SOP): mALBERT uses a pretraining loss based on predicting the ordering of two consecutive segments of text.
The model learns an inner representation of languages that can be used to extract features for downstream tasks. For example, if you have a dataset of labeled sentences, you can train a standard classifier using the features produced by the mALBERT model as inputs.
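As an illustration of this feature-extraction use, here is a minimal sketch that feeds mALBERT's pooled sentence representations to a scikit-learn classifier; the `sentences` and `labels` below are hypothetical placeholders, not data from the original card:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AlbertModel, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("cservan/multilingual-albert-base-cased-128k")
model = AlbertModel.from_pretrained("cservan/multilingual-albert-base-cased-128k")
model.eval()

# Hypothetical labeled sentences (placeholders for your own dataset)
sentences = ["Ce film est excellent.", "This movie is terrible."]
labels = [1, 0]

# Use the pooled [CLS] representation of each sentence as a fixed-size feature vector
with torch.no_grad():
    encoded = tokenizer(sentences, padding=True, return_tensors="pt")
    features = model(**encoded).pooler_output.numpy()

# Any standard classifier can be trained on top of the frozen features
clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.predict(features))
```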
mALBERT is unique in that it shares its layers across its Transformer, so all layers have the same weights. Using repeating layers results in a small memory footprint, but the computational cost is similar to a BERT-like architecture with the same number of hidden layers.
This is the second version of the base model, with the following configuration (see the inspection sketch after the list):
- 12 repeating layers
- 128 embedding dimension
- 768 hidden dimension
- 12 attention heads
- 11M parameters
- 128k vocabulary size
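The configuration above can be checked against the published checkpoint with the standard `transformers` `AlbertConfig` API; a minimal sketch:

```python
from transformers import AlbertConfig

config = AlbertConfig.from_pretrained("cservan/multilingual-albert-base-cased-128k")

# Architecture hyper-parameters reported in the list above
print(config.num_hidden_layers)    # repeating layers
print(config.embedding_size)       # embedding dimension
print(config.hidden_size)          # hidden dimension
print(config.num_attention_heads)  # attention heads
print(config.vocab_size)           # vocabulary size
```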
Intended uses & limitations
You can use the raw model for either masked language modeling or sentence ordering prediction, but it is mainly intended to be fine-tuned on a downstream task. Check the [model hub](https://huggingface.co/models?filter=malbert-base-cased-128k) for fine-tuned versions on tasks that interest you.
Note that this model is primarily aimed at tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For tasks like text generation, you should consider models like GPT2.
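For masked language modeling, the raw checkpoint can be queried with the standard `fill-mask` pipeline from `transformers` (an illustrative sketch, not taken from the original card; the example sentence is arbitrary and the actual predictions depend on the checkpoint):

```python
from transformers import pipeline

# Load the fill-mask pipeline with the mALBERT checkpoint
unmasker = pipeline("fill-mask", model="cservan/multilingual-albert-base-cased-128k")

# The tokenizer's mask token is [MASK]; the model ranks candidate fillers
print(unmasker("Paris est la [MASK] de la France."))
```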
💻 Usage Examples
Basic Usage
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import AlbertTokenizer, AlbertModel

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AlbertTokenizer.from_pretrained("cservan/multilingual-albert-base-cased-128k")
model = AlbertModel.from_pretrained("cservan/multilingual-albert-base-cased-128k")

text = "Remplacez-moi par le texte en français que vous souhaitez."
# Tokenize and run a forward pass; output.last_hidden_state holds the token-level features
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)
```
In TensorFlow:
```python
from transformers import AlbertTokenizer, TFAlbertModel

# Load the tokenizer and the TensorFlow model from the Hugging Face Hub
tokenizer = AlbertTokenizer.from_pretrained("cservan/multilingual-albert-base-cased-128k")
model = TFAlbertModel.from_pretrained("cservan/multilingual-albert-base-cased-128k")

text = "Remplacez-moi par le texte en français que vous souhaitez."
# Tokenize and run a forward pass; output.last_hidden_state holds the token-level features
encoded_input = tokenizer(text, return_tensors="tf")
output = model(encoded_input)
```
🔧 Technical Details
Training data
The mALBERT model was pretrained on 13GB of multilingual Wikipedia (excluding lists, tables, and headers).
Training procedure
Preprocessing
The texts are tokenized using SentencePiece with a vocabulary size of 128,000; since the model is cased, the texts are not lowercased. The inputs of the model are in the form:
[CLS] Sentence A [SEP] Sentence B [SEP]
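As a quick, illustrative check (not part of the original card), tokenizing a hypothetical sentence pair shows this layout:

```python
from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("cservan/multilingual-albert-base-cased-128k")

# Encode a sentence pair and inspect the special-token layout
encoded = tokenizer("Sentence A", "Sentence B")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Expected layout: [CLS] <tokens of A> [SEP] <tokens of B> [SEP]
```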
Training
The mALBERT procedure follows the BERT setup.
The details of the masking procedure for each sentence are as follows (a code sketch of this scheme is given after the list):
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token, different from the one they replace.
- In the remaining 10% of the cases, the masked tokens are left as they are.
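Here is a minimal sketch of this 15% / 80-10-10 masking scheme, closely following the logic of Hugging Face's `DataCollatorForLanguageModeling`; it is illustrative only and not the exact script used to pretrain mALBERT:

```python
import torch
from transformers import AlbertTokenizer

def mask_tokens(input_ids, tokenizer, mlm_probability=0.15):
    """Apply the 15% / 80-10-10 masking scheme to a batch of token ids."""
    labels = input_ids.clone()

    # Pick 15% of the (non-special) positions as prediction targets
    probability_matrix = torch.full(labels.shape, mlm_probability)
    special_tokens_mask = torch.tensor(
        [tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True) for ids in labels.tolist()],
        dtype=torch.bool,
    )
    probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # the MLM loss is only computed on masked positions

    # 80% of the selected tokens are replaced by [MASK]
    replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    input_ids[replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)

    # 10% are replaced by a random token (half of the remaining 20%)
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~replaced
    input_ids[randomized] = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)[randomized]

    # The remaining 10% are left unchanged
    return input_ids, labels

tokenizer = AlbertTokenizer.from_pretrained("cservan/multilingual-albert-base-cased-128k")
batch = tokenizer(["Paris est la capitale de la France."], return_tensors="pt")
masked_ids, mlm_labels = mask_tokens(batch["input_ids"], tokenizer)
```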
Tools
The tools used to pretrain the model are available [here](https://gitlab.lisn.upsaclay.fr/nlp/deep-learning/UER-py).
Evaluation results
When fine-tuned on downstream tasks, the ALBERT models achieve the following results:
Slot-filling

| Models ⧹ Tasks | MMNLU | MultiATIS++ | CoNLL2003 | MultiCoNER | SNIPS | MEDIA |
|---|---|---|---|---|---|---|
| EnALBERT | N/A | N/A | 89.67 (0.34) | 42.36 (0.22) | 95.95 (0.13) | N/A |
| FrALBERT | N/A | N/A | N/A | N/A | N/A | 81.76 (0.59) |
| mALBERT-128k | 65.81 (0.11) | 89.14 (0.15) | 88.27 (0.24) | 46.01 (0.18) | 91.60 (0.31) | 83.15 (0.38) |
| mALBERT-64k | 65.29 (0.14) | 88.88 (0.14) | 86.44 (0.37) | 44.70 (0.27) | 90.84 (0.47) | 82.30 (0.19) |
| mALBERT-32k | 64.83 (0.22) | 88.60 (0.27) | 84.96 (0.41) | 44.13 (0.39) | 89.89 (0.68) | 82.04 (0.28) |
Classification task
| Models ⧹ Tasks | MMNLU | MultiATIS++ | SNIPS | SST2 |
|---|---|---|---|---|
| mALBERT-128k | 72.35 (0.09) | 90.58 (0.98) | 96.84 (0.49) | 34.66 (1.46) |
| mALBERT-64k | 71.26 (0.11) | 90.97 (0.70) | 96.53 (0.44) | 34.64 (1.02) |
| mALBERT-32k | 70.76 (0.11) | 90.55 (0.98) | 96.49 (0.45) | 34.18 (1.64) |
BibTeX entry and citation info
```bibtex
@inproceedings{servan2024mALBERT,
  author    = {Christophe Servan and
               Sahar Ghannay and
               Sophie Rosset},
  booktitle = {the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  title     = {{mALBERT: Is a Compact Multilingual BERT Model Still Worth It?}},
  year      = {2024},
  address   = {Torino, Italy},
  month     = may,
}
```
Link to the paper: [PDF](https://hal.science/hal-04520797)
📄 License
The model is licensed under the Apache-2.0 license.