🚀 ALBERT Large v2
A pre-trained English language model trained with a masked language modeling (MLM) objective, offering strong feature extraction for downstream tasks.
🚀 Quick Start
This ALBERT Large v2 model is a pre-trained English language model. It can be used directly for masked language modeling or sentence order prediction, but it is mainly intended to be fine-tuned on downstream tasks. You can find fine-tuned versions on the model hub.
✨ Features
- Bidirectional Representation Learning: Through masked language modeling (MLM), the model learns a bidirectional representation of sentences, unlike traditional RNNs and autoregressive models that see tokens one after another.
- Sentence Order Prediction: ALBERT uses a pretraining loss based on predicting the ordering of two consecutive text segments.
- Layer Sharing: ALBERT shares its layer parameters across the Transformer, resulting in a small memory footprint; see the sketch after this list for a quick check.
- Improved Version 2: Version 2 performs better on nearly all downstream tasks thanks to different dropout rates, additional training data, and longer training.
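To illustrate the effect of layer sharing, here is a minimal sketch that counts the parameters of the published checkpoint (an illustrative check, not part of the original card; it assumes the `albert-large-v2` checkpoint and a PyTorch install):

```python
from transformers import AlbertModel

# Despite its 24 repeating layers, ALBERT Large stays at roughly 17M-18M
# parameters because the Transformer layers share their weights.
model = AlbertModel.from_pretrained("albert-large-v2")
num_params = sum(p.numel() for p in model.parameters())
print(f"albert-large-v2 parameters: {num_params / 1e6:.1f}M")
```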
📦 Installation
The usage examples below rely on the 🤗 Transformers library, which can be installed with `pip install transformers` (plus PyTorch or TensorFlow, depending on which backend you want to use).
💻 Usage Examples
Basic Usage
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='albert-large-v2')
>>> unmasker("Hello I'm a [MASK] model.")
[
   {
      "sequence":"[CLS] hello i'm a modeling model.[SEP]",
      "score":0.05816134437918663,
      "token":12807,
      "token_str":"▁modeling"
   },
   {
      "sequence":"[CLS] hello i'm a modelling model.[SEP]",
      "score":0.03748830780386925,
      "token":23089,
      "token_str":"▁modelling"
   },
   {
      "sequence":"[CLS] hello i'm a model model.[SEP]",
      "score":0.033725276589393616,
      "token":1061,
      "token_str":"▁model"
   },
   {
      "sequence":"[CLS] hello i'm a runway model.[SEP]",
      "score":0.017313428223133087,
      "token":8014,
      "token_str":"▁runway"
   },
   {
      "sequence":"[CLS] hello i'm a lingerie model.[SEP]",
      "score":0.014405295252799988,
      "token":29104,
      "token_str":"▁lingerie"
   }
]
```
Advanced Usage
Get Features in PyTorch
```python
from transformers import AlbertTokenizer, AlbertModel

# Load the tokenizer and the pre-trained model.
tokenizer = AlbertTokenizer.from_pretrained('albert-large-v2')
model = AlbertModel.from_pretrained('albert-large-v2')

# Tokenize the input text and run it through the model to extract features.
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
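Continuing from the snippet above, the returned object exposes the extracted features; a minimal sketch of inspecting them (assuming the default `AlbertModel` output format, with a hidden size of 1024 for this checkpoint):

```python
# Per-token hidden states and the pooled sentence-level representation.
print(output.last_hidden_state.shape)  # (batch_size, sequence_length, 1024)
print(output.pooler_output.shape)      # (batch_size, 1024)
```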
Get Features in TensorFlow
```python
from transformers import AlbertTokenizer, TFAlbertModel

# Load the tokenizer and the pre-trained model.
tokenizer = AlbertTokenizer.from_pretrained('albert-large-v2')
model = TFAlbertModel.from_pretrained('albert-large-v2')

# Tokenize the input text and run it through the model to extract features.
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
📚 Documentation
Model Configuration
This model has the following configuration:
| Property | Details |
|---|---|
| Repeating Layers | 24 |
| Embedding Dimension | 128 |
| Hidden Dimension | 1024 |
| Attention Heads | 16 |
| Parameters | 17M |
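These values can also be checked programmatically; a minimal sketch using `AlbertConfig` (an illustrative check, assuming the checkpoint is reachable on the Hugging Face Hub):

```python
from transformers import AlbertConfig

config = AlbertConfig.from_pretrained("albert-large-v2")
print(config.num_hidden_layers)    # 24 repeating layers
print(config.embedding_size)       # 128 embedding dimension
print(config.hidden_size)          # 1024 hidden dimension
print(config.num_attention_heads)  # 16 attention heads
```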
Intended Uses & Limitations
This model is primarily intended to be fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For tasks such as text generation, you should look at models like GPT-2.
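As an example of such a setup, here is a minimal sketch that loads the checkpoint with a sequence-classification head (the two-label configuration and the input sentence are illustrative assumptions; the classification head is newly initialized and still needs fine-tuning before its outputs are meaningful):

```python
from transformers import AlbertTokenizer, AlbertForSequenceClassification

tokenizer = AlbertTokenizer.from_pretrained("albert-large-v2")
# The encoder weights are pre-trained; the classification head is randomly
# initialized and must be fine-tuned on a labelled dataset.
model = AlbertForSequenceClassification.from_pretrained("albert-large-v2", num_labels=2)

inputs = tokenizer("Replace me by any text you'd like.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch_size, num_labels) -> (1, 2)
```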
Limitations and Bias
Even if the training data is fairly neutral, this model can have biased predictions. For example:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='albert-large-v2')
>>> unmasker("The man worked as a [MASK].")
[
   {
      "sequence":"[CLS] the man worked as a chauffeur.[SEP]",
      "score":0.029577180743217468,
      "token":28744,
      "token_str":"▁chauffeur"
   },
   {
      "sequence":"[CLS] the man worked as a janitor.[SEP]",
      "score":0.028865724802017212,
      "token":29477,
      "token_str":"▁janitor"
   },
   {
      "sequence":"[CLS] the man worked as a shoemaker.[SEP]",
      "score":0.02581118606030941,
      "token":29024,
      "token_str":"▁shoemaker"
   },
   {
      "sequence":"[CLS] the man worked as a blacksmith.[SEP]",
      "score":0.01849772222340107,
      "token":21238,
      "token_str":"▁blacksmith"
   },
   {
      "sequence":"[CLS] the man worked as a lawyer.[SEP]",
      "score":0.01820771023631096,
      "token":3672,
      "token_str":"▁lawyer"
   }
]

>>> unmasker("The woman worked as a [MASK].")
[
   {
      "sequence":"[CLS] the woman worked as a receptionist.[SEP]",
      "score":0.04604868218302727,
      "token":25331,
      "token_str":"▁receptionist"
   },
   {
      "sequence":"[CLS] the woman worked as a janitor.[SEP]",
      "score":0.028220869600772858,
      "token":29477,
      "token_str":"▁janitor"
   },
   {
      "sequence":"[CLS] the woman worked as a paramedic.[SEP]",
      "score":0.0261906236410141,
      "token":23386,
      "token_str":"▁paramedic"
   },
   {
      "sequence":"[CLS] the woman worked as a chauffeur.[SEP]",
      "score":0.024797942489385605,
      "token":28744,
      "token_str":"▁chauffeur"
   },
   {
      "sequence":"[CLS] the woman worked as a waitress.[SEP]",
      "score":0.024124596267938614,
      "token":13678,
      "token_str":"▁waitress"
   }
]
```
This bias will also affect all fine-tuned versions of this model.
Training Data
The ALBERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books, and on English Wikipedia (excluding lists, tables and headers).
Training Procedure
Preprocessing
The texts are lowercased and tokenized using SentencePiece and a vocabulary size of 30,000. The inputs of the model are then of the form:
```
[CLS] Sentence A [SEP] Sentence B [SEP]
```
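A minimal sketch showing how the tokenizer produces this layout for a sentence pair (the example sentences are illustrative, and the decoded string is approximate):

```python
from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-large-v2")
# Encoding a sentence pair lowercases the text and adds the [CLS] and [SEP]
# special tokens in the layout described above.
encoded = tokenizer("Sentence A", "Sentence B")
print(tokenizer.decode(encoded["input_ids"]))
# roughly: [CLS] sentence a[SEP] sentence b[SEP]
```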
Training
The ALBERT procedure follows the BERT setup. The details of the masking procedure for each sentence are the following (a sketch of this rule is shown after the list):
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the remaining 10% of the cases, the masked tokens are left as is.
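A minimal sketch of this 80/10/10 rule (illustrative only, not the original training code; the function name, the `-100` ignore index, and the sampling details are assumptions following common Transformers conventions):

```python
import random

def mask_tokens(token_ids, mask_token_id, vocab_size, mlm_probability=0.15):
    """Apply the 80/10/10 masking rule described above to a list of token ids."""
    inputs, labels = [], []
    for tok in token_ids:
        if random.random() < mlm_probability:
            labels.append(tok)  # this position contributes to the MLM loss
            r = random.random()
            if r < 0.8:
                inputs.append(mask_token_id)                 # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.randrange(vocab_size))  # 10%: random token
                # (the original recipe samples a token different from the one
                #  it replaces; that check is omitted here for brevity)
            else:
                inputs.append(tok)                           # 10%: keep the original token
        else:
            labels.append(-100)  # ignored by the loss
            inputs.append(tok)
    return inputs, labels
```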
Evaluation Results
When fine-tuned on downstream tasks, the ALBERT models achieve the following results:

| | Average | SQuAD1.1 | SQuAD2.0 | MNLI | SST-2 | RACE |
|---|---|---|---|---|---|---|
| V2 | | | | | | |
| ALBERT-base | 82.3 | 90.2/83.2 | 82.1/79.3 | 84.6 | 92.9 | 66.8 |
| ALBERT-large | 85.7 | 91.8/85.2 | 84.9/81.8 | 86.5 | 94.9 | 75.2 |
| ALBERT-xlarge | 87.9 | 92.9/86.4 | 87.9/84.1 | 87.9 | 95.4 | 80.7 |
| ALBERT-xxlarge | 90.9 | 94.6/89.1 | 89.8/86.9 | 90.6 | 96.8 | 86.8 |
| V1 | | | | | | |
| ALBERT-base | 80.1 | 89.3/82.3 | 80.0/77.1 | 81.6 | 90.3 | 64.0 |
| ALBERT-large | 82.4 | 90.6/83.9 | 82.3/79.4 | 83.5 | 91.7 | 68.5 |
| ALBERT-xlarge | 85.5 | 92.5/86.1 | 86.1/83.1 | 86.4 | 92.4 | 74.8 |
| ALBERT-xxlarge | 91.0 | 94.8/89.3 | 90.2/87.4 | 90.8 | 96.9 | 86.5 |
📄 License
The model is released under the Apache 2.0 license.

