🚀 Albertina 1.5B PTBR
Albertina 1.5B PTBR is a large foundation language model tailored for the American variant of Portuguese, the variant spoken in Brazil (PTBR). It is an encoder of the BERT family, based on the Transformer neural architecture and developed over the DeBERTa model. It offers strong performance for Portuguese and comes in versions trained for different variants of the language, including the European (PTPT) and American (PTBR) variants. It is freely distributed under an open-source license.
✨ Features
- Variant-Specific Training: Available in versions trained for both the European (PTPT) and American (PTBR) variants of Portuguese.
- State-of-the-Art Performance: With 1.5 billion parameters, it set a new state of the art for the American variant of Portuguese at the time of its initial release.
- Open-Source Distribution: Freely available for reuse under a permissive license.
📦 Installation
The model card does not list dedicated installation steps; the usage examples below only require the Hugging Face transformers library (plus datasets and PyTorch for the fine-tuning example).
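As an assumption (the card itself gives no install commands), these dependencies can typically be set up with pip:

pip install transformers datasets torch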
💻 Usage Examples
Basic Usage
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='PORTULAN/albertina-1b5-portuguese-ptbr-encoder')
>>> unmasker("A culinária portuguesa é rica em sabores e [MASK], tornando-se um dos maiores tesouros do país.")
[{'score': 0.8332648277282715, 'token': 14690, 'token_str': ' costumes', 'sequence': 'A culinária portuguesa é rica em sabores e costumes, tornando-se um dos maiores tesouros do país.'},
{'score': 0.07860890030860901, 'token': 29829, 'token_str': ' cores', 'sequence': 'A culinária portuguesa é rica em sabores e cores, tornando-se um dos maiores tesouros do país.'},
{'score': 0.03278181701898575, 'token': 35277, 'token_str': ' arte', 'sequence': 'A culinária portuguesa é rica em sabores e arte, tornando-se um dos maiores tesouros do país.'},
{'score': 0.009515956044197083, 'token': 9240, 'token_str': ' cor', 'sequence': 'A culinária portuguesa é rica em sabores e cor, tornando-se um dos maiores tesouros do país.'},
{'score': 0.009381960146129131, 'token': 33455, 'token_str': ' nuances', 'sequence': 'A culinária portuguesa é rica em sabores e nuances, tornando-se um dos maiores tesouros do país.'}]
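Beyond masked language modeling, the encoder can also be used to produce contextual embeddings. The following is a minimal sketch, not taken from the original card: it loads the encoder with the generic AutoModel class and mean-pools the last hidden states, where the pooling strategy is an illustrative assumption rather than a recommendation by the authors.
>>> import torch
>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained('PORTULAN/albertina-1b5-portuguese-ptbr-encoder')
>>> model = AutoModel.from_pretrained('PORTULAN/albertina-1b5-portuguese-ptbr-encoder')
>>> inputs = tokenizer("A culinária portuguesa é rica em sabores.", return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> # Mean-pool the token representations into one sentence vector;
>>> # the result has shape (1, 1536), matching the model's hidden size of 1536.
>>> embedding = outputs.last_hidden_state.mean(dim=1)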
Advanced Usage
The model can also be used by fine-tuning it for a specific downstream task:
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
>>> from datasets import load_dataset
>>> model = AutoModelForSequenceClassification.from_pretrained("PORTULAN/albertina-1b5-portuguese-ptbr-encoder", num_labels=2)
>>> tokenizer = AutoTokenizer.from_pretrained("PORTULAN/albertina-1b5-portuguese-ptbr-encoder")
>>> dataset = load_dataset("PORTULAN/glue-ptbr", "rte")
>>> def tokenize_function(examples):
...     return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True)
>>> tokenized_datasets = dataset.map(tokenize_function, batched=True)
>>> training_args = TrainingArguments(output_dir="albertina-ptbr-rte", evaluation_strategy="epoch")
>>> trainer = Trainer(
... model=model,
... args=training_args,
... train_dataset=tokenized_datasets["train"],
... eval_dataset=tokenized_datasets["validation"],
... )
>>> trainer.train()
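After training, the validation split can be scored in the same session. This is a standard Trainer call rather than part of the original example; without a compute_metrics function it reports only the evaluation loss and runtime statistics.
>>> metrics = trainer.evaluate()
>>> print(metrics)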
📚 Documentation
Model Description
This model card is for Albertina 1.5B PTBR, which has 1.5 billion parameters, 48 layers, and a hidden size of 1536. It is distributed under an MIT license. The underlying DeBERTa model is also distributed under an MIT license.
Training Data
Albertina 1.5B PTBR was trained on a 36-billion-token dataset. The data was collected from openly available American Portuguese corpora from the following source:
- CulturaX: A multilingual corpus freely available for research and AI development, created by combining and cleaning the mC4 and OSCAR datasets. It is derived from the Common Crawl dataset, with additional filtering to retain only pages with permission to be crawled, deduplication, and boilerplate removal. Since CulturaX does not distinguish between Portuguese variants, extra filtering was applied to keep only documents with the Portuguese Internet country-code top-level domain (a sketch of this kind of domain filtering follows this list).
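As an illustration of the country-code filtering mentioned above, the sketch below streams CulturaX from the Hugging Face Hub and keeps only documents whose source URL uses a given country-code top-level domain. The dataset identifier, configuration name, and the url field are assumptions made for this sketch; the authors' actual filtering pipeline is not published in this card.

from urllib.parse import urlparse
from datasets import load_dataset

CCTLD = ".pt"  # country-code top-level domain to keep; adjust to the variant being targeted

# Assumed dataset id and config; streaming avoids downloading the full corpus.
stream = load_dataset("uonlp/CulturaX", "pt", split="train", streaming=True)

def keep_cctld(example):
    # Assumes each record exposes its source URL under the "url" field.
    host = urlparse(example["url"]).netloc
    return host.endswith(CCTLD)

filtered = stream.filter(keep_cctld)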
Preprocessing
The PTBR corpora were filtered using the [BLOOM pre-processing](https://github.com/bigscience-workshop/data-preparation) pipeline. The default stopword filtering was skipped to preserve the syntactic structure, and language identification filtering was also skipped as the corpus was pre-selected as Portuguese.
Training
The [DeBERTa V2 xxlarge](https://huggingface.co/microsoft/deberta-v2-xxlarge) for English was used as the codebase. To train Albertina 1.5B PTBR, the dataset was tokenized with the original DeBERTa tokenizer:
- 128-token sequence truncation and dynamic padding for 250k steps (equivalent to 48 hours of computation on an a2-megagpu-16gb Google Cloud A2 node for 128-token input sequences).
- 256-token sequence truncation for 80k steps (Albertina 1.5B PTBR 256, equivalent to 24 hours of computation for 256-token input sequences).
- 512-token sequence truncation for 60k steps (equivalent to 24 hours of computation for 512-token input sequences).
A learning rate of 1e-5 with linear decay and 10k warm-up steps was used.
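For orientation, the learning-rate schedule described above maps onto the standard linear-warmup scheduler in transformers. The sketch below uses the stated values (learning rate 1e-5, 10k warm-up steps, 250k steps for the first phase); the optimizer choice and the rest of the setup are assumptions, not the authors' actual training script.

import torch
from transformers import AutoModelForMaskedLM, get_linear_schedule_with_warmup

LEARNING_RATE = 1e-5   # as stated in the card
WARMUP_STEPS = 10_000  # as stated in the card
TOTAL_STEPS = 250_000  # steps of the first, 128-token phase

model = AutoModelForMaskedLM.from_pretrained("PORTULAN/albertina-1b5-portuguese-ptbr-encoder")
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)  # optimizer choice is an assumption
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=TOTAL_STEPS
)
# In the training loop, scheduler.step() follows each optimizer.step(): the learning rate
# rises linearly for 10k steps and then decays linearly towards zero.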
Performance
The model was evaluated on extraGLUE, a PTBR version of the GLUE and SuperGLUE benchmarks. The tasks from GLUE and SuperGLUE were automatically translated into Portuguese using DeepL Translate.
Model | RTE (Accuracy) | WNLI (Accuracy) | MRPC (F1) | STS-B (Pearson) | COPA (Accuracy) | CB (F1) | MultiRC (F1) | BoolQ (Accuracy) |
---|---|---|---|---|---|---|---|---|
Albertina 1.5B PTBR | 0.8676 | 0.4742 | 0.8622 | 0.9007 | 0.7767 | 0.6372 | 0.7667 | 0.8654 |
Albertina 1.5B PTBR 256 | 0.8123 | 0.4225 | 0.8638 | 0.8968 | 0.8533 | 0.6884 | 0.6799 | 0.8509 |
Albertina 900M PTBR | 0.7545 | 0.4601 | 0.9071 | 0.8910 | 0.7767 | 0.5799 | 0.6731 | 0.8385 |
BERTimbau (335M) | 0.6446 | 0.5634 | 0.8873 | 0.8842 | 0.6933 | 0.5438 | 0.6787 | 0.7783 |
Albertina 100M PTBR | 0.6582 | 0.5634 | 0.8149 | 0.8489 | n.a. | 0.4771 | 0.6469 | 0.7537 |
DeBERTa 1.5B (English) | 0.7112 | 0.5634 | 0.8545 | 0.0123 | 0.5700 | 0.4307 | 0.3639 | 0.6217 |
DeBERTa 100M (English) | 0.5716 | 0.5587 | 0.8060 | 0.8266 | n.a. | 0.4739 | 0.6391 | 0.6838 |
🔧 Technical Details
Albertina 1.5B PTBR is an encoder in the BERT family, based on the Transformer architecture and developed over the DeBERTa model. It has 1.5 billion parameters, 48 layers, and a hidden size of 1536. The training process involves specific tokenization, sequence truncation, and learning rate settings as described in the Training section.
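These architecture figures can be checked against the published model configuration; the attribute names below are the standard ones exposed by the DeBERTa-v2 configuration class in transformers.

from transformers import AutoConfig

config = AutoConfig.from_pretrained("PORTULAN/albertina-1b5-portuguese-ptbr-encoder")
print(config.hidden_size)        # expected to be 1536, matching the card
print(config.num_hidden_layers)  # expected to be 48, matching the card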
📄 License
This model is distributed under an MIT license, as is the underlying DeBERTa model.
📖 Citation
When using or citing this model, please use the following reference:
@misc{albertina-pt-fostering,
      title={Fostering the Ecosystem of Open Neural Encoders for Portuguese with Albertina PT-* family},
      author={Rodrigo Santos and João Rodrigues and Luís Gomes and João Silva and António Branco and Henrique Lopes Cardoso and Tomás Freitas Osório and Bernardo Leite},
      year={2024},
      eprint={2403.01897},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
🙏 Acknowledgments
The research was partially supported by the following:
- PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020, and FCT—Fundação para a Ciência e Tecnologia under the grant PINFRA/22117/2016.
- Research project ALBERTINA - Foundation Encoder Model for Portuguese and AI, funded by FCT—Fundação para a Ciência e Tecnologia under the grant CPCA-IAC/AV/478394/2022.
- Innovation project ACCELERAT.AI - Multilingual Intelligent Contact Centers, funded by IAPMEI, I.P. - Agência para a Competitividade e Inovação under the grant C625734525-00462629 of the Plano de Recuperação e Resiliência, call RE-C05-i01.01 – Agendas/Alianças Mobilizadoras para a Reindustrialização.
- LIACC - Laboratory for AI and Computer Science, funded by FCT—Fundação para a Ciência e Tecnologia under the grant FCT/UID/CEC/0027/2020.

