🚀 Albertina 100M PTBR
Albertina 100M PTBR is a foundation language model for Brazilian Portuguese. It is an encoder of the BERT family, based on the Transformer neural architecture and developed on top of the DeBERTa model, and it offers highly competitive performance for this language. It is distributed free of charge under a permissive license.
✨ Features
- Powerful Encoder: Based on the BERT family and the Transformer architecture, it provides strong encoding capabilities for Portuguese text.
- Trained on High-Quality Data: Pre-trained on a carefully curated selection of 3.7 billion tokens from the OSCAR dataset, additionally filtered for Brazilian Portuguese.
- Good Performance on Downstream Tasks: Demonstrates competitive results on various Portuguese downstream tasks, such as those in the PLUE dataset.
📦 Installation
No dedicated installation steps are specified; the usage examples below rely only on standard Hugging Face libraries (transformers, plus datasets for the fine-tuning example).
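As a reasonable assumption (the original card does not pin an environment), installing the libraries the examples import is enough, for example:

pip install torch transformers datasets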
💻 Usage Examples
Basic Usage
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='PORTULAN/albertina-ptbr-base')
>>> unmasker("A culinária brasileira é rica em sabores e [MASK], tornando-se um dos maiores patrimônios do país.")
[{'score': 0.9391396045684814, 'token': 14690, 'token_str': ' costumes', 'sequence': 'A culinária brasileira é rica em sabores e costumes, tornando-se um dos maiores patrimônios do país.'},
{'score': 0.04568921774625778, 'token': 29829, 'token_str': ' cores', 'sequence': 'A culinária brasileira é rica em sabores e cores, tornando-se um dos maiores patrimônios do país.'},
{'score': 0.004134135786443949, 'token': 6696, 'token_str': ' drinks', 'sequence': 'A culinária brasileira é rica em sabores e drinks, tornando-se um dos maiores patrimônios do país.'},
{'score': 0.0009097770671360195, 'token': 33455, 'token_str': ' nuances', 'sequence': 'A culinária brasileira é rica em sabores e nuances, tornando-se um dos maiores patrimônios do país.'},
{'score': 0.0008549498743377626, 'token': 606, 'token_str': ' comes', 'sequence': 'A culinária brasileira é rica em sabores e comes, tornando-se um dos maiores patrimônios do país.'}]
Advanced Usage
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
>>> from datasets import load_dataset
>>> model = AutoModelForSequenceClassification.from_pretrained("PORTULAN/albertina-ptbr-base", num_labels=2)
>>> tokenizer = AutoTokenizer.from_pretrained("PORTULAN/albertina-ptbr-base")
>>> dataset = load_dataset("PORTULAN/glue-ptpt", "rte")
>>> def tokenize_function(examples):
...     return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True)
>>> tokenized_datasets = dataset.map(tokenize_function, batched=True)
>>> training_args = TrainingArguments(output_dir="albertina-ptpt-rte", evaluation_strategy="epoch")
>>> trainer = Trainer(
... model=model,
... args=training_args,
... train_dataset=tokenized_datasets["train"],
... eval_dataset=tokenized_datasets["validation"],
... )
>>> trainer.train()
📚 Documentation
Model Description
This model card is for Albertina 100M PTBR, which has 100M parameters, 12 layers, and a hidden size of 768. Albertina PT-BR base is distributed under an MIT license. DeBERTa is also distributed under an MIT license.
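As a quick sanity check of these dimensions (a minimal sketch, reusing the model identifier from the usage examples above), the configuration can be inspected directly:

>>> from transformers import AutoConfig
>>> config = AutoConfig.from_pretrained("PORTULAN/albertina-ptbr-base")
>>> config.num_hidden_layers, config.hidden_size  # expected, per this card: 12 layers, hidden size 768
(12, 768)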
Training Data
Albertina 100M PTBR was trained on a 3.7-billion-token curated selection of documents from the OSCAR dataset. The OSCAR dataset includes documents in over a hundred languages, including Portuguese, and is widely used in the literature. It is the result of a selection from the Common Crawl dataset, which crawls the web, retains only pages whose metadata indicates permission to be crawled, performs deduplication, and removes some boilerplate, among other filters. Since it does not distinguish between Portuguese variants, additional filtering was performed to keep only documents whose metadata indicates the Internet country-code top-level domain of Brazil (.br). The January 2023 version of OSCAR, based on the November/December 2022 version of Common Crawl, was used.
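As an illustration only (the field names below are hypothetical and not the actual OSCAR schema), the variant selection amounts to keeping documents whose source URL falls under the Brazilian .br top-level domain:

from urllib.parse import urlparse

def keep_brazilian_documents(records):
    """Yield only documents whose source URL ends in the .br ccTLD.
    Sketch only: assumes each record exposes its URL as record["meta"]["url"];
    the real OSCAR metadata layout may differ."""
    for record in records:
        host = urlparse(record["meta"]["url"]).hostname or ""
        if host.endswith(".br"):
            yield record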
Preprocessing
The PT-BR corpora were filtered using the BLOOM pre-processing pipeline. The default stopword filtering was skipped, since it would disrupt the syntactic structure, and language-identification filtering was also skipped, since the corpus had already been pre-selected as Portuguese.
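A hedged summary of these choices (the names below are placeholders describing the filtering decisions, not the actual BLOOM pipeline API):

# Placeholder summary of the filtering decisions described above;
# not real BLOOM pre-processing code.
FILTERING_DECISIONS = {
    "default_bloom_filters": True,         # applied as in the BLOOM pipeline
    "stopword_filtering": False,           # skipped: would disrupt syntactic structure
    "language_identification": False,      # skipped: corpus already selected as Portuguese
}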
Training
The DeBERTa V1 base model for English was used as the codebase. To train Albertina 100M PTBR, the dataset was tokenized with the original DeBERTa tokenizer, with 128-token sequence truncation and dynamic padding. The model was trained using the maximum available memory capacity, resulting in a batch size of 3072 samples (192 samples per GPU). A learning rate of 1e-5 with linear decay and 10k warm-up steps was used. The model was trained for a total of 150 epochs, corresponding to approximately 180k steps. Training took one day on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs, and 1,360 GB of RAM.
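These hyperparameters map onto a standard masked-language-modelling setup; the following is a hedged sketch using the Hugging Face Trainer, not the authors' actual training code (microsoft/deberta-base stands in for the DeBERTa V1 base codebase, and a toy corpus stands in for the OSCAR selection):

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          TrainingArguments, Trainer)

# microsoft/deberta-base stands in for the English DeBERTa V1 base codebase.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
model = AutoModelForMaskedLM.from_pretrained("microsoft/deberta-base")

# Toy placeholder corpus; the real training data is the filtered OSCAR selection.
dataset = Dataset.from_dict({"text": ["A culinária brasileira é rica em sabores e costumes."] * 64})

def tokenize(batch):
    # 128-token sequence truncation, as described above; padding is left to the collator.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized_corpus = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic padding plus random masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

args = TrainingArguments(
    output_dir="albertina-100m-ptbr-pretraining",
    per_device_train_batch_size=192,   # 192 samples per GPU x 16 GPUs = 3072
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=10_000,
    num_train_epochs=150,
)

trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=tokenized_corpus)
trainer.train()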
Evaluation
The base model versions were evaluated on downstream tasks, specifically on PT-BR translations of the English datasets used for some of the tasks in the widely used GLUE benchmark.
GLUE tasks translated
The PLUE (Portuguese Language Understanding Evaluation) dataset, obtained by automatically translating GLUE into PT-BR, was used. Four tasks from PLUE were addressed:
- Two similarity tasks: MRPC, for detecting whether two sentences are paraphrases of each other, and STS-B, for semantic textual similarity.
- Two inference tasks: RTE, for recognizing textual entailment, and WNLI, for coreference and natural language inference.
Model | RTE (Accuracy) | WNLI (Accuracy) | MRPC (F1) | STS-B (Pearson) |
---|---|---|---|---|
Albertina 900M PTBR No-brWaC | 0.7798 | 0.5070 | 0.9167 | 0.8743 |
Albertina 900M PTBR | 0.7545 | 0.4601 | 0.9071 | 0.8910 |
Albertina 100M PTBR | 0.6462 | 0.5493 | 0.8779 | 0.8501 |
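To obtain scores like these for the classification-style tasks, a metrics function can be attached to the Trainer from the Advanced Usage example above; this is a minimal sketch for accuracy (as reported for RTE and WNLI), assuming the evaluate library is installed:

import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair supplied by the Trainer.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

# Passed when constructing the Trainer, e.g.:
# trainer = Trainer(..., compute_metrics=compute_metrics)

F1 for MRPC and Pearson correlation for STS-B follow the same pattern, using evaluate.load("f1") and evaluate.load("pearsonr") respectively.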
🔧 Technical Details
- Model Architecture: An encoder of the BERT family, based on the Transformer architecture and the DeBERTa model.
- Training Parameters: Batch size of 3072 samples, learning rate of 1e-5 with linear decay and 10k warm-up steps, 150 training epochs in total.
- Hardware: Trained for one day on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs, and 1,360 GB of RAM.
📄 License
Albertina PT-BR base is distributed under an MIT license. DeBERTa is also distributed under an MIT license.
📜 Citation
When using or citing this model, please cite the following publication:
@misc{albertina-pt-fostering,
  title={Fostering the Ecosystem of Open Neural Encoders for Portuguese with Albertina PT-* family},
  author={Rodrigo Santos and João Rodrigues and Luís Gomes and João Silva and António Branco and Henrique Lopes Cardoso and Tomás Freitas Osório and Bernardo Leite},
  year={2024},
  eprint={2403.01897},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
🙏 Acknowledgments
The research reported here was partially supported by: PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020 and FCT—Fundação para a Ciência e Tecnologia under the grant PINFRA/22117/2016; research project ALBERTINA - Foundation Encoder Model for Portuguese and AI, funded by FCT—Fundação para a Ciência e Tecnologia under the grant CPCA-IAC/AV/478394/2022; innovation project ACCELERAT.AI - Multilingual Intelligent Contact Centers, funded by IAPMEI, I.P. - Agência para a Competitividade e Inovação under the grant C625734525-00462629 of Plano de Recuperação e Resiliência, call RE-C05-i01.01 – Agendas/Alianças Mobilizadoras para a Reindustrialização; and LIACC - Laboratory for AI and Computer Science, funded by FCT—Fundação para a Ciência e Tecnologia under the grant FCT/UID/CEC/0027/2020.
📋 Model Information
Property | Details |
---|---|
Model Type | Encoder in the BERT family, based on Transformer and DeBERTa |
Training Data | 3.7 billion token curated selection from the OSCAR dataset, with additional filtering for Brazilian Portuguese |

