🚀 Albertina 100M PTPT
Albertina 100M PTPT is a foundation large language model for European Portuguese, the variant spoken in Portugal. It is an encoder of the BERT family, built on the Transformer architecture and developed over the DeBERTa model. The model offers highly competitive performance for this language variant and is freely distributed under a permissive license.
✨ Features
- Multilingual Tags: Associated with tags such as `albertina-pt*`, `albertina-ptpt`, and `albertina-ptbr`, indicating its relevance to Portuguese language variants and to related tasks such as `fill-mask`.
- Family of Models: Part of the Albertina family, which offers various model sizes and configurations, including Albertina 1.5B PTPT and Albertina 1.5B PTBR.
- Open-Source License: Distributed under the MIT license, promoting accessibility and reuse.
- Diverse Training Data: Trained on a 2.2 billion token dataset sourced from multiple openly available corpora of European Portuguese, including OSCAR, DCEP, Europarl, and ParlamentoPT.
- Evaluated on Downstream Tasks: Assessed on the PTPT version of the GLUE benchmark, demonstrating its performance in real-world applications.
📦 Installation
No dedicated installation step is required beyond the standard Hugging Face libraries; the usage examples below rely on `transformers` and `datasets`, as shown in the sketch that follows.
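As a minimal setup sketch (these commands are not part of the original card), the examples in this document assume a Python environment with PyTorch and the Hugging Face libraries installed:
pip install torch transformers datasets evaluate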
💻 Usage Examples
Basic Usage
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='PORTULAN/albertina-ptpt-base')
>>> unmasker("A culinária portuguesa é rica em sabores e [MASK], tornando-se um dos maiores tesouros do país.")
[{'score': 0.8332648277282715, 'token': 14690, 'token_str': ' costumes', 'sequence': 'A culinária portuguesa é rica em sabores e costumes, tornando-se um dos maiores tesouros do país.'},
{'score': 0.07860890030860901, 'token': 29829, 'token_str': ' cores', 'sequence': 'A culinária portuguesa é rica em sabores e cores, tornando-se um dos maiores tesouros do país.'},
{'score': 0.03278181701898575, 'token': 35277, 'token_str': ' arte', 'sequence': 'A culinária portuguesa é rica em sabores e arte, tornando-se um dos maiores tesouros do país.'},
{'score': 0.009515956044197083, 'token': 9240, 'token_str': ' cor', 'sequence': 'A culinária portuguesa é rica em sabores e cor, tornando-se um dos maiores tesouros do país.'},
{'score': 0.009381960146129131, 'token': 33455, 'token_str': ' nuances', 'sequence': 'A culinária portuguesa é rica em sabores e nuances, tornando-se um dos maiores tesouros do país.'}]
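Because the model is an encoder, it can also be loaded directly to obtain contextual embeddings. The snippet below is an illustrative sketch rather than part of the original card; it reuses the same model identifier and only standard transformers classes:
>>> import torch
>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained('PORTULAN/albertina-ptpt-base')
>>> model = AutoModel.from_pretrained('PORTULAN/albertina-ptpt-base')
>>> inputs = tokenizer("A culinária portuguesa é rica em sabores.", return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)  # forward pass without gradient tracking
>>> outputs.last_hidden_state.shape  # (batch size, sequence length, hidden size of 768)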
Advanced Usage
The model can also be fine-tuned for a specific downstream task, for example the RTE task from the GLUE-PT benchmark:
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
>>> from datasets import load_dataset
>>> model = AutoModelForSequenceClassification.from_pretrained("PORTULAN/albertina-ptpt-base", num_labels=2)
>>> tokenizer = AutoTokenizer.from_pretrained("PORTULAN/albertina-ptpt-base")
>>> dataset = load_dataset("PORTULAN/glue-ptpt", "rte")
>>> def tokenize_function(examples):
... return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True)
>>> tokenized_datasets = dataset.map(tokenize_function, batched=True)
>>> training_args = TrainingArguments(output_dir="albertina-ptpt-rte", evaluation_strategy="epoch")
>>> trainer = Trainer(
... model=model,
... args=training_args,
... train_dataset=tokenized_datasets["train"],
... eval_dataset=tokenized_datasets["validation"],
... )
>>> trainer.train()
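After training, the fine-tuned model can be scored on the validation split. The compute_metrics helper below is an illustrative addition (not part of the original recipe) and assumes the evaluate package is installed:
>>> import numpy as np
>>> import evaluate
>>> accuracy = evaluate.load("accuracy")
>>> def compute_metrics(eval_pred):
...     # Convert logits to class predictions and compare against the gold labels
...     logits, labels = eval_pred
...     predictions = np.argmax(logits, axis=-1)
...     return accuracy.compute(predictions=predictions, references=labels)
>>> trainer.compute_metrics = compute_metrics
>>> trainer.evaluate()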
📚 Documentation
Model Description
This model card pertains to Albertina 100M PTPT base, which has 100M parameters, 12 layers, and a hidden size of 768. Both this model and the DeBERTa model it builds on are distributed under the MIT license.
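These dimensions can be read back from the published configuration; the following sketch assumes the checkpoint exposes the standard DeBERTa configuration fields:
>>> from transformers import AutoConfig
>>> config = AutoConfig.from_pretrained('PORTULAN/albertina-ptpt-base')
>>> config.num_hidden_layers, config.hidden_size
(12, 768)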
Training Data
Albertina 100M PTPT was trained on a 2.2 billion token dataset. This dataset was compiled from several openly available European Portuguese corpora:
- OSCAR: A multilingual dataset that includes Portuguese documents. It is derived from the Common Crawl dataset, with additional filtering to retain only Portuguese documents from Portugal.
- DCEP: The Digital Corpus of the European Parliament, with its European Portuguese portion retained.
- Europarl: The European Parliament Proceedings Parallel Corpus, with its European Portuguese portion used.
- ParlamentoPT: A dataset obtained by gathering publicly available documents of Portuguese Parliament debates.
Preprocessing
The PTPT corpora were filtered using the BLOOM pre-processing pipeline. Default stopword filtering was skipped to preserve syntactic structure, and language identification filtering was omitted as the corpus was pre-selected as Portuguese.
Training
The DeBERTa V1 base model for English was used as the codebase. The dataset was tokenized with the original DeBERTa tokenizer, with sequences truncated to 128 tokens and padded dynamically. The model was trained with a batch size of 3072 samples (192 samples per GPU) and a learning rate of 1e-5 with linear decay and 10k warm-up steps. A total of 200 training epochs were performed, resulting in approximately 180k steps. Training ran for one day on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs, and 1,360 GB of RAM.
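The original training script is not reproduced here, but the tokenization setup described above (128-token truncation with dynamic padding) corresponds to the simplified sketch below, in which the "text" column name is an assumption:
>>> from transformers import AutoTokenizer, DataCollatorWithPadding
>>> tokenizer = AutoTokenizer.from_pretrained('PORTULAN/albertina-ptpt-base')
>>> def tokenize(examples):
...     # Truncate each document to 128 tokens; padding is left to the collator
...     return tokenizer(examples["text"], truncation=True, max_length=128)
>>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # pads dynamically per batch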
Evaluation
The base model version was evaluated on downstream tasks, namely PTPT translations of English datasets from the widely used GLUE benchmark.
GLUE tasks translated
The GLUE-PT, a PTPT version of the GLUE benchmark, was used. Four tasks from GLUE were automatically translated using DeepL Translate. The evaluation results are as follows:
| Model | RTE (Accuracy) | WNLI (Accuracy) | MRPC (F1) | STS-B (Pearson) |
|---|---|---|---|---|
| Albertina 900M PTPT | 0.8339 | 0.4225 | 0.9171 | 0.8801 |
| Albertina 100M PTPT | 0.6787 | 0.4507 | 0.8829 | 0.8581 |
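For reference, the metrics in the table follow the standard GLUE metric definitions for each task. The snippet below is a hedged sketch showing how they can be loaded with the evaluate library:
>>> import evaluate
>>> rte_metric = evaluate.load("glue", "rte")    # accuracy
>>> wnli_metric = evaluate.load("glue", "wnli")  # accuracy
>>> mrpc_metric = evaluate.load("glue", "mrpc")  # accuracy and F1
>>> stsb_metric = evaluate.load("glue", "stsb")  # Pearson and Spearman correlations
>>> mrpc_metric.compute(predictions=[1, 0, 1, 1], references=[1, 0, 0, 1])
{'accuracy': 0.75, 'f1': 0.8}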
Citation
When using or citing this model, please cite the following publication:
@misc{albertina-pt-fostering,
  title={Fostering the Ecosystem of Open Neural Encoders for Portuguese with Albertina PT-* family},
  author={Rodrigo Santos and João Rodrigues and Luís Gomes and João Silva and António Branco and Henrique Lopes Cardoso and Tomás Freitas Osório and Bernardo Leite},
  year={2024},
  eprint={2403.01897},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Acknowledgments
The research reported here was partially supported by several entities:
- PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020, and FCT—Fundação para a Ciência e Tecnologia under the grant PINFRA/22117/2016.
- Research project ALBERTINA - Foundation Encoder Model for Portuguese and AI, funded by FCT—Fundação para a Ciência e Tecnologia under the grant CPCA-IAC/AV/478394/2022.
- Innovation project ACCELERAT.AI - Multilingual Intelligent Contact Centers, funded by IAPMEI, I.P. - Agência para a Competitividade e Inovação under the grant C625734525-00462629, of Plano de Recuperação e Resiliência, call RE-C05-i01.01 – Agendas/Alianças Mobilizadoras para a Reindustrialização.
- LIACC - Laboratory for AI and Computer Science, funded by FCT—Fundação para a Ciência e Tecnologia under the grant FCT/UID/CEC/0027/2020.
🔧 Technical Details
- Model Architecture: Based on the Transformer architecture and the DeBERTa model, serving as an encoder in the BERT family.
- Parameters: 100M parameters, 12 layers, and a hidden size of 768.
- Training Configuration: Trained with a batch size of 3072 samples (192 samples per GPU), a learning rate of 1e-5 with linear decay and 10k warm-up steps, for 200 epochs (approximately 180k steps).
- Hardware: Trained for one day on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs, and 1,360 GB of RAM.
📄 License
This model is distributed under the MIT license.

