🚀 Albertina 100M PTPT
Albertina 100M PTPT is a foundation large language model for European Portuguese, the variant spoken in Portugal. It is an encoder of the BERT family, built on the Transformer architecture and developed over the DeBERTa model. The model offers highly competitive performance for this language variant and is freely distributed under a permissive license.
✨ Features
- Multilingual Tags: Associated with tags such as `albertina-pt*`, `albertina-ptpt`, and `albertina-ptbr`, indicating its relevance to Portuguese language variants and to related tasks such as `fill-mask`.
- Family of Models: Part of the Albertina family, which offers various model sizes and configurations, including Albertina 1.5B PTPT and Albertina 1.5B PTBR.
- Open-Source License: Distributed under the MIT license, promoting accessibility and reuse.
- Diverse Training Data: Trained on a 2.2 billion token dataset sourced from multiple openly available corpora of European Portuguese, including OSCAR, DCEP, Europarl, and ParlamentoPT.
- Evaluated on Downstream Tasks: Assessed on the PTPT version of the GLUE benchmark, demonstrating its performance in real-world applications.
📦 Installation
No dedicated installation step is required beyond the standard Hugging Face libraries; the usage examples below rely on `transformers` and `datasets`, as shown in the sketch that follows.
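As a minimal setup sketch (these commands are not part of the original card), the examples in this document assume a Python environment with PyTorch and the Hugging Face libraries installed:
pip install torch transformers datasets evaluate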
💻 Usage Examples
Basic Usage
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='PORTULAN/albertina-ptpt-base')
>>> unmasker("A culinária portuguesa é rica em sabores e [MASK], tornando-se um dos maiores tesouros do país.")
[{'score': 0.8332648277282715, 'token': 14690, 'token_str': ' costumes', 'sequence': 'A culinária portuguesa é rica em sabores e costumes, tornando-se um dos maiores tesouros do país.'},
{'score': 0.07860890030860901, 'token': 29829, 'token_str': ' cores', 'sequence': 'A culinária portuguesa é rica em sabores e cores, tornando-se um dos maiores tesouros do país.'},
{'score': 0.03278181701898575, 'token': 35277, 'token_str': ' arte', 'sequence': 'A culinária portuguesa é rica em sabores e arte, tornando-se um dos maiores tesouros do país.'},
{'score': 0.009515956044197083, 'token': 9240, 'token_str': ' cor', 'sequence': 'A culinária portuguesa é rica em sabores e cor, tornando-se um dos maiores tesouros do país.'},
{'score': 0.009381960146129131, 'token': 33455, 'token_str': ' nuances', 'sequence': 'A culinária portuguesa é rica em sabores e nuances, tornando-se um dos maiores tesouros do país.'}]
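Because the model is an encoder, it can also be loaded directly to obtain contextual embeddings. The snippet below is an illustrative sketch rather than part of the original card; it reuses the same model identifier and only standard transformers classes:
>>> import torch
>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained('PORTULAN/albertina-ptpt-base')
>>> model = AutoModel.from_pretrained('PORTULAN/albertina-ptpt-base')
>>> inputs = tokenizer("A culinária portuguesa é rica em sabores.", return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)  # forward pass without gradient tracking
>>> outputs.last_hidden_state.shape  # (batch size, sequence length, hidden size of 768)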
Advanced Usage
The model can also be fine-tuned for a specific downstream task, for example the RTE task from the GLUE-PT benchmark:
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
>>> from datasets import load_dataset
>>> model = AutoModelForSequenceClassification.from_pretrained("PORTULAN/albertina-ptpt-base", num_labels=2)
>>> tokenizer = AutoTokenizer.from_pretrained("PORTULAN/albertina-ptpt-base")
>>> dataset = load_dataset("PORTULAN/glue-ptpt", "rte")
>>> def tokenize_function(examples):
... return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True)
>>> tokenized_datasets = dataset.map(tokenize_function, batched=True)
>>> training_args = TrainingArguments(output_dir="albertina-ptpt-rte", evaluation_strategy="epoch")
>>> trainer = Trainer(
... model=model,
... args=training_args,
... train_dataset=tokenized_datasets["train"],
... eval_dataset=tokenized_datasets["validation"],
... )
>>> trainer.train()
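After training, the fine-tuned model can be scored on the validation split. The compute_metrics helper below is an illustrative addition (not part of the original recipe) and assumes the evaluate package is installed:
>>> import numpy as np
>>> import evaluate
>>> accuracy = evaluate.load("accuracy")
>>> def compute_metrics(eval_pred):
...     # Convert logits to class predictions and compare against the gold labels
...     logits, labels = eval_pred
...     predictions = np.argmax(logits, axis=-1)
...     return accuracy.compute(predictions=predictions, references=labels)
>>> trainer.compute_metrics = compute_metrics
>>> trainer.evaluate()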
📚 Documentation
Model Description
This model card pertains to Albertina 100M PTPT base, which has 100M parameters, 12 layers, and a hidden size of 768. Both this model and the DeBERTa model it builds on are distributed under the MIT license.
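These dimensions can be read back from the published configuration; the following sketch assumes the checkpoint exposes the standard DeBERTa configuration fields:
>>> from transformers import AutoConfig
>>> config = AutoConfig.from_pretrained('PORTULAN/albertina-ptpt-base')
>>> config.num_hidden_layers, config.hidden_size
(12, 768)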
Training Data
Albertina 100M PTPT was trained on a 2.2 billion token dataset. This dataset was compiled from several openly available European Portuguese corpora:
- OSCAR: A multilingual dataset that includes Portuguese documents. It is derived from the Common Crawl dataset, with additional filtering to retain only Portuguese documents from Portugal.
- DCEP: The Digital Corpus of the European Parliament, with its European Portuguese portion retained.
- Europarl: The European Parliament Proceedings Parallel Corpus, with its European Portuguese portion used.
- ParlamentoPT: A dataset obtained by gathering publicly available documents of Portuguese Parliament debates.
Preprocessing
The PTPT corpora were filtered using the BLOOM pre-processing pipeline. Default stopword filtering was skipped to preserve syntactic structure, and language identification filtering was omitted as the corpus was pre-selected as Portuguese.
Training
The DeBERTa V1 base model for English was used as the codebase. The dataset was tokenized with the original DeBERTa tokenizer, with sequences truncated to 128 tokens and padded dynamically. The model was trained with a batch size of 3072 samples (192 samples per GPU) and a learning rate of 1e-5 with linear decay and 10k warm-up steps. A total of 200 training epochs were performed, resulting in approximately 180k steps. Training ran for one day on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs, and 1,360 GB of RAM.
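The original training script is not reproduced here, but the tokenization setup described above (128-token truncation with dynamic padding) corresponds to the simplified sketch below, in which the "text" column name is an assumption:
>>> from transformers import AutoTokenizer, DataCollatorWithPadding
>>> tokenizer = AutoTokenizer.from_pretrained('PORTULAN/albertina-ptpt-base')
>>> def tokenize(examples):
...     # Truncate each document to 128 tokens; padding is left to the collator
...     return tokenizer(examples["text"], truncation=True, max_length=128)
>>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # pads dynamically per batch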
Evaluation
The base model version was evaluated on downstream tasks, namely PTPT translations of English datasets from the widely used GLUE benchmark.
GLUE tasks translated
The GLUE-PT, a PTPT version of the GLUE benchmark, was used. Four tasks from GLUE were automatically translated using DeepL Translate. The evaluation results are as follows:
| Model | RTE (Accuracy) | WNLI (Accuracy) | MRPC (F1) | STS-B (Pearson) |
|---|---|---|---|---|
| Albertina 900M PTPT | 0.8339 | 0.4225 | 0.9171 | 0.8801 |
| Albertina 100M PTPT | 0.6787 | 0.4507 | 0.8829 | 0.8581 |
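For reference, the metrics in the table follow the standard GLUE metric definitions for each task. The snippet below is a hedged sketch showing how they can be loaded with the evaluate library:
>>> import evaluate
>>> rte_metric = evaluate.load("glue", "rte")    # accuracy
>>> wnli_metric = evaluate.load("glue", "wnli")  # accuracy
>>> mrpc_metric = evaluate.load("glue", "mrpc")  # accuracy and F1
>>> stsb_metric = evaluate.load("glue", "stsb")  # Pearson and Spearman correlations
>>> mrpc_metric.compute(predictions=[1, 0, 1, 1], references=[1, 0, 0, 1])
{'accuracy': 0.75, 'f1': 0.8}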
Citation
When using or citing this model, please cite the following publication:
@misc{albertina-pt-fostering,
  title={Fostering the Ecosystem of Open Neural Encoders for Portuguese with Albertina PT-* family},
  author={Rodrigo Santos and João Rodrigues and Luís Gomes and João Silva and António Branco and Henrique Lopes Cardoso and Tomás Freitas Osório and Bernardo Leite},
  year={2024},
  eprint={2403.01897},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Acknowledgments
The research reported here was partially supported by several entities:
- PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020, and FCT—Fundação para a Ciência e Tecnologia under the grant PINFRA/22117/2016.
- Research project ALBERTINA - Foundation Encoder Model for Portuguese and AI, funded by FCT—Fundação para a Ciência e Tecnologia under the grant CPCA-IAC/AV/478394/2022.
- Innovation project ACCELERAT.AI - Multilingual Intelligent Contact Centers, funded by IAPMEI, I.P. - Agência para a Competitividade e Inovação under the grant C625734525-00462629, of Plano de Recuperação e Resiliência, call RE-C05-i01.01 – Agendas/Alianças Mobilizadoras para a Reindustrialização.
- LIACC - Laboratory for AI and Computer Science, funded by FCT—Fundação para a Ciência e Tecnologia under the grant FCT/UID/CEC/0027/2020.
🔧 Technical Details
- Model Architecture: Based on the Transformer architecture and the DeBERTa model, serving as an encoder in the BERT family.
- Parameters: 100M parameters, 12 layers, and a hidden size of 768.
- Training Configuration: Trained with a batch size of 3072 samples (192 samples per GPU), a learning rate of 1e-5 with linear decay and 10k warm-up steps, for 200 epochs (approximately 180k steps).
- Hardware: Trained for one day on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs, and 1,360 GB of RAM.
📄 License
This model is distributed under the MIT license.

