🚀 Albertina 100M PTBR
Albertina 100M PTBR is a foundation language model for Brazilian Portuguese. It is an encoder of the BERT family, based on the Transformer neural architecture and developed on top of the DeBERTa model, and it offers highly competitive performance for this language. It is distributed free of charge under a permissive license.
✨ Features
- Powerful Encoder: Based on the BERT family and the Transformer architecture, it provides strong encoding capabilities for Portuguese text.
- Trained on High-Quality Data: Pre-trained on a carefully curated selection of 3.7 billion tokens from the OSCAR dataset, additionally filtered for Brazilian Portuguese.
- Good Performance on Downstream Tasks: Demonstrates competitive results on various Portuguese downstream tasks, such as those in the PLUE dataset.
📦 Installation
No dedicated installation steps are specified; the usage examples below rely only on standard Hugging Face libraries (transformers, plus datasets for the fine-tuning example).
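As a reasonable assumption (the original card does not pin an environment), installing the libraries the examples import is enough, for example:

pip install torch transformers datasets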
💻 Usage Examples
Basic Usage
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='PORTULAN/albertina-ptbr-base')
>>> unmasker("A culinária brasileira é rica em sabores e [MASK], tornando-se um dos maiores patrimônios do país.")
[{'score': 0.9391396045684814, 'token': 14690, 'token_str': ' costumes', 'sequence': 'A culinária brasileira é rica em sabores e costumes, tornando-se um dos maiores patrimônios do país.'},
{'score': 0.04568921774625778, 'token': 29829, 'token_str': ' cores', 'sequence': 'A culinária brasileira é rica em sabores e cores, tornando-se um dos maiores patrimônios do país.'},
{'score': 0.004134135786443949, 'token': 6696, 'token_str': ' drinks', 'sequence': 'A culinária brasileira é rica em sabores e drinks, tornando-se um dos maiores patrimônios do país.'},
{'score': 0.0009097770671360195, 'token': 33455, 'token_str': ' nuances', 'sequence': 'A culinária brasileira é rica em sabores e nuances, tornando-se um dos maiores patrimônios do país.'},
{'score': 0.0008549498743377626, 'token': 606, 'token_str': ' comes', 'sequence': 'A culinária brasileira é rica em sabores e comes, tornando-se um dos maiores patrimônios do país.'}]
Advanced Usage
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
>>> from datasets import load_dataset
>>> model = AutoModelForSequenceClassification.from_pretrained("PORTULAN/albertina-ptbr-base", num_labels=2)
>>> tokenizer = AutoTokenizer.from_pretrained("PORTULAN/albertina-ptbr-base")
>>> dataset = load_dataset("PORTULAN/glue-ptpt", "rte")
>>> def tokenize_function(examples):
...     return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True)
>>> tokenized_datasets = dataset.map(tokenize_function, batched=True)
>>> training_args = TrainingArguments(output_dir="albertina-ptpt-rte", evaluation_strategy="epoch")
>>> trainer = Trainer(
... model=model,
... args=training_args,
... train_dataset=tokenized_datasets["train"],
... eval_dataset=tokenized_datasets["validation"],
... )
>>> trainer.train()
📚 Documentation
Model Description
This model card is for Albertina 100M PTBR, which has 100M parameters, 12 layers, and a hidden size of 768. Albertina PT-BR base is distributed under an MIT license. DeBERTa is also distributed under an MIT license.
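As a quick sanity check of these dimensions (a minimal sketch, reusing the model identifier from the usage examples above), the configuration can be inspected directly:

>>> from transformers import AutoConfig
>>> config = AutoConfig.from_pretrained("PORTULAN/albertina-ptbr-base")
>>> config.num_hidden_layers, config.hidden_size  # expected, per this card: 12 layers, hidden size 768
(12, 768)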
Training Data
Albertina 100M PTBR was trained on a 3.7-billion-token curated selection of documents from the OSCAR dataset. The OSCAR dataset includes documents in over a hundred languages, including Portuguese, and is widely used in the literature. It is the result of a selection from the Common Crawl dataset, which crawls the web, retains only pages whose metadata indicates permission to be crawled, performs deduplication, and removes some boilerplate, among other filters. Since it does not distinguish between Portuguese variants, additional filtering was performed to keep only documents whose metadata indicates the Internet country-code top-level domain of Brazil (.br). The January 2023 version of OSCAR, based on the November/December 2022 version of Common Crawl, was used.
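As an illustration only (the field names below are hypothetical and not the actual OSCAR schema), the variant selection amounts to keeping documents whose source URL falls under the Brazilian .br top-level domain:

from urllib.parse import urlparse

def keep_brazilian_documents(records):
    """Yield only documents whose source URL ends in the .br ccTLD.
    Sketch only: assumes each record exposes its URL as record["meta"]["url"];
    the real OSCAR metadata layout may differ."""
    for record in records:
        host = urlparse(record["meta"]["url"]).hostname or ""
        if host.endswith(".br"):
            yield record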
Preprocessing
The PT-BR corpora were filtered using the BLOOM pre-processing pipeline. The default stopword filtering was skipped, since it would disrupt the syntactic structure, and language-identification filtering was also skipped, since the corpus had already been pre-selected as Portuguese.
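A hedged summary of these choices (the names below are placeholders describing the filtering decisions, not the actual BLOOM pipeline API):

# Placeholder summary of the filtering decisions described above;
# not real BLOOM pre-processing code.
FILTERING_DECISIONS = {
    "default_bloom_filters": True,         # applied as in the BLOOM pipeline
    "stopword_filtering": False,           # skipped: would disrupt syntactic structure
    "language_identification": False,      # skipped: corpus already selected as Portuguese
}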
Training
The DeBERTa V1 base model for English was used as the codebase. To train Albertina 100M PTBR, the dataset was tokenized with the original DeBERTa tokenizer, with 128-token sequence truncation and dynamic padding. The model was trained using the maximum available memory capacity, resulting in a batch size of 3072 samples (192 samples per GPU). A learning rate of 1e-5 with linear decay and 10k warm-up steps was used. The model was trained for a total of 150 epochs, corresponding to approximately 180k steps. Training took one day on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs, and 1,360 GB of RAM.
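These hyperparameters map onto a standard masked-language-modelling setup; the following is a hedged sketch using the Hugging Face Trainer, not the authors' actual training code (microsoft/deberta-base stands in for the DeBERTa V1 base codebase, and a toy corpus stands in for the OSCAR selection):

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          TrainingArguments, Trainer)

# microsoft/deberta-base stands in for the English DeBERTa V1 base codebase.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
model = AutoModelForMaskedLM.from_pretrained("microsoft/deberta-base")

# Toy placeholder corpus; the real training data is the filtered OSCAR selection.
dataset = Dataset.from_dict({"text": ["A culinária brasileira é rica em sabores e costumes."] * 64})

def tokenize(batch):
    # 128-token sequence truncation, as described above; padding is left to the collator.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized_corpus = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic padding plus random masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

args = TrainingArguments(
    output_dir="albertina-100m-ptbr-pretraining",
    per_device_train_batch_size=192,   # 192 samples per GPU x 16 GPUs = 3072
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=10_000,
    num_train_epochs=150,
)

trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=tokenized_corpus)
trainer.train()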
Evaluation
The base model versions were evaluated on downstream tasks, specifically on PT-BR translations of the English datasets used for some of the tasks in the widely used GLUE benchmark.
GLUE tasks translated
The PLUE (Portuguese Language Understanding Evaluation) dataset, obtained by automatically translating GLUE into PT-BR, was used. Four tasks from PLUE were addressed:
- Two similarity tasks: MRPC, for detecting whether two sentences are paraphrases of each other, and STS-B, for semantic textual similarity.
- Two inference tasks: RTE, for recognizing textual entailment, and WNLI, for coreference and natural language inference.
Model | RTE (Accuracy) | WNLI (Accuracy) | MRPC (F1) | STS-B (Pearson) |
---|---|---|---|---|
Albertina 900M PTBR No-brWaC | 0.7798 | 0.5070 | 0.9167 | 0.8743 |
Albertina 900M PTBR | 0.7545 | 0.4601 | 0.9071 | 0.8910 |
Albertina 100M PTBR | 0.6462 | 0.5493 | 0.8779 | 0.8501 |
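To obtain scores like these for the classification-style tasks, a metrics function can be attached to the Trainer from the Advanced Usage example above; this is a minimal sketch for accuracy (as reported for RTE and WNLI), assuming the evaluate library is installed:

import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair supplied by the Trainer.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

# Passed when constructing the Trainer, e.g.:
# trainer = Trainer(..., compute_metrics=compute_metrics)

F1 for MRPC and Pearson correlation for STS-B follow the same pattern, using evaluate.load("f1") and evaluate.load("pearsonr") respectively.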
🔧 Technical Details
- Model Architecture: An encoder of the BERT family, based on the Transformer architecture and the DeBERTa model.
- Training Parameters: Batch size of 3072 samples, learning rate of 1e-5 with linear decay and 10k warm-up steps, 150 training epochs in total.
- Hardware: Trained for one day on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs, and 1,360 GB of RAM.
📄 License
Albertina PT-BR base is distributed under an MIT license. DeBERTa is also distributed under an MIT license.
📜 Citation
When using or citing this model, please cite the following publication:
@misc{albertina-pt-fostering,
  title={Fostering the Ecosystem of Open Neural Encoders for Portuguese with Albertina PT-* family},
  author={Rodrigo Santos and João Rodrigues and Luís Gomes and João Silva and António Branco and Henrique Lopes Cardoso and Tomás Freitas Osório and Bernardo Leite},
  year={2024},
  eprint={2403.01897},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
🙏 Acknowledgments
The research reported here was partially supported by: PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020 and FCT—Fundação para a Ciência e Tecnologia under the grant PINFRA/22117/2016; research project ALBERTINA - Foundation Encoder Model for Portuguese and AI, funded by FCT—Fundação para a Ciência e Tecnologia under the grant CPCA-IAC/AV/478394/2022; innovation project ACCELERAT.AI - Multilingual Intelligent Contact Centers, funded by IAPMEI, I.P. - Agência para a Competitividade e Inovação under the grant C625734525-00462629 of Plano de Recuperação e Resiliência, call RE-C05-i01.01 – Agendas/Alianças Mobilizadoras para a Reindustrialização; and LIACC - Laboratory for AI and Computer Science, funded by FCT—Fundação para a Ciência e Tecnologia under the grant FCT/UID/CEC/0027/2020.
📋 Model Information
Property | Details |
---|---|
Model Type | Encoder in the BERT family, based on Transformer and DeBERTa |
Training Data | 3.7 billion token curated selection from the OSCAR dataset, with additional filtering for Brazilian Portuguese |

