🚀 Albertina 1.5B PTBR
Albertina 1.5B PTBR is a large foundation language model tailored for the American variant of Portuguese, the variant spoken in Brazil (PTBR). It is an encoder of the BERT family, based on the Transformer neural architecture and developed over the DeBERTa model. It offers strong performance for Portuguese and comes in versions trained for different variants of the language, including the European (PTPT) and American (PTBR) variants. It is freely distributed under an open-source license.
✨ Features
- Variant-Specific Training: Available in versions trained for both the European (PTPT) and American (PTBR) variants of Portuguese.
- State-of-the-Art Performance: With 1.5 billion parameters, it set a new state of the art for the American variant of Portuguese at the time of its initial release.
- Open-Source Distribution: Freely available for reuse under a permissive license.
📦 Installation
The model card does not list dedicated installation steps; the usage examples below only require the Hugging Face transformers library (plus datasets and PyTorch for the fine-tuning example).
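As an assumption (the card itself gives no install commands), these dependencies can typically be set up with pip:

pip install transformers datasets torch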
💻 Usage Examples
Basic Usage
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='PORTULAN/albertina-1b5-portuguese-ptbr-encoder')
>>> unmasker("A culinária portuguesa é rica em sabores e [MASK], tornando-se um dos maiores tesouros do país.")
[{'score': 0.8332648277282715, 'token': 14690, 'token_str': ' costumes', 'sequence': 'A culinária portuguesa é rica em sabores e costumes, tornando-se um dos maiores tesouros do país.'},
{'score': 0.07860890030860901, 'token': 29829, 'token_str': ' cores', 'sequence': 'A culinária portuguesa é rica em sabores e cores, tornando-se um dos maiores tesouros do país.'},
{'score': 0.03278181701898575, 'token': 35277, 'token_str': ' arte', 'sequence': 'A culinária portuguesa é rica em sabores e arte, tornando-se um dos maiores tesouros do país.'},
{'score': 0.009515956044197083, 'token': 9240, 'token_str': ' cor', 'sequence': 'A culinária portuguesa é rica em sabores e cor, tornando-se um dos maiores tesouros do país.'},
{'score': 0.009381960146129131, 'token': 33455, 'token_str': ' nuances', 'sequence': 'A culinária portuguesa é rica em sabores e nuances, tornando-se um dos maiores tesouros do país.'}]
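Beyond masked language modeling, the encoder can also be used to produce contextual embeddings. The following is a minimal sketch, not taken from the original card: it loads the encoder with the generic AutoModel class and mean-pools the last hidden states, where the pooling strategy is an illustrative assumption rather than a recommendation by the authors.
>>> import torch
>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained('PORTULAN/albertina-1b5-portuguese-ptbr-encoder')
>>> model = AutoModel.from_pretrained('PORTULAN/albertina-1b5-portuguese-ptbr-encoder')
>>> inputs = tokenizer("A culinária portuguesa é rica em sabores.", return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> # Mean-pool the token representations into one sentence vector;
>>> # the result has shape (1, 1536), matching the model's hidden size of 1536.
>>> embedding = outputs.last_hidden_state.mean(dim=1)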
Advanced Usage
The model can also be used by fine-tuning it for a specific downstream task:
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
>>> from datasets import load_dataset
>>> model = AutoModelForSequenceClassification.from_pretrained("PORTULAN/albertina-1b5-portuguese-ptbr-encoder", num_labels=2)
>>> tokenizer = AutoTokenizer.from_pretrained("PORTULAN/albertina-1b5-portuguese-ptbr-encoder")
>>> dataset = load_dataset("PORTULAN/glue-ptbr", "rte")
>>> def tokenize_function(examples):
...     return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True)
>>> tokenized_datasets = dataset.map(tokenize_function, batched=True)
>>> training_args = TrainingArguments(output_dir="albertina-ptbr-rte", evaluation_strategy="epoch")
>>> trainer = Trainer(
... model=model,
... args=training_args,
... train_dataset=tokenized_datasets["train"],
... eval_dataset=tokenized_datasets["validation"],
... )
>>> trainer.train()
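After training, the validation split can be scored in the same session. This is a standard Trainer call rather than part of the original example; without a compute_metrics function it reports only the evaluation loss and runtime statistics.
>>> metrics = trainer.evaluate()
>>> print(metrics)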
📚 Documentation
Model Description
This model card is for Albertina 1.5B PTBR, which has 1.5 billion parameters, 48 layers, and a hidden size of 1536. It is distributed under an MIT license. The underlying DeBERTa model is also distributed under an MIT license.
Training Data
Albertina 1.5B PTBR was trained on a 36-billion-token dataset. The data was collected from openly available American Portuguese corpora from the following source:
- CulturaX: A multilingual corpus freely available for research and AI development, created by combining and cleaning the mC4 and OSCAR datasets. It is derived from the Common Crawl dataset, with additional filtering to retain only pages with permission to be crawled, deduplication, and boilerplate removal. Since CulturaX does not distinguish between Portuguese variants, extra filtering was applied to keep only documents with the Portuguese Internet country-code top-level domain (a sketch of this kind of domain filtering follows this list).
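As an illustration of the country-code filtering mentioned above, the sketch below streams CulturaX from the Hugging Face Hub and keeps only documents whose source URL uses a given country-code top-level domain. The dataset identifier, configuration name, and the url field are assumptions made for this sketch; the authors' actual filtering pipeline is not published in this card.

from urllib.parse import urlparse
from datasets import load_dataset

CCTLD = ".pt"  # country-code top-level domain to keep; adjust to the variant being targeted

# Assumed dataset id and config; streaming avoids downloading the full corpus.
stream = load_dataset("uonlp/CulturaX", "pt", split="train", streaming=True)

def keep_cctld(example):
    # Assumes each record exposes its source URL under the "url" field.
    host = urlparse(example["url"]).netloc
    return host.endswith(CCTLD)

filtered = stream.filter(keep_cctld)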
Preprocessing
The PTBR corpora were filtered using the [BLOOM pre-processing](https://github.com/bigscience-workshop/data-preparation) pipeline. The default stopword filtering was skipped to preserve the syntactic structure, and language identification filtering was also skipped as the corpus was pre-selected as Portuguese.
Training
The [DeBERTa V2 xxlarge](https://huggingface.co/microsoft/deberta-v2-xxlarge) for English was used as the codebase. To train Albertina 1.5B PTBR, the dataset was tokenized with the original DeBERTa tokenizer:
- 128-token sequence truncation and dynamic padding for 250k steps (equivalent to 48 hours of computation on an a2-megagpu-16gb Google Cloud A2 node for 128-token input sequences).
- 256-token sequence truncation for 80k steps (Albertina 1.5B PTBR 256, equivalent to 24 hours of computation for 256-token input sequences).
- 512-token sequence truncation for 60k steps (equivalent to 24 hours of computation for 512-token input sequences).
A learning rate of 1e-5 with linear decay and 10k warm-up steps was used.
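For orientation, the learning-rate schedule described above maps onto the standard linear-warmup scheduler in transformers. The sketch below uses the stated values (learning rate 1e-5, 10k warm-up steps, 250k steps for the first phase); the optimizer choice and the rest of the setup are assumptions, not the authors' actual training script.

import torch
from transformers import AutoModelForMaskedLM, get_linear_schedule_with_warmup

LEARNING_RATE = 1e-5   # as stated in the card
WARMUP_STEPS = 10_000  # as stated in the card
TOTAL_STEPS = 250_000  # steps of the first, 128-token phase

model = AutoModelForMaskedLM.from_pretrained("PORTULAN/albertina-1b5-portuguese-ptbr-encoder")
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)  # optimizer choice is an assumption
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=TOTAL_STEPS
)
# In the training loop, scheduler.step() follows each optimizer.step(): the learning rate
# rises linearly for 10k steps and then decays linearly towards zero.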
Performance
The model was evaluated on extraGLUE, a PTBR version of the GLUE and SuperGLUE benchmarks. The tasks from GLUE and SuperGLUE were automatically translated into Portuguese using DeepL Translate.
Model | RTE (Accuracy) | WNLI (Accuracy) | MRPC (F1) | STS-B (Pearson) | COPA (Accuracy) | CB (F1) | MultiRC (F1) | BoolQ (Accuracy) |
---|---|---|---|---|---|---|---|---|
Albertina 1.5B PTBR | 0.8676 | 0.4742 | 0.8622 | 0.9007 | 0.7767 | 0.6372 | 0.7667 | 0.8654 |
Albertina 1.5B PTBR 256 | 0.8123 | 0.4225 | 0.8638 | 0.8968 | 0.8533 | 0.6884 | 0.6799 | 0.8509 |
Albertina 900M PTBR | 0.7545 | 0.4601 | 0.9071 | 0.8910 | 0.7767 | 0.5799 | 0.6731 | 0.8385 |
BERTimbau (335M) | 0.6446 | 0.5634 | 0.8873 | 0.8842 | 0.6933 | 0.5438 | 0.6787 | 0.7783 |
Albertina 100M PTBR | 0.6582 | 0.5634 | 0.8149 | 0.8489 | n.a. | 0.4771 | 0.6469 | 0.7537 |
DeBERTa 1.5B (English) | 0.7112 | 0.5634 | 0.8545 | 0.0123 | 0.5700 | 0.4307 | 0.3639 | 0.6217 |
DeBERTa 100M (English) | 0.5716 | 0.5587 | 0.8060 | 0.8266 | n.a. | 0.4739 | 0.6391 | 0.6838 |
🔧 Technical Details
Albertina 1.5B PTBR is an encoder in the BERT family, based on the Transformer architecture and developed over the DeBERTa model. It has 1.5 billion parameters, 48 layers, and a hidden size of 1536. The training process involves specific tokenization, sequence truncation, and learning rate settings as described in the Training section.
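These architecture figures can be checked against the published model configuration; the attribute names below are the standard ones exposed by the DeBERTa-v2 configuration class in transformers.

from transformers import AutoConfig

config = AutoConfig.from_pretrained("PORTULAN/albertina-1b5-portuguese-ptbr-encoder")
print(config.hidden_size)        # expected to be 1536, matching the card
print(config.num_hidden_layers)  # expected to be 48, matching the card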
📄 License
This model is distributed under an MIT license, as is the underlying DeBERTa model.
📖 Citation
When using or citing this model, please use the following reference:
@misc{albertina-pt-fostering,
      title={Fostering the Ecosystem of Open Neural Encoders for Portuguese with Albertina PT-* family},
      author={Rodrigo Santos and João Rodrigues and Luís Gomes and João Silva and António Branco and Henrique Lopes Cardoso and Tomás Freitas Osório and Bernardo Leite},
      year={2024},
      eprint={2403.01897},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
🙏 Acknowledgments
The research was partially supported by the following:
- PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020, and FCT—Fundação para a Ciência e Tecnologia under the grant PINFRA/22117/2016.
- Research project ALBERTINA - Foundation Encoder Model for Portuguese and AI, funded by FCT—Fundação para a Ciência e Tecnologia under the grant CPCA-IAC/AV/478394/2022.
- Innovation project ACCELERAT.AI - Multilingual Intelligent Contact Centers, funded by IAPMEI, I.P. - Agência para a Competitividade e Inovação under the grant C625734525-00462629 of the Plano de Recuperação e Resiliência, call RE-C05-i01.01 – Agendas/Alianças Mobilizadoras para a Reindustrialização.
- LIACC - Laboratory for AI and Computer Science, funded by FCT—Fundação para a Ciência e Tecnologia under the grant FCT/UID/CEC/0027/2020.

