🚀 BERTugues Base (aka "BERTugues-base-portuguese-cased")
BERTugues Base is a pre-trained BERT model for Portuguese, targeting tasks such as Masked Language Modeling and Next Sentence Prediction, with performance improvements over other Portuguese models on multiple downstream tasks.
🚀 Quick Start
BERTugues was pre-trained following the steps of the original BERT paper, with the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives. It was trained for 1 million steps on more than 20 GB of text. For more training details, please read the published paper. Like Bertimbau, it was pre-trained on the BrWAC dataset and used the Portuguese Wikipedia to train the tokenizer, with some improvements in the training flow:
- Removal of characters uncommon in Portuguese from tokenizer training: in Bertimbau, more than 7,000 of the 29,794 tokens contain East Asian or other special characters rarely used in Portuguese, such as "##漫", "##켝" and "##前". In BERTugues, these characters were removed before training the tokenizer.
- 😀 Addition of the most common Emojis to the tokenizer: Wikipedia contains few Emojis in its text, so few of them made it into the vocabulary, even though, as demonstrated in the literature, they are important for a number of tasks.
- Quality filtering of the BrWAC texts: we followed the heuristic filters proposed in DeepMind's Gopher paper to remove low-quality texts from BrWAC (see the sketch after this list).
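To make these preprocessing steps concrete, here is a minimal sketch of the kind of filtering described above. The character ranges, thresholds, and the raw_brwac_texts placeholder are illustrative assumptions, not the exact code or values used to build BERTugues.
import re

# Characters rarely used in Portuguese (CJK and Hangul blocks; illustrative ranges).
RARE_CHARS = re.compile(r"[\u1100-\u11ff\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]")

def strip_rare_characters(text):
    # Drop these characters so they never reach the tokenizer vocabulary.
    return RARE_CHARS.sub("", text)

def passes_quality_filters(text):
    # Gopher-style heuristics: document length, mean word length and the fraction
    # of words with at least one alphabetic character (thresholds are illustrative;
    # see the Gopher paper for the full set of filters).
    words = text.split()
    if not 50 <= len(words) <= 100_000:
        return False
    mean_word_length = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_word_length <= 10:
        return False
    alpha_fraction = sum(any(c.isalpha() for c in w) for w in words) / len(words)
    return alpha_fraction >= 0.8

raw_brwac_texts = ["Exemplo de documento do BrWAC..."]  # hypothetical placeholder corpus
clean_corpus = [strip_rare_characters(t) for t in raw_brwac_texts if passes_quality_filters(t)]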
✨ Features
Tokenizer
By replacing tokens rarely used in Portuguese, we reduced the average number of words that are split into multiple tokens. On assin2, the same dataset used in the Bertimbau master's thesis evaluation, the average number of split words per text dropped from 3.8 to 3.0; for multilingual BERT, this number was 7.4.
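As an illustration, the split-word statistic can be computed directly with the tokenizer. The sentence below is a toy example, not the assin2 evaluation set.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("ricardoz/BERTugues-base-portuguese-cased", do_lower_case=False)

sentence = "Eduardo abriu os olhos, mas não quis se levantar."
# Count the words that the tokenizer breaks into two or more subword tokens.
split_words = sum(len(tokenizer.tokenize(word)) > 1 for word in sentence.split())
print(split_words)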

Performance
To compare performance, we evaluated a text classification problem on the IMDB movie review dataset, available in a good-quality Portuguese translation. We used the BERTugues sentence representation as features for a Random Forest classifier.
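As a rough sketch of this setup (with two placeholder reviews instead of the translated IMDB dataset), the [CLS] embedding of each text can be fed to a Random Forest as follows:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.ensemble import RandomForestClassifier

tokenizer = BertTokenizer.from_pretrained("ricardoz/BERTugues-base-portuguese-cased", do_lower_case=False)
model = BertModel.from_pretrained("ricardoz/BERTugues-base-portuguese-cased")
model.eval()

texts = ["Filme excelente, recomendo muito.", "Roteiro fraco e atuações ruins."]  # placeholder reviews
labels = [1, 0]  # 1 = positive, 0 = negative

def embed(text):
    # Use the [CLS] vector of the last hidden layer as the sentence representation.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).last_hidden_state[:, 0].squeeze(0).numpy()

features = [embed(t) for t in texts]
clf = RandomForestClassifier().fit(features, labels)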
We also used the performance comparison from the paper "JurisBERT: Transformer-based model for embedding legal texts", which pre-trains a BERT specifically for texts of the legal domain, using multilingual BERT and Bertimbau as baselines. Here we ran the code provided by the paper's authors and added BERTugues. The model is used to decide whether two texts are about the same topic.
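For the topic-similarity setting, a hedged sketch is to compare the [CLS] embeddings of two texts with cosine similarity; the example texts and the 0.8 decision threshold below are illustrative assumptions, not the JurisBERT evaluation protocol.
import torch
from torch.nn.functional import cosine_similarity
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("ricardoz/BERTugues-base-portuguese-cased", do_lower_case=False)
model = BertModel.from_pretrained("ricardoz/BERTugues-base-portuguese-cased")
model.eval()

def cls_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).last_hidden_state[:, 0]

a = cls_embedding("Recurso especial sobre prescrição tributária.")
b = cls_embedding("Apelação cível discutindo prazo prescricional de tributos.")
score = cosine_similarity(a, b).item()
same_topic = score > 0.8  # illustrative decision threshold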
| Model | IMDB (F1) | STJ (F1) | PJERJ (F1) | TJMS (F1) | Average F1 |
| --- | --- | --- | --- | --- | --- |
| Multilingual BERT | 72.0% | 30.4% | 63.8% | 65.0% | 57.8% |
| Bertimbau-Base | 82.2% | 35.6% | 63.9% | 71.2% | 63.2% |
| Bertimbau-Large | 85.3% | 43.0% | 63.8% | 74.0% | 66.5% |
| BERTugues-Base | 84.0% | 45.2% | 67.5% | 70.0% | 66.7% |
BERTugues outperformed Bertimbau-Base in 3 of the 4 tasks and Bertimbau-Large in 2 of the 4 tasks, even though Bertimbau-Large is a much larger model (roughly 3 times more parameters) and computationally more expensive.
💻 Usage Examples
Basic Usage
Many usage examples are available on our GitHub. Here are two examples for quick reference:
Masked Language Modeling
from transformers import BertTokenizer, BertForMaskedLM, pipeline

# Load the model and tokenizer; do_lower_case=False keeps the cased vocabulary.
model = BertForMaskedLM.from_pretrained("ricardoz/BERTugues-base-portuguese-cased")
tokenizer = BertTokenizer.from_pretrained("ricardoz/BERTugues-base-portuguese-cased", do_lower_case=False)

# The fill-mask pipeline adds the [CLS]/[SEP] special tokens automatically,
# so the raw sentence is enough.
pipe = pipeline('fill-mask', model=model, tokenizer=tokenizer, top_k=3)
pipe('Eduardo abriu os [MASK], mas não quis se levantar. Ficou deitado e viu que horas eram.')
Sentence Embedding
from transformers import BertTokenizer, BertModel
import torch

model = BertModel.from_pretrained("ricardoz/BERTugues-base-portuguese-cased")
tokenizer = BertTokenizer.from_pretrained("ricardoz/BERTugues-base-portuguese-cased", do_lower_case=False)

# encode() already adds the [CLS] and [SEP] special tokens.
input_ids = tokenizer.encode('Eduardo abriu os olhos, mas não quis se levantar. Ficou deitado e viu que horas eram.', return_tensors='pt')

with torch.no_grad():
    # Take the hidden state of the [CLS] token as the sentence embedding.
    last_hidden_state = model(input_ids).last_hidden_state[:, 0]

last_hidden_state
📄 License
This model is under the "other" license.
📚 Documentation
Citation
If you use BERTugues in your publications, please cite it! Citations help the model gain recognition and visibility in the scientific community.
@article{Zago2024bertugues,
  title = {BERTugues: A Novel BERT Transformer Model Pre-trained for Brazilian Portuguese},
  volume = {45},
  url = {https://ojs.uel.br/revistas/uel/index.php/semexatas/article/view/50630},
  DOI = {10.5433/1679-0375.2024.v45.50630},
  journal = {Semina: Ciências Exatas e Tecnológicas},
  author = {Mazza Zago, Ricardo and Agnoletti dos Santos Pedotti, Luciane},
  year = {2024},
  month = {Dec.},
  pages = {e50630}
}
More Information
For more information, please visit our GitHub!