
BERTugues Base Portuguese Cased

Developed by ricardoz
BERTugues is a BERT model trained on Portuguese text that strictly follows the original BERT paper's pre-training procedure, completing the masked language modeling and next sentence prediction objectives over 1 million training steps.
Release date: 8/7/2023

Model Overview

BERTugues is a BERT model optimized for Portuguese, achieving strong performance on multiple Portuguese NLP tasks through an improved tokenizer and higher-quality training data.

Model Features

Optimized Tokenizer
Removes characters that are rare in Portuguese and adds high-frequency emojis, significantly reducing how often words are split into multiple tokens.
Data Quality Filtering
Applies the heuristic filters proposed in the Gopher paper to remove low-quality documents from the BrWAC corpus.
Performance Advantage
Outperforms comparable models on multiple Portuguese NLP tasks, in some cases surpassing models with three times as many parameters.

Model Capabilities

Masked language modeling
Sentence similarity calculation
Next sentence prediction
Text feature extraction
Text classification
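The masked language modeling capability above is typically exercised through the Hugging Face `transformers` fill-mask pipeline. A minimal sketch follows; the repository id `ricardoz/BERTugues-base-portuguese-cased` is an assumption based on the model and developer names shown on this page, and the pipeline call requires `transformers` to be installed.

```python
# Assumed Hugging Face Hub repository id; adjust if the actual name differs.
MODEL_ID = "ricardoz/BERTugues-base-portuguese-cased"

def top_prediction(candidates):
    """Pick the highest-scoring fill from a fill-mask pipeline result,
    which is a list of dicts with 'score' and 'token_str' keys."""
    return max(candidates, key=lambda c: c["score"])["token_str"]

def main():
    # Deferred import: requires the `transformers` package (and a backend
    # such as PyTorch) plus network access to download the model.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    candidates = fill_mask("O filme foi muito [MASK].")
    print(top_prediction(candidates))

if __name__ == "__main__":
    main()
```

The pipeline returns one candidate per vocabulary fill, so `top_prediction` simply selects the model's most confident completion.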

Use Cases

Sentiment Analysis
Portuguese Movie Review Classification
Uses BERTugues-generated sentence representations with a random forest classifier for sentiment analysis.
Achieves an F1 score of 84.0% on the Portuguese IMDB dataset, outperforming comparable models.
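The embedding-plus-classifier setup described above can be sketched with scikit-learn. The synthetic vectors below are stand-ins for the sentence representations BERTugues would produce for each review (the page does not specify how the representations are pooled, so the feature-extraction step is omitted here); the random forest part mirrors the described classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-ins for BERTugues sentence embeddings: two well-separated
# synthetic clusters play the role of positive and negative reviews
# so the sketch runs without downloading the model.
rng = np.random.default_rng(0)
pos = rng.normal(loc=1.0, scale=0.1, size=(20, 8))   # "positive" reviews
neg = rng.normal(loc=-1.0, scale=0.1, size=(20, 8))  # "negative" reviews
X = np.vstack([pos, neg])
y = np.array([1] * 20 + [0] * 20)

# Random forest over the embeddings, as in the described use case.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

# A new "review" embedding near the positive cluster is classified as 1.
probe = np.full((1, 8), 0.9)
print(clf.predict(probe))  # → [1]
```

In practice `X` would be built by running each review through BERTugues and pooling the token outputs into one fixed-size vector per review.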
Legal Text Processing
Legal Text Topic Classification
Determines whether two legal texts belong to the same topic.
Achieves an F1 score of 45.2% on the STJ dataset, outperforming the BERTimbau-Large model.
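Judging whether two legal texts share a topic maps naturally onto BERT's next sentence prediction head, which scores a pair of texts for coherence. A minimal sketch, assuming the repository id `ricardoz/BERTugues-base-portuguese-cased` (an assumption based on this page) and the standard `transformers` NSP API:

```python
import math

# Assumed Hugging Face Hub repository id.
MODEL_ID = "ricardoz/BERTugues-base-portuguese-cased"

def is_next_sentence(logits):
    """An NSP head emits two logits: [is_next, not_next]. Softmax them
    and report whether the pair was judged coherent."""
    exp = [math.exp(x) for x in logits]
    return exp[0] / sum(exp) > 0.5

def main():
    # Deferred imports: requires `transformers` and `torch` installed,
    # plus network access to download the model.
    import torch
    from transformers import BertForNextSentencePrediction, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(MODEL_ID)
    model = BertForNextSentencePrediction.from_pretrained(MODEL_ID)
    inputs = tokenizer(
        "O réu foi absolvido por falta de provas.",
        "A sentença absolutória foi publicada ontem.",
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    print(is_next_sentence(logits))

if __name__ == "__main__":
    main()
```

Note that NSP scores pairwise coherence rather than topic labels directly; the 45.2% F1 figure above presumably comes from a task-specific evaluation setup not detailed on this page.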