🚀 rufimelo/Legal-BERTimbau-sts-base-ma-v2
This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search. rufimelo/Legal-BERTimbau-sts-base-ma-v2 is based on Legal-BERTimbau-base, which is derived from the base BERTimbau model. It is adapted to the Portuguese legal domain and trained for Semantic Textual Similarity (STS) on Portuguese datasets.
📋 Model Information
| Property | Details |
|---|---|
| Model Type | Sentence-Transformers |
| Task | Sentence Similarity |
| Datasets | assin, assin2, stsb_multi_mt, rufimelo/PortugueseLegalSentences-v0 |
| Model Index | BERTimbau |
🛠️ Example Widget
- Source Sentence: "O advogado apresentou as provas ao juiz." ("The lawyer presented the evidence to the judge.")
- Comparison Sentences:
  - "O juiz leu as provas." ("The judge read the evidence.")
  - "O juiz leu o recurso." ("The judge read the appeal.")
  - "O juiz atirou uma pedra." ("The judge threw a stone.")
- Example Title: "Example 1"
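The widget ranks the comparison sentences by semantic similarity to the source sentence. As a minimal sketch (not part of the original card), the same comparison can be reproduced locally with sentence-transformers' `util.cos_sim`:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('rufimelo/Legal-BERTimbau-sts-base-ma-v2')

source = "O advogado apresentou as provas ao juiz."
candidates = [
    "O juiz leu as provas.",
    "O juiz leu o recurso.",
    "O juiz atirou uma pedra.",
]

# Encode the source and candidates, then score each candidate by cosine similarity
source_emb = model.encode(source, convert_to_tensor=True)
candidate_embs = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(source_emb, candidate_embs)  # shape: (1, 3)

# Print candidates from most to least similar
for sentence, score in sorted(zip(candidates, scores[0].tolist()), key=lambda x: -x[1]):
    print(f"{score:.4f}  {sentence}")
```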
📊 Model Results
| Task | Metric | Value |
|---|---|---|
| STS | Pearson Correlation - assin Dataset | 0.75481 |
| STS | Pearson Correlation - assin2 Dataset | 0.80262 |
| STS | Pearson Correlation - stsb_multi_mt pt Dataset | 0.82178 |
🚀 Quick Start
📦 Installation
Using this model is straightforward once you have sentence-transformers installed:

```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

sentences = ["Isto é um exemplo", "Isto é um outro exemplo"]

model = SentenceTransformer('rufimelo/Legal-BERTimbau-sts-base-ma-v2')
embeddings = model.encode(sentences)
print(embeddings)
```
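`encode` returns one 768-dimensional vector per input sentence, so `embeddings` here is a NumPy array of shape `(2, 768)`; pass `convert_to_tensor=True` if you prefer a PyTorch tensor.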
Advanced Usage
```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: average the token embeddings, weighting by the attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


sentences = ['This is an example sentence', 'Each sentence is converted']

# Load the model as a plain Hugging Face transformer
tokenizer = AutoTokenizer.from_pretrained('rufimelo/Legal-BERTimbau-sts-base-ma-v2')
model = AutoModel.from_pretrained('rufimelo/Legal-BERTimbau-sts-base-ma-v2')

# Tokenize the sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Apply mean pooling to obtain sentence embeddings
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
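Because the model's pooling layer uses mean pooling (see the architecture below), these manually pooled embeddings match what `model.encode` produces. As one possible follow-up, not part of the original snippet, the two sentences can be compared with cosine similarity:

```python
import torch.nn.functional as F

# Cosine similarity between the two pooled sentence embeddings
similarity = F.cosine_similarity(sentence_embeddings[0], sentence_embeddings[1], dim=0)
print(f"Cosine similarity: {similarity.item():.4f}")
```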
📈 Evaluation Results (STS)

| Model | Assin | Assin2 | stsb_multi_mt pt | avg |
|---|---|---|---|---|
| Legal-BERTimbau-sts-base | 0.71457 | 0.73545 | 0.72383 | 0.72462 |
| Legal-BERTimbau-sts-base-ma | 0.74874 | 0.79532 | 0.82254 | 0.78886 |
| Legal-BERTimbau-sts-base-ma-v2 | 0.75481 | 0.80262 | 0.82178 | 0.79307 |
| Legal-BERTimbau-base-TSDAE-sts | 0.78814 | 0.81380 | 0.75777 | 0.78657 |
| Legal-BERTimbau-sts-large | 0.76629 | 0.82357 | 0.79120 | 0.79369 |
| Legal-BERTimbau-sts-large-v2 | 0.76299 | 0.81121 | 0.81726 | 0.79715 |
| Legal-BERTimbau-sts-large-ma | 0.76195 | 0.81622 | 0.82608 | 0.80142 |
| Legal-BERTimbau-sts-large-ma-v2 | 0.7836 | 0.8462 | 0.8261 | 0.81863 |
| Legal-BERTimbau-sts-large-ma-v3 | 0.7749 | 0.8470 | 0.8364 | 0.81943 |
| Legal-BERTimbau-large-v2-sts | 0.71665 | 0.80106 | 0.73724 | 0.75165 |
| Legal-BERTimbau-large-TSDAE-sts | 0.72376 | 0.79261 | 0.73635 | 0.75090 |
| Legal-BERTimbau-large-TSDAE-sts-v2 | 0.81326 | 0.83130 | 0.786314 | 0.81029 |
| Legal-BERTimbau-large-TSDAE-sts-v3 | 0.80703 | 0.82270 | 0.77638 | 0.80204 |
| ---------------------------------------- | ---------- | ---------- | ---------- | ---------- |
| BERTimbau base Fine-tuned for STS | 0.78455 | 0.80626 | 0.82841 | 0.80640 |
| BERTimbau large Fine-tuned for STS | 0.78193 | 0.81758 | 0.83784 | 0.81245 |
| ---------------------------------------- | ---------- | ---------- | ---------- | ---------- |
| paraphrase-multilingual-mpnet-base-v2 | 0.71457 | 0.79831 | 0.83999 | 0.78429 |
| paraphrase-multilingual-mpnet-base-v2 Fine-tuned with assin(s) | 0.77641 | 0.79831 | 0.84575 | 0.80682 |
🔧 Training
rufimelo/Legal-BERTimbau-sts-base-ma-v2 is based on Legal-BERTimbau-base, which is derived from the base BERTimbau model.
Because large Portuguese training sets are scarce, the model was first trained with multilingual knowledge distillation. In that process, the teacher model was 'sentence-transformers/paraphrase-xlm-r-multilingual-v1', with English as the source language and Portuguese as the language to learn.
It was then fine-tuned for Semantic Textual Similarity on the assin, assin2, and stsb_multi_mt pt datasets.
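For illustration only, here is a minimal sketch of those two stages using the sentence-transformers v2 training API. This is not the author's actual training script: the student checkpoint name, the parallel-data path `parallel-en-pt.tsv`, the toy STS pair, and all hyperparameters are assumptions. `ParallelSentencesDataset` with `losses.MSELoss` is the library's standard multilingual knowledge-distillation setup, and `losses.CosineSimilarityLoss` its standard STS objective.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models
from sentence_transformers.datasets import ParallelSentencesDataset

# Stage 1: multilingual knowledge distillation.
# Student: Legal-BERTimbau-base with mean pooling (assumed checkpoint name).
word_emb = models.Transformer('rufimelo/Legal-BERTimbau-base', max_seq_length=128)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())  # mean pooling by default
student = SentenceTransformer(modules=[word_emb, pooling])

# Teacher named in the card; the student learns to map both languages
# onto the teacher's English embedding space.
teacher = SentenceTransformer('sentence-transformers/paraphrase-xlm-r-multilingual-v1')

# Tab-separated English/Portuguese sentence pairs (hypothetical file)
kd_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
kd_data.load_data('parallel-en-pt.tsv')
kd_loader = DataLoader(kd_data, batch_size=32, shuffle=True)
student.fit(train_objectives=[(kd_loader, losses.MSELoss(model=student))],
            epochs=1, warmup_steps=100)

# Stage 2: STS fine-tuning on assin/assin2/stsb_multi_mt pt (toy example pair)
sts_examples = [InputExample(texts=['frase a', 'frase b'], label=0.8)]  # labels scaled to [0, 1]
sts_loader = DataLoader(sts_examples, batch_size=16, shuffle=True)
student.fit(train_objectives=[(sts_loader, losses.CosineSimilarityLoss(model=student))],
            epochs=1, warmup_steps=100)
```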
📚 Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
```
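A quick way to confirm this stack and the key dimensions after loading the model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('rufimelo/Legal-BERTimbau-sts-base-ma-v2')
print(model)                                     # prints the Transformer + Pooling stack above
print(model.get_sentence_embedding_dimension())  # 768
print(model.max_seq_length)                      # 128; longer inputs are truncated
```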
📄 Citing & Authors
If you use this work, please cite:
```bibtex
@inproceedings{souza2020bertimbau,
  author    = {F{\'a}bio Souza and Rodrigo Nogueira and Roberto Lotufo},
  title     = {{BERT}imbau: pretrained {BERT} models for {B}razilian {P}ortuguese},
  booktitle = {9th Brazilian Conference on Intelligent Systems, {BRACIS}, Rio Grande do Sul, Brazil, October 20-23 (to appear)},
  year      = {2020}
}

@inproceedings{fonseca2016assin,
  title     = {{ASSIN}: Avaliacao de similaridade semantica e inferencia textual},
  author    = {Fonseca, E. and Santos, L. and Criscuolo, Marcelo and Aluisio, S.},
  booktitle = {Computational Processing of the Portuguese Language - 12th International Conference, Tomar, Portugal},
  pages     = {13--15},
  year      = {2016}
}

@inproceedings{real2020assin,
  title     = {The {ASSIN} 2 shared task: a quick overview},
  author    = {Real, Livy and Fonseca, Erick and Oliveira, Hugo Goncalo},
  booktitle = {International Conference on Computational Processing of the Portuguese Language},
  pages     = {406--412},
  year      = {2020},
  organization = {Springer}
}

@inproceedings{huggingface:dataset:stsb_multi_mt,
  title     = {Machine translated multilingual {STS} benchmark dataset},
  author    = {Philip May},
  year      = {2021},
  url       = {https://github.com/PhilipMay/stsb-multi-mt}
}
```