🚀 rufimelo/Legal-BERTimbau-sts-base-ma-v2
This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search. rufimelo/Legal-BERTimbau-sts-base-ma-v2 is based on Legal-BERTimbau-base, which is derived from the base BERTimbau model. It is adapted to the Portuguese legal domain and trained for Semantic Textual Similarity (STS) on Portuguese datasets.
📋 Model Information
| Property | Details |
|---|---|
| Model Type | Sentence-Transformers |
| Task | Sentence Similarity |
| Datasets | assin, assin2, stsb_multi_mt, rufimelo/PortugueseLegalSentences-v0 |
| Model Index | BERTimbau |
🛠️ Example Widget
- Source Sentence: "O advogado apresentou as provas ao juiz." ("The lawyer presented the evidence to the judge.")
- Comparison Sentences:
  - "O juiz leu as provas." ("The judge read the evidence.")
  - "O juiz leu o recurso." ("The judge read the appeal.")
  - "O juiz atirou uma pedra." ("The judge threw a stone.")
- Example Title: "Example 1"
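The widget ranks the comparison sentences by semantic similarity to the source sentence. As a minimal sketch (not part of the original card), the same comparison can be reproduced locally with sentence-transformers' `util.cos_sim`:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('rufimelo/Legal-BERTimbau-sts-base-ma-v2')

source = "O advogado apresentou as provas ao juiz."
candidates = [
    "O juiz leu as provas.",
    "O juiz leu o recurso.",
    "O juiz atirou uma pedra.",
]

# Encode the source and candidates, then score each candidate by cosine similarity
source_emb = model.encode(source, convert_to_tensor=True)
candidate_embs = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(source_emb, candidate_embs)  # shape: (1, 3)

# Print candidates from most to least similar
for sentence, score in sorted(zip(candidates, scores[0].tolist()), key=lambda x: -x[1]):
    print(f"{score:.4f}  {sentence}")
```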
📊 Model Results
| Task | Metric | Value |
|---|---|---|
| STS | Pearson Correlation - assin Dataset | 0.75481 |
| STS | Pearson Correlation - assin2 Dataset | 0.80262 |
| STS | Pearson Correlation - stsb_multi_mt pt Dataset | 0.82178 |
🚀 Quick Start
📦 Installation
Using this model is straightforward once you have sentence-transformers installed:

```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

sentences = ["Isto é um exemplo", "Isto é um outro exemplo"]

model = SentenceTransformer('rufimelo/Legal-BERTimbau-sts-base-ma-v2')
embeddings = model.encode(sentences)
print(embeddings)
```
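`encode` returns one 768-dimensional vector per input sentence, so `embeddings` here is a NumPy array of shape `(2, 768)`; pass `convert_to_tensor=True` if you prefer a PyTorch tensor.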
Advanced Usage
```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: average the token embeddings, weighting by the attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


sentences = ['This is an example sentence', 'Each sentence is converted']

# Load the model as a plain Hugging Face transformer
tokenizer = AutoTokenizer.from_pretrained('rufimelo/Legal-BERTimbau-sts-base-ma-v2')
model = AutoModel.from_pretrained('rufimelo/Legal-BERTimbau-sts-base-ma-v2')

# Tokenize the sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Apply mean pooling to obtain sentence embeddings
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
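Because the model's pooling layer uses mean pooling (see the architecture below), these manually pooled embeddings match what `model.encode` produces. As one possible follow-up, not part of the original snippet, the two sentences can be compared with cosine similarity:

```python
import torch.nn.functional as F

# Cosine similarity between the two pooled sentence embeddings
similarity = F.cosine_similarity(sentence_embeddings[0], sentence_embeddings[1], dim=0)
print(f"Cosine similarity: {similarity.item():.4f}")
```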
📈 Evaluation Results (STS)

| Model | Assin | Assin2 | stsb_multi_mt pt | avg |
|---|---|---|---|---|
| Legal-BERTimbau-sts-base | 0.71457 | 0.73545 | 0.72383 | 0.72462 |
| Legal-BERTimbau-sts-base-ma | 0.74874 | 0.79532 | 0.82254 | 0.78886 |
| Legal-BERTimbau-sts-base-ma-v2 | 0.75481 | 0.80262 | 0.82178 | 0.79307 |
| Legal-BERTimbau-base-TSDAE-sts | 0.78814 | 0.81380 | 0.75777 | 0.78657 |
| Legal-BERTimbau-sts-large | 0.76629 | 0.82357 | 0.79120 | 0.79369 |
| Legal-BERTimbau-sts-large-v2 | 0.76299 | 0.81121 | 0.81726 | 0.79715 |
| Legal-BERTimbau-sts-large-ma | 0.76195 | 0.81622 | 0.82608 | 0.80142 |
| Legal-BERTimbau-sts-large-ma-v2 | 0.7836 | 0.8462 | 0.8261 | 0.81863 |
| Legal-BERTimbau-sts-large-ma-v3 | 0.7749 | 0.8470 | 0.8364 | 0.81943 |
| Legal-BERTimbau-large-v2-sts | 0.71665 | 0.80106 | 0.73724 | 0.75165 |
| Legal-BERTimbau-large-TSDAE-sts | 0.72376 | 0.79261 | 0.73635 | 0.75090 |
| Legal-BERTimbau-large-TSDAE-sts-v2 | 0.81326 | 0.83130 | 0.786314 | 0.81029 |
| Legal-BERTimbau-large-TSDAE-sts-v3 | 0.80703 | 0.82270 | 0.77638 | 0.80204 |
| ---------------------------------------- | ---------- | ---------- | ---------- | ---------- |
| BERTimbau base Fine-tuned for STS | 0.78455 | 0.80626 | 0.82841 | 0.80640 |
| BERTimbau large Fine-tuned for STS | 0.78193 | 0.81758 | 0.83784 | 0.81245 |
| ---------------------------------------- | ---------- | ---------- | ---------- | ---------- |
| paraphrase-multilingual-mpnet-base-v2 | 0.71457 | 0.79831 | 0.83999 | 0.78429 |
| paraphrase-multilingual-mpnet-base-v2 Fine-tuned with assin(s) | 0.77641 | 0.79831 | 0.84575 | 0.80682 |
🔧 Training
rufimelo/Legal-BERTimbau-sts-base-ma-v2 is based on Legal-BERTimbau-base, which is derived from the base BERTimbau model.
Because large Portuguese training sets are scarce, the model was first trained with multilingual knowledge distillation. In that process, the teacher model was 'sentence-transformers/paraphrase-xlm-r-multilingual-v1', with English as the source language and Portuguese as the language to learn.
It was then fine-tuned for Semantic Textual Similarity on the assin, assin2, and stsb_multi_mt pt datasets.
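For illustration only, here is a minimal sketch of those two stages using the sentence-transformers v2 training API. This is not the author's actual training script: the student checkpoint name, the parallel-data path `parallel-en-pt.tsv`, the toy STS pair, and all hyperparameters are assumptions. `ParallelSentencesDataset` with `losses.MSELoss` is the library's standard multilingual knowledge-distillation setup, and `losses.CosineSimilarityLoss` its standard STS objective.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models
from sentence_transformers.datasets import ParallelSentencesDataset

# Stage 1: multilingual knowledge distillation.
# Student: Legal-BERTimbau-base with mean pooling (assumed checkpoint name).
word_emb = models.Transformer('rufimelo/Legal-BERTimbau-base', max_seq_length=128)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())  # mean pooling by default
student = SentenceTransformer(modules=[word_emb, pooling])

# Teacher named in the card; the student learns to map both languages
# onto the teacher's English embedding space.
teacher = SentenceTransformer('sentence-transformers/paraphrase-xlm-r-multilingual-v1')

# Tab-separated English/Portuguese sentence pairs (hypothetical file)
kd_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
kd_data.load_data('parallel-en-pt.tsv')
kd_loader = DataLoader(kd_data, batch_size=32, shuffle=True)
student.fit(train_objectives=[(kd_loader, losses.MSELoss(model=student))],
            epochs=1, warmup_steps=100)

# Stage 2: STS fine-tuning on assin/assin2/stsb_multi_mt pt (toy example pair)
sts_examples = [InputExample(texts=['frase a', 'frase b'], label=0.8)]  # labels scaled to [0, 1]
sts_loader = DataLoader(sts_examples, batch_size=16, shuffle=True)
student.fit(train_objectives=[(sts_loader, losses.CosineSimilarityLoss(model=student))],
            epochs=1, warmup_steps=100)
```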
📚 Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
```
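A quick way to confirm this stack and the key dimensions after loading the model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('rufimelo/Legal-BERTimbau-sts-base-ma-v2')
print(model)                                     # prints the Transformer + Pooling stack above
print(model.get_sentence_embedding_dimension())  # 768
print(model.max_seq_length)                      # 128; longer inputs are truncated
```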
📄 Citing & Authors
If you use this work, please cite:
```bibtex
@inproceedings{souza2020bertimbau,
  author    = {F{\'a}bio Souza and Rodrigo Nogueira and Roberto Lotufo},
  title     = {{BERT}imbau: pretrained {BERT} models for {B}razilian {P}ortuguese},
  booktitle = {9th Brazilian Conference on Intelligent Systems, {BRACIS}, Rio Grande do Sul, Brazil, October 20-23 (to appear)},
  year      = {2020}
}

@inproceedings{fonseca2016assin,
  title     = {{ASSIN}: Avaliacao de similaridade semantica e inferencia textual},
  author    = {Fonseca, E. and Santos, L. and Criscuolo, Marcelo and Aluisio, S.},
  booktitle = {Computational Processing of the Portuguese Language - 12th International Conference, Tomar, Portugal},
  pages     = {13--15},
  year      = {2016}
}

@inproceedings{real2020assin,
  title     = {The {ASSIN} 2 shared task: a quick overview},
  author    = {Real, Livy and Fonseca, Erick and Oliveira, Hugo Goncalo},
  booktitle = {International Conference on Computational Processing of the Portuguese Language},
  pages     = {406--412},
  year      = {2020},
  organization = {Springer}
}

@inproceedings{huggingface:dataset:stsb_multi_mt,
  title     = {Machine translated multilingual {STS} benchmark dataset},
  author    = {Philip May},
  year      = {2021},
  url       = {https://github.com/PhilipMay/stsb-multi-mt}
}
```