bert-large-portuguese-cased-legal-mlm-sts-v1.0 Open Source Model - Supports Portuguese Legal Sentence Similarity Calculation

Bert Large Portuguese Cased Legal Mlm Sts V1.0

Developed by stjiris

A Portuguese sentence-transformers model for the legal domain, built on the BERTimbau large model, that supports sentence similarity calculation

Text Embedding

Transformers

Other #Portuguese legal text #Sentence similarity calculation #1024-dimensional vector

Downloads 880

Release Time: 11/22/2022

Model Overview

This is a sentence-transformers model that maps sentences and paragraphs into a 1024-dimensional vector space, suitable for tasks such as clustering or semantic search. The model is specifically optimized for the Portuguese legal domain and trained on multiple Portuguese sentence similarity datasets.

Model Features

Legal domain optimization

Specifically trained and optimized for the Portuguese legal domain, using approximately 30,000 legal documents as training data

High-performance sentence embedding

Maps sentences and paragraphs into a 1024-dimensional dense vector space, supporting semantic search and clustering tasks

Multi-dataset training

Trained on multiple datasets including assin, assin2, and the Portuguese subset of stsb_multi_mt

Model Capabilities

Sentence embedding generation

Semantic similarity calculation

Legal text processing

Portuguese text analysis

Use Cases

Legal text processing

Legal document similarity analysis

Compare semantic similarity between different legal documents

Legal case retrieval

Legal case retrieval system based on semantic similarity

General text processing

Document clustering

Automatically group Portuguese documents with similar content

Semantic search

Build a Portuguese search system based on semantics rather than keywords
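
The retrieval and semantic search use cases above come down to ranking documents by the similarity of their embeddings to a query embedding. The sketch below illustrates that idea in plain Python. The 3-dimensional vectors, the document ids, and the helper names `cosine_similarity` and `rank_documents` are all assumptions made for illustration; real embeddings from this model would have 1024 components.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs, top_k=2):
    """Return the top_k (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical document embeddings (made-up ids and values).
corpus = {
    "acordao_a": [0.9, 0.1, 0.0],
    "acordao_b": [0.1, 0.9, 0.1],
    "acordao_c": [0.7, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # hypothetical query embedding

results = rank_documents(query, corpus)
for doc_id, score in results:
    print(f"{doc_id}: {score:.4f}")
```

A brute-force scan like this is enough for small collections; larger corpora would typically use a vector index instead.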

🚀 stjiris/bert-large-portuguese-cased-legal-mlm-sts-v1.0 (Legal BERTimbau)

This is a sentence-transformers model that maps sentences and paragraphs to a 1024-dimensional dense vector space. It can be used for tasks like clustering or semantic search. The model is derived from BERTimbau large, adapted to the Portuguese legal domain, and trained for semantic textual similarity (STS) on Portuguese datasets.

📋 Metadata

Property	Details
Language	Portuguese
Thumbnail	Portuguese BERT for the Legal Domain
Pipeline Tag	sentence-similarity
Tags	sentence-transformers, sentence-similarity, transformers
Datasets	assin, assin2, stjiris/portuguese-legal-sentences-v1.0

🚀 Quick Start

Installation

To use this model, you need to install sentence-transformers:

pip install -U sentence-transformers

Basic Usage

from sentence_transformers import SentenceTransformer
sentences = ["Isto é um exemplo", "Isto é um outro exemplo"]

model = SentenceTransformer('stjiris/bert-large-portuguese-cased-legal-mlm-sts-v1.0')
embeddings = model.encode(sentences)
print(embeddings)
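
Once sentences are encoded, their embeddings can be compared with cosine similarity to get a similarity score. The following is a minimal sketch in plain Python. The short vectors are made-up placeholders standing in for the 1024-dimensional outputs of `model.encode`, and `cosine_similarity` is a helper name introduced here for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Placeholder vectors (assumption): real embeddings would come from
# model.encode(sentences) and have 1024 components each.
emb_a = [0.2, 0.1, 0.7, 0.4]
emb_b = [0.1, 0.1, 0.8, 0.3]

score = cosine_similarity(emb_a, emb_b)
print(f"similarity: {score:.4f}")
```

With real embeddings, `sentence_transformers.util.cos_sim(embeddings[0], embeddings[1])` computes the same quantity.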

Advanced Usage (HuggingFace Transformers, without sentence-transformers)

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    # The first element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('stjiris/bert-large-portuguese-cased-legal-mlm-sts-v1.0')
model = AutoModel.from_pretrained('stjiris/bert-large-portuguese-cased-legal-mlm-sts-v1.0')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
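
To see why the attention mask matters in the pooling step, the toy example below repeats the same masked average on plain Python lists for a single sequence. It is only an illustration: the 2-dimensional vectors are made up, and `masked_mean_pool` is a hypothetical helper, not part of any library.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    # Guard against an all-padding sequence, in the same spirit as
    # torch.clamp(..., min=1e-9) in the tensor version.
    count = max(count, 1)
    return [total / count for total in totals]

# Toy sequence: two real tokens followed by one padding token (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

Without the mask, the padding vector would be averaged in and distort the sentence embedding.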

📊 Model Results

Task	Metric	Dataset	Value
STS	Pearson Correlation	assin	0.7716333759993093
STS	Pearson Correlation	assin2	0.8403302138785704
STS	Pearson Correlation	stsb_multi_mt pt	0.8249826985133595
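
For context, the Pearson correlation in an STS evaluation measures how linearly the model's similarity scores track the human-annotated gold scores. The sketch below computes it in plain Python. The numbers are hypothetical and are not taken from assin, assin2, or stsb_multi_mt, and `pearson` is a helper written for this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical gold similarity labels and model cosine similarities.
gold = [1.0, 2.5, 4.0, 5.0]
predicted = [0.10, 0.45, 0.80, 0.95]

r = pearson(gold, predicted)
print(f"Pearson r = {r:.4f}")
```

Because the coefficient is invariant to linear rescaling, cosine similarities in [-1, 1] can be compared directly against gold scores on a different scale.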

🔧 Technical Details

The model was trained with masked language modeling (MLM) at a learning rate of 3e-5 on legal sentences drawn from approximately 30,000 documents, for 130k training steps. This configuration gave the best performance in the authors' semantic search system implementation.

📦 Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 514, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)

📄 Citing & Authors

If you use this work, please cite:

@InProceedings{MeloSemantic,
  author="Melo, Rui
  and Santos, Pedro A.
  and Dias, Jo{\~a}o",
  editor="Moniz, Nuno
  and Vale, Zita
  and Cascalho, Jos{\'e}
  and Silva, Catarina
  and Sebasti{\~a}o, Raquel",
  title="A Semantic Search System for the Supremo Tribunal de Justi{\c{c}}a",
  booktitle="Progress in Artificial Intelligence",
  year="2023",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="142--154",
  abstract="Many information retrieval systems use lexical approaches to retrieve information. Such approaches have multiple limitations, and these constraints are exacerbated when tied to specific domains, such as the legal one. Large language models, such as BERT, deeply understand a language and may overcome the limitations of older methodologies, such as BM25. This work investigated and developed a prototype of a Semantic Search System to assist the Supremo Tribunal de Justi{\c{c}}a (Portuguese Supreme Court of Justice) in its decision-making process. We built a Semantic Search System that uses specially trained BERT models (Legal-BERTimbau variants) and a Hybrid Search System that incorporates both lexical and semantic techniques by combining the capabilities of BM25 and the potential of Legal-BERTimbau. In this context, we obtained a 335{\%} increase on the discovery metric when compared to BM25 for the first query result. This work also provides information on the most relevant techniques for training a Large Language Model adapted to Portuguese jurisprudence and introduces a new technique of Metadata Knowledge Distillation.",
  isbn="978-3-031-49011-8"
}


@inproceedings{souza2020bertimbau,
  author    = {F{\'a}bio Souza and
               Rodrigo Nogueira and
               Roberto Lotufo},
  title     = {{BERT}imbau: pretrained {BERT} models for {B}razilian {P}ortuguese},
  booktitle = {9th Brazilian Conference on Intelligent Systems, {BRACIS}, Rio Grande do Sul, Brazil, October 20-23 (to appear)},
  year      = {2020}
}

@inproceedings{fonseca2016assin,
  title={ASSIN: Avaliacao de similaridade semantica e inferencia textual},
  author={Fonseca, E and Santos, L and Criscuolo, Marcelo and Aluisio, S},
  booktitle={Computational Processing of the Portuguese Language-12th International Conference, Tomar, Portugal},
  pages={13--15},
  year={2016}
}

@inproceedings{real2020assin,
  title={The assin 2 shared task: a quick overview},
  author={Real, Livy and Fonseca, Erick and Oliveira, Hugo Goncalo},
  booktitle={International Conference on Computational Processing of the Portuguese Language},
  pages={406--412},
  year={2020},
  organization={Springer}
}
@InProceedings{huggingface:dataset:stsb_multi_mt,
title = {Machine translated multilingual STS benchmark dataset.},
author={Philip May},
year={2021},
url={https://github.com/PhilipMay/stsb-multi-mt}
}
