Open-source model bert-large-portuguese-cased-legal-mlm-nli-sts-v1 - Supports Portuguese legal sentence similarity calculation and semantic search

Bert Large Portuguese Cased Legal Mlm Nli Sts V1

Developed by stjiris

A Portuguese BERT model specialized for the legal domain based on the BERTimbau large model, supporting sentence similarity calculation and semantic search

Text Embedding

Transformers

Open Source License: MIT
Tags: Portuguese legal text, semantic similarity calculation, legal domain BERT

Downloads 331

Release Time: 1/6/2023

Model Overview

This is a BERT model optimized for Portuguese legal texts, capable of mapping sentences and paragraphs into a 1024-dimensional vector space, suitable for natural language processing tasks such as clustering and semantic search.

Model Features

Legal domain optimization

Domain-adapted on legal sentences drawn from approximately 30,000 legal documents

Multi-stage training

Trained in three stages: MLM training on legal sentences, NLI training, and fine-tuning for STS

High-dimensional vector space

Generates 1024-dimensional dense vectors that represent the meaning of legal sentences and paragraphs

Model Capabilities

Sentence vectorization

Semantic similarity calculation

Legal text analysis

Semantic search

Text clustering

Use Cases

Judicial system

Legal document semantic search

Implement semantic-based retrieval of similar cases in legal document repositories

Practically applied in the IRIS project, improving legal retrieval efficiency

Judgment analysis

Analyze key sentence similarity in judgments

Natural language processing

Text similarity calculation

Calculate semantic similarity between two Portuguese sentences

Achieved a Pearson correlation coefficient of 0.81 on the assin2 dataset
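
The semantic search use case above can be sketched as a nearest-neighbor lookup over sentence embeddings. This is a minimal illustration, not the IRIS implementation: the helper names `normalize` and `top_k` are hypothetical, and the tiny 3-dimensional vectors stand in for the 1024-dimensional embeddings the model would produce.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def top_k(query_vec, corpus_vecs, k=2):
    """Return (index, score) pairs for the k corpus vectors most similar to the query."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(corpus_vecs):
        d = normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, d))))
    # Highest cosine similarity first.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy stand-ins for document embeddings (assumption: real ones come from model.encode).
corpus = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 1.0, 0.0],
]
query = [0.9, 0.1, 0.0]

results = top_k(query, corpus, k=2)
for idx, score in results:
    print(idx, round(score, 3))
```

Normalizing both sides once keeps the ranking step a plain dot product, which is also the form most vector indexes expect.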

🚀 Portuguese BERT for the Legal Domain

This is a sentence-transformers model that maps sentences & paragraphs to a 1024-dimensional dense vector space, suitable for tasks like clustering or semantic search.

✨ Features

Based on the BERTimbau architecture, specifically tailored for the legal domain in Portuguese.
Trained on multiple datasets, including legal sentences and semantic similarity benchmarks, to achieve high performance in semantic search tasks.

📦 Installation

To use this model, you need to install sentence-transformers:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

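from sentence_transformers import SentenceTransformer

# Portuguese example sentences to embed
sentences = ["Isto é um exemplo", "Isto é um outro exemplo"]

# Load the model from the Hugging Face Hub and encode each sentence into a 1024-dimensional vector
model = SentenceTransformer('stjiris/bert-large-portuguese-cased-legal-mlm-nli-sts-v1')
embeddings = model.encode(sentences)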
print(embeddings)
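
Once two sentences are encoded, their semantic similarity is commonly scored with the cosine of the angle between the two vectors. The sketch below assumes the embeddings can be treated as plain sequences of floats (a row of the array returned by `model.encode` iterates this way by default). The function name `cosine_similarity` and the short example vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two 1024-dimensional sentence embeddings.
emb_a = [3.0, 4.0, 0.0]
emb_b = [4.0, 3.0, 0.0]

similarity = cosine_similarity(emb_a, emb_b)
print(round(similarity, 2))  # 0.96
```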

Advanced Usage

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account so padding tokens do not affect the average
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('stjiris/bert-large-portuguese-cased-legal-mlm-nli-sts-v1')
model = AutoModel.from_pretrained('stjiris/bert-large-portuguese-cased-legal-mlm-nli-sts-v1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
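
To see why `mean_pooling` multiplies by the attention mask, the following pure-Python sketch reproduces the same arithmetic for a single padded sentence. The function name `masked_mean` and the toy 2-dimensional token vectors are assumptions made for illustration. The padding vector is deliberately exaggerated to make its effect visible.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    for vec, m in zip(token_embeddings, attention_mask):
        for i in range(dim):
            sums[i] += vec[i] * m
    # Mirrors torch.clamp(..., min=1e-9): avoids division by zero for an all-zero mask.
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in sums]


# Two real tokens followed by one padding position.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
naive = [sum(col) / len(tokens) for col in zip(*tokens)]

print(pooled)  # [2.0, 3.0]
print(naive)   # distorted by the padding vector
```

The masked average depends only on the real tokens, so the same sentence yields the same embedding regardless of how much padding its batch adds.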

📚 Documentation

Model Details

Model Type: Sentence-transformers based on BERTimbau architecture.
Training Data:
- stjiris/portuguese-legal-sentences-v0
- assin
- assin2
- stsb_multi_mt
- stjiris/IRIS_sts

Performance Metrics

Metric	Dataset	Value
Pearson Correlation	assin Dataset	0.7774097897260964
Pearson Correlation	assin2 Dataset	0.8097518625809903
Pearson Correlation	stsb_multi_mt pt Dataset	0.8358844307795662
Pearson Correlation	IRIS STS Dataset	0.7856746037418626
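
The values above are Pearson correlations. In STS evaluation this is typically computed between the similarity scores a model predicts for sentence pairs and the human-annotated scores; the evaluation script used for this model is not shown here, so that pairing is an assumption. The sketch below uses made-up numbers, not real evaluation data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("Pearson correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical predicted cosine similarities and gold scores on a 1-5 scale.
predicted = [0.1, 0.4, 0.5, 0.9]
gold = [1.0, 2.5, 3.0, 5.0]  # Here gold = 5 * predicted + 0.5, a perfect linear relation.

r = pearson(predicted, gold)
print(round(r, 4))  # 1.0
```

Because the coefficient is invariant to linear rescaling, cosine similarities do not need to be mapped onto the annotation scale before scoring.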

🔧 Technical Details

The model was trained in multiple stages:

MLM Training: Masked language modeling (MLM) on legal sentences from approximately 30,000 documents, with a learning rate of 1e-5 for 15,000 training steps.
NLI Training: Trained on NLI data with a batch size of 16 and a learning rate of 2e-5.
Fine-tuning for STS: Fine-tuned on the assin, assin2, stsb_multi_mt pt, and IRIS STS datasets with a learning rate of 1e-5.

📄 License

This project is licensed under the MIT license.

Citing & Authors

Contributions

@rufimelo99

If you use this work, please cite:

@InProceedings{MeloSemantic,
  author="Melo, Rui
  and Santos, Pedro A.
  and Dias, Jo{\~a}o",
  editor="Moniz, Nuno
  and Vale, Zita
  and Cascalho, Jos{\'e}
  and Silva, Catarina
  and Sebasti{\~a}o, Raquel",
  title="A Semantic Search System for the Supremo Tribunal de Justi{\c{c}}a",
  booktitle="Progress in Artificial Intelligence",
  year="2023",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="142--154",
  abstract="Many information retrieval systems use lexical approaches to retrieve information. Such approaches have multiple limitations, and these constraints are exacerbated when tied to specific domains, such as the legal one. Large language models, such as BERT, deeply understand a language and may overcome the limitations of older methodologies, such as BM25. This work investigated and developed a prototype of a Semantic Search System to assist the Supremo Tribunal de Justi{\c{c}}a (Portuguese Supreme Court of Justice) in its decision-making process. We built a Semantic Search System that uses specially trained BERT models (Legal-BERTimbau variants) and a Hybrid Search System that incorporates both lexical and semantic techniques by combining the capabilities of BM25 and the potential of Legal-BERTimbau. In this context, we obtained a 335{\%} increase on the discovery metric when compared to BM25 for the first query result. This work also provides information on the most relevant techniques for training a Large Language Model adapted to Portuguese jurisprudence and introduces a new technique of Metadata Knowledge Distillation.",
  isbn="978-3-031-49011-8"
}


@inproceedings{souza2020bertimbau,
  author    = {F{\'a}bio Souza and
               Rodrigo Nogueira and
               Roberto Lotufo},
  title     = {{BERT}imbau: pretrained {BERT} models for {B}razilian {P}ortuguese},
  booktitle = {9th Brazilian Conference on Intelligent Systems, {BRACIS}, Rio Grande do Sul, Brazil, October 20-23 (to appear)},
  year      = {2020}
}

@inproceedings{fonseca2016assin,
  title={ASSIN: Avaliacao de similaridade semantica e inferencia textual},
  author={Fonseca, E and Santos, L and Criscuolo, Marcelo and Aluisio, S},
  booktitle={Computational Processing of the Portuguese Language-12th International Conference, Tomar, Portugal},
  pages={13--15},
  year={2016}
}

@inproceedings{real2020assin,
  title={The assin 2 shared task: a quick overview},
  author={Real, Livy and Fonseca, Erick and Oliveira, Hugo Goncalo},
  booktitle={International Conference on Computational Processing of the Portuguese Language},
  pages={406--412},
  year={2020},
  organization={Springer}
}

@InProceedings{huggingface:dataset:stsb_multi_mt,
  title={Machine translated multilingual STS benchmark dataset.},
  author={Philip May},
  year={2021},
  url={https://github.com/PhilipMay/stsb-multi-mt}
}

Work developed as part of Project IRIS.

Thesis: A Semantic Search System for Supremo Tribunal de Justiça
