bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-MetaKD-v1 Open Source Model - Handling Portuguese Legal Semantic Similarity and Clustering

Developed by stjiris

A Portuguese sentence transformer specialized for the legal domain based on the BERTimbau large model, suitable for semantic similarity calculation and clustering tasks

Text Embedding

Transformers

Open Source License: MIT

Tags: #Portuguese legal semantic analysis #High-precision sentence similarity #Judicial document retrieval

Downloads 74

Release Date: 3/3/2023

Model Overview

This is a sentence transformer model optimized for Portuguese legal texts, capable of mapping sentences to a 1024-dimensional dense vector space, particularly suitable for semantic search and similarity calculation tasks involving legal documents.

Model Features

Legal Domain Optimization

Trained on legal sentences drawn from approximately 30,000 documents

Advanced Training Techniques

Uses TSDAE training, Generative Pseudo Labeling (GPL), and Metadata Knowledge Distillation to improve its semantic representations

Multi-Dataset Fine-Tuning

Fine-tuned for semantic textual similarity on the Portuguese datasets assin, assin2, and stsb_multi_mt (pt)

High-Dimensional Vector Space

Maps text to a 1024-dimensional dense vector space, suitable for complex semantic analysis

Model Capabilities

Semantic similarity calculation

Legal text clustering

Information retrieval

Sentence vectorization

Use Cases

Legal Document Processing

Legal Document Similarity Analysis

Calculate semantic similarity between different legal documents

Targets judicial documents of the Supremo Tribunal de Justiça (STJ)

Legal Semantic Search System

Build a semantic-based legal document retrieval system

Used in a semantic search prototype for the Supremo Tribunal de Justiça (Portuguese Supreme Court of Justice)
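
The retrieval step of such a system can be sketched as ranking documents by cosine similarity to a query embedding. This is a minimal illustration, not the authors' implementation: the helper names `cosine_similarity` and `rank_documents` are hypothetical, and the 3-dimensional toy vectors stand in for the 1024-dimensional embeddings that `model.encode(...)` would return.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (index, score) pairs for the top_k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy stand-ins for document and query embeddings (assumed values).
toy_docs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
toy_query = [0.9, 0.1, 0.0]

ranking = rank_documents(toy_query, toy_docs, top_k=2)
print(ranking)  # document 0 ranks first, document 1 second
```

In practice the document embeddings would be computed once and stored, so that only the query needs to be encoded at search time.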

Text Analysis

Legal Text Clustering

Automatically classify and cluster large volumes of legal documents
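
The page does not say which clustering algorithm is used, so the following is only one possible sketch: a greedy threshold clustering over cosine similarity. The function names and the `threshold` value are assumptions, and the 2-dimensional vectors are toy stand-ins for real sentence embeddings.

```python
import math


def _cos(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def threshold_cluster(vectors, threshold=0.8):
    """Greedy clustering: a vector joins the first cluster whose seed
    (its first member) is at least `threshold` similar; otherwise it
    starts a new cluster. Returns a list of index lists."""
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if _cos(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no existing cluster was close enough
    return clusters


# Two toy groups pointing in roughly orthogonal directions (assumed values).
toy = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.0, 0.9]]
clusters = threshold_cluster(toy, threshold=0.8)
print(clusters)
```

The result of this greedy scheme depends on input order and on the threshold; for large collections a dedicated clustering library is likely a better fit.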

🚀 stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-MetaKD-v1 (Legal BERTimbau)

This is a sentence-transformers model that maps sentences and paragraphs to a 1024-dimensional dense vector space, which can be used for tasks such as clustering or semantic search.

📦 Installation

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer
sentences = ["Isto é um exemplo", "Isto é um outro exemplo"]

model = SentenceTransformer('stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-MetaKD-v1')
embeddings = model.encode(sentences)
print(embeddings)

Advanced Usage

Without sentence-transformers, you can load the model with Hugging Face Transformers: pass the input through the transformer, then apply mean pooling over the token embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-MetaKD-v1')
model = AutoModel.from_pretrained('stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-MetaKD-v1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
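
To see why the attention mask matters in `mean_pooling`, here is the same arithmetic for a single sentence on plain Python lists. This is an illustrative sketch with made-up numbers; `masked_mean` is a hypothetical helper, not part of any library.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1; padding (0) is skipped."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                totals[j] += value
    count = max(count, 1e-9)  # mirrors the clamp that avoids division by zero
    return [t / count for t in totals]


# Two real tokens followed by one padding token with arbitrary values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
naive = [sum(col) / len(tokens) for col in zip(*tokens)]

print(pooled)  # average over the two real tokens only
print(naive)   # distorted by the padding row
```

Averaging without the mask lets padding positions leak into the sentence embedding, which is what the masked version prevents.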

📚 Documentation

Model Information: stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-v0 derives from stjiris/bert-large-portuguese-cased-legal-tsdae (legal variant of BERTimbau large).
Training Process:
- It was trained using the TSDAE technique with a learning rate of 1e-5 on legal sentences from approximately 30,000 documents for 212k training steps (the best-performing setting for our semantic search system implementation).
- It was trained with Generative Pseudo Labeling (GPL).
- The model was trained on NLI data with a batch size of 16 and a learning rate of 2e-5.
- It was fine-tuned for Semantic Textual Similarity on the assin, assin2, and stsb_multi_mt pt datasets with a learning rate of 1e-5.
- This model was subjected to Metadata Knowledge Distillation.

🔧 Technical Details

SentenceTransformer(
  (0): Transformer({'max_seq_length': 514, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)

📄 License

This project is licensed under the MIT license.

👨‍💻 Citing & Authors

Contributions

@rufimelo99

If you use this work, please cite:

@InProceedings{MeloSemantic,
  author="Melo, Rui
  and Santos, Pedro A.
  and Dias, Jo{\~a}o",
  editor="Moniz, Nuno
  and Vale, Zita
  and Cascalho, Jos{\'e}
  and Silva, Catarina
  and Sebasti{\~a}o, Raquel",
  title="A Semantic Search System for the Supremo Tribunal de Justi{\c{c}}a",
  booktitle="Progress in Artificial Intelligence",
  year="2023",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="142--154",
  abstract="Many information retrieval systems use lexical approaches to retrieve information. Such approaches have multiple limitations, and these constraints are exacerbated when tied to specific domains, such as the legal one. Large language models, such as BERT, deeply understand a language and may overcome the limitations of older methodologies, such as BM25. This work investigated and developed a prototype of a Semantic Search System to assist the Supremo Tribunal de Justi{\c{c}}a (Portuguese Supreme Court of Justice) in its decision-making process. We built a Semantic Search System that uses specially trained BERT models (Legal-BERTimbau variants) and a Hybrid Search System that incorporates both lexical and semantic techniques by combining the capabilities of BM25 and the potential of Legal-BERTimbau. In this context, we obtained a 335{\%} increase on the discovery metric when compared to BM25 for the first query result. This work also provides information on the most relevant techniques for training a Large Language Model adapted to Portuguese jurisprudence and introduces a new technique of Metadata Knowledge Distillation.",
  isbn="978-3-031-49011-8"
}


@inproceedings{souza2020bertimbau,
  author    = {F{\'a}bio Souza and
               Rodrigo Nogueira and
               Roberto Lotufo},
  title     = {{BERT}imbau: pretrained {BERT} models for {B}razilian {P}ortuguese},
  booktitle = {9th Brazilian Conference on Intelligent Systems, {BRACIS}, Rio Grande do Sul, Brazil, October 20-23 (to appear)},
  year      = {2020}
}

@inproceedings{fonseca2016assin,
  title={ASSIN: Avaliacao de similaridade semantica e inferencia textual},
  author={Fonseca, E and Santos, L and Criscuolo, Marcelo and Aluisio, S},
  booktitle={Computational Processing of the Portuguese Language-12th International Conference, Tomar, Portugal},
  pages={13--15},
  year={2016}
}

@inproceedings{real2020assin,
  title={The assin 2 shared task: a quick overview},
  author={Real, Livy and Fonseca, Erick and Oliveira, Hugo Goncalo},
  booktitle={International Conference on Computational Processing of the Portuguese Language},
  pages={406--412},
  year={2020},
  organization={Springer}
}
@InProceedings{huggingface:dataset:stsb_multi_mt,
title = {Machine translated multilingual STS benchmark dataset.},
author={Philip May},
year={2021},
url={https://github.com/PhilipMay/stsb-multi-mt}
}

📊 Model Metrics

Property	Details
Model Type	Sentence-transformers model
Training Data	stjiris/portuguese-legal-sentences-v0, assin, assin2, stsb_multi_mt pt
Pearson Correlation - assin Dataset	0.8054285867337523
Pearson Correlation - assin2 Dataset	0.834663784004652
Pearson Correlation - stsb_multi_mt pt Dataset	0.7774871148943012
Pearson Correlation - IRIS sts pt Dataset	0.8054285867337523
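
The Pearson figures above measure how closely predicted similarity scores track human-annotated scores. How the evaluation was run is not described here, so the following is only a sketch of the statistic itself; the `predicted` and `gold` values are hypothetical, not taken from any of the listed datasets.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical cosine similarities from a model and matching human scores.
predicted = [0.91, 0.35, 0.62, 0.10]
gold = [4.8, 2.0, 3.5, 1.0]

r = pearson(predicted, gold)
print(round(r, 4))
```

Because Pearson correlation is invariant to linear rescaling, cosine similarities and gold scores on a different scale can be compared directly.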
