bert-large-portuguese-cased-legal-mlm-sts-v1.0 open-source model: sentence similarity for Portuguese legal text

Bert Large Portuguese Cased Legal Mlm Sts V1.0

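Developed by stjiris

A Portuguese sentence-transformer model for the legal domain, built on BERTimbau large, that supports sentence similarity computation

Text embedding

Transformers

Other #Portuguese legal text #sentence similarity #1024-dimensional vectors

Downloads: 880

Released: 11/22/2022

Model overview

This is a sentence-transformers model that maps sentences and paragraphs to a 1024-dimensional vector space, suitable for tasks such as clustering or semantic search. It is tuned for the Portuguese legal domain and trained on several Portuguese sentence-similarity datasets.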

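Model features

Legal-domain optimization

Trained and tuned specifically for the Portuguese legal domain, using sentences from about 30,000 legal documents as training data

High-performance sentence embeddings

Maps sentences and paragraphs to a 1024-dimensional dense vector space, supporting semantic search and clustering

Multi-dataset training

Trained on several datasets, including assin, assin2, and the Portuguese subset of stsb_multi_mt

Model capabilities

Sentence embedding generation

Semantic similarity computation

Legal text processing

Portuguese text analysis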

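Use cases

Legal text processing

Legal document similarity analysis

Compare the semantic similarity of different legal documents

Legal case retrieval

A legal case retrieval system based on semantic similarity

General text processing

Document clustering

Automatically group Portuguese documents with similar content

Semantic search

Build Portuguese search systems based on meaning rather than keywords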

🚀 stjiris/bert-large-portuguese-cased-legal-mlm-sts-v1.0 (Legal BERTimbau)

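This is a sentence-transformers model that maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for tasks such as clustering or semantic search. stjiris/bert-large-portuguese-cased-legal-mlm-sts-v1.0 is derived from BERTimbau large.

The model was trained with masked language modeling (MLM) at a learning rate of 3e-5 on legal sentences from roughly 30,000 documents, for 130k training steps (the setting that performed best in our semantic search system implementation). It targets the Portuguese legal domain and was also trained for semantic textual similarity (STS) on Portuguese datasets: assin, assin2, and the Portuguese subset of stsb_multi_mt.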

📦 Installation

To use this model, first install sentence-transformers:

pip install -U sentence-transformers

💻 Usage examples

Basic usage

from sentence_transformers import SentenceTransformer
sentences = ["Isto é um exemplo", "Isto é um outro exemplo"]

model = SentenceTransformer('stjiris/bert-large-portuguese-cased-legal-mlm-sts-v1.0')
embeddings = model.encode(sentences)
print(embeddings)
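
Embeddings like those returned by model.encode are typically compared with cosine similarity. The sketch below illustrates that comparison and is not part of the original model card. It ranks a few toy 3-dimensional vectors against a query. The vectors and the helper names cosine_similarity and rank_by_similarity are illustrative assumptions, standing in for real 1024-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for sentence embeddings (hypothetical values)
query = [1.0, 0.0, 1.0]
corpus = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

ranking = rank_by_similarity(query, corpus)
print(ranking)
```

With real embeddings, the sentence-transformers library offers util.cos_sim for the same computation on tensors, which is likely the more convenient choice for large corpora.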

Advanced usage

from transformers import AutoTokenizer, AutoModel
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('stjiris/bert-large-portuguese-cased-legal-mlm-sts-v1.0')
model = AutoModel.from_pretrained('stjiris/bert-large-portuguese-cased-legal-mlm-sts-v1.0')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
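
To see why the attention mask matters in mean_pooling, the following pure-Python sketch applies the same idea to a single toy sentence. It is an illustration under simplified assumptions (plain lists instead of tensors, a hypothetical helper named masked_mean, and a 0/1 mask), not the code used by the model.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            for j in range(dim):
                totals[j] += vec[j]
    # Same guard against division by zero as torch.clamp(..., min=1e-9)
    count = max(sum(attention_mask), 1e-9)
    return [t / count for t in totals]


# One sentence: two real tokens followed by one padding token (mask 0)
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

A plain average over all three rows would be dominated by the padding vector, which is why the padded positions are excluded before dividing.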

🔧 Technical details

Full model architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 514, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)

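📚 Documentation

Model information

Property	Details
Model type	sentence-transformers model that maps sentences and paragraphs to a 1024-dimensional dense vector space
Training data	assin, assin2, stjiris/portuguese-legal-sentences-v1.0, among others
Training method	MLM, learning rate 3e-5, 130k training steps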

Model evaluation results

Metric	Dataset	Value
Pearson Correlation	assin Dataset	0.7716333759993093
Pearson Correlation	assin2 Dataset	0.8403302138785704
Pearson Correlation	stsb_multi_mt pt Dataset	0.8249826985133595
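
For readers unfamiliar with the metric, the sketch below shows how a Pearson correlation between predicted similarity scores and human ratings can be computed. The card does not describe the exact evaluation protocol, so the pairing of model scores with gold ratings is an assumption, and all numbers in the example are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical gold ratings and model similarity scores for five sentence pairs
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [0.2, 0.1, 0.6, 0.5, 0.9]

r = pearson(gold, predicted)
print(round(r, 4))
```

The coefficient is invariant to the scale of either input, so cosine similarities in [-1, 1] can be compared directly against ratings on a different scale. Values closer to 1 mean the model's scores track the human ratings more linearly.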

Citation

If you use this model, please cite the following:

@InProceedings{MeloSemantic,
  author="Melo, Rui
  and Santos, Pedro A.
  and Dias, Jo{\~a}o",
  editor="Moniz, Nuno
  and Vale, Zita
  and Cascalho, Jos{\'e}
  and Silva, Catarina
  and Sebasti{\~a}o, Raquel",
  title="A Semantic Search System for the Supremo Tribunal de Justi{\c{c}}a",
  booktitle="Progress in Artificial Intelligence",
  year="2023",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="142--154",
  abstract="Many information retrieval systems use lexical approaches to retrieve information. Such approaches have multiple limitations, and these constraints are exacerbated when tied to specific domains, such as the legal one. Large language models, such as BERT, deeply understand a language and may overcome the limitations of older methodologies, such as BM25. This work investigated and developed a prototype of a Semantic Search System to assist the Supremo Tribunal de Justi{\c{c}}a (Portuguese Supreme Court of Justice) in its decision-making process. We built a Semantic Search System that uses specially trained BERT models (Legal-BERTimbau variants) and a Hybrid Search System that incorporates both lexical and semantic techniques by combining the capabilities of BM25 and the potential of Legal-BERTimbau. In this context, we obtained a 335{\%} increase on the discovery metric when compared to BM25 for the first query result. This work also provides information on the most relevant techniques for training a Large Language Model adapted to Portuguese jurisprudence and introduces a new technique of Metadata Knowledge Distillation.",
  isbn="978-3-031-49011-8"
}


@inproceedings{souza2020bertimbau,
  author    = {F{\'a}bio Souza and
               Rodrigo Nogueira and
               Roberto Lotufo},
  title     = {{BERT}imbau: pretrained {BERT} models for {B}razilian {P}ortuguese},
  booktitle = {9th Brazilian Conference on Intelligent Systems, {BRACIS}, Rio Grande do Sul, Brazil, October 20-23 (to appear)},
  year      = {2020}
}

@inproceedings{fonseca2016assin,
  title={ASSIN: Avaliacao de similaridade semantica e inferencia textual},
  author={Fonseca, E and Santos, L and Criscuolo, Marcelo and Aluisio, S},
  booktitle={Computational Processing of the Portuguese Language-12th International Conference, Tomar, Portugal},
  pages={13--15},
  year={2016}
}

@inproceedings{real2020assin,
  title={The assin 2 shared task: a quick overview},
  author={Real, Livy and Fonseca, Erick and Oliveira, Hugo Goncalo},
  booktitle={International Conference on Computational Processing of the Portuguese Language},
  pages={406--412},
  year={2020},
  organization={Springer}
}

@InProceedings{huggingface:dataset:stsb_multi_mt,
title = {Machine translated multilingual STS benchmark dataset.},
author={Philip May},
year={2021},
url={https://github.com/PhilipMay/stsb-multi-mt}
}