bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-v1 open-source model: semantic similarity computation for Portuguese legal text

Bert Large Portuguese Cased Legal Tsdae Gpl Nli Sts V1

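Developed by stjiris

A Portuguese sentence transformer specialized for the legal domain, built on the BERTimbau large model, that supports semantic similarity computation.

Text Embedding

Transformers

Other  Open-source license: MIT  #Portuguese legal text  #Semantic similarity computation  #TSDAE-enhanced training

Downloads: 17

Released: 1/5/2023

Model Overview

This is a sentence transformer model optimized for Portuguese legal text. It maps sentences to a 1024-dimensional vector space and is suited to semantic search, clustering, and text similarity tasks in the legal domain.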

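Model Features

Legal-domain optimization

Trained and optimized specifically for Portuguese legal text, using a corpus of roughly 30,000 legal documents.

Advanced training techniques

Trained with TSDAE (Transformer-based Sequential Denoising Auto-Encoder) and further strengthened with Generative Pseudo Labeling (GPL).

Multi-stage training

Fine-tuned in multiple stages on natural language inference (NLI) and semantic textual similarity (STS).

High performance

Performs well on several Portuguese STS datasets, with Pearson correlation coefficients between 0.77 and 0.84.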
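
For readers unfamiliar with how an STS score such as the Pearson figure above is obtained, the following is a minimal sketch: the model's predicted similarities for a set of sentence pairs are correlated with human-annotated gold scores. The `pearson` helper and all numbers below are hypothetical illustrations, not values from the model's evaluation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalized; the 1/n factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: gold STS annotations (0-5 scale) and the cosine
# similarities a model might predict for the same five sentence pairs.
gold_scores = [4.5, 1.0, 3.0, 2.0, 5.0]
predicted = [0.90, 0.20, 0.65, 0.35, 0.95]

r = pearson(gold_scores, predicted)
print(f"Pearson r = {r:.3f}")
```

Because Pearson correlation is invariant to linear rescaling, the gold scores and the predicted similarities do not need to share a scale.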

Model Capabilities

Sentence embedding generation

Semantic similarity computation

Legal text analysis

Portuguese language processing

Text clustering

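Use Cases

Legal text processing

Semantic search over legal documents

Enables meaning-based search over libraries of legal documents.

Performs well in a semantic search system for the Supreme Court.

Case-law similarity analysis

Automatically computes the semantic similarity between different case-law documents.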
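
The legal use cases above both reduce to ranking documents by the similarity of their embeddings to a query embedding. The sketch below illustrates that ranking step with cosine similarity. It uses toy 3-dimensional vectors in place of the model's 1024-dimensional embeddings, and the helper names `cosine_similarity` and `top_k` are hypothetical, not part of any library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, doc_vecs, k=2):
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy stand-ins for document and query embeddings (assumed values).
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.8, 0.3, 0.1],
]
query_vec = [1.0, 0.2, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
for idx, score in results:
    print(f"document {idx}: cosine similarity {score:.3f}")
```

In practice the vectors would come from encoding the query and the documents with the model, and a vector index would replace the linear scan for large collections.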

General text processing

Text clustering

Automatically groups Portuguese documents with similar content.

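🚀 stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-v1 (Legal BERTimbau)

This model maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for tasks such as clustering and semantic search.

🚀 Quick Start

Using this model is straightforward once sentence-transformers is installed.

✨ Key Features

Maps sentences and paragraphs to a 1024-dimensional dense vector space.
Can be used for tasks such as clustering and semantic search.

📦 Installation

pip install -U sentence-transformers

💻 Usage Examples

Basic usage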

from sentence_transformers import SentenceTransformer
sentences = ["Isto é um exemplo", "Isto é um outro exemplo"]

model = SentenceTransformer('stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-v1')
embeddings = model.encode(sentences)
print(embeddings)

Advanced usage

from transformers import AutoTokenizer, AutoModel
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-v1')
model = AutoModel.from_pretrained('stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-v1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
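
To make the role of the attention mask in `mean_pooling` concrete, here is a small worked example in plain Python. It is a sketch of the same masked-average idea on toy 2-dimensional vectors, not the tensor implementation above; `masked_mean` and the values are hypothetical.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1; padding (mask 0) is ignored."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        for i in range(dim):
            totals[i] += vec[i] * keep
    # Guard against division by zero, mirroring torch.clamp(..., min=1e-9).
    count = max(sum(attention_mask), 1e-9)
    return [t / count for t in totals]


# One sentence with three real tokens and one padding token (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]

pooled = masked_mean(tokens, mask)
# A naive mean over all positions lets the padding vector distort the result.
naive = [sum(column) / len(tokens) for column in zip(*tokens)]

print("masked mean:", pooled)
print("naive mean: ", naive)
```

The masked average depends only on the real tokens, which is why sentences padded to different lengths within a batch still get comparable embeddings.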

📚 Documentation

Model information

Property	Details
Model type	sentence-transformers
Training data	stjiris/portuguese-legal-sentences-v0, assin, assin2, stsb_multi_mt, stjiris/IRIS_sts

Model architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 514, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)

Citation

Contributors

@rufimelo99

BibTeX citation

@InProceedings{MeloSemantic,
  author="Melo, Rui
  and Santos, Pedro A.
  and Dias, Jo{\~a}o",
  editor="Moniz, Nuno
  and Vale, Zita
  and Cascalho, Jos{\'e}
  and Silva, Catarina
  and Sebasti{\~a}o, Raquel",
  title="A Semantic Search System for the Supremo Tribunal de Justi{\c{c}}a",
  booktitle="Progress in Artificial Intelligence",
  year="2023",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="142--154",
  abstract="Many information retrieval systems use lexical approaches to retrieve information. Such approaches have multiple limitations, and these constraints are exacerbated when tied to specific domains, such as the legal one. Large language models, such as BERT, deeply understand a language and may overcome the limitations of older methodologies, such as BM25. This work investigated and developed a prototype of a Semantic Search System to assist the Supremo Tribunal de Justi{\c{c}}a (Portuguese Supreme Court of Justice) in its decision-making process. We built a Semantic Search System that uses specially trained BERT models (Legal-BERTimbau variants) and a Hybrid Search System that incorporates both lexical and semantic techniques by combining the capabilities of BM25 and the potential of Legal-BERTimbau. In this context, we obtained a 335{\%} increase on the discovery metric when compared to BM25 for the first query result. This work also provides information on the most relevant techniques for training a Large Language Model adapted to Portuguese jurisprudence and introduces a new technique of Metadata Knowledge Distillation.",
  isbn="978-3-031-49011-8"
}


@inproceedings{souza2020bertimbau,
  author    = {F{\'a}bio Souza and
               Rodrigo Nogueira and
               Roberto Lotufo},
  title     = {{BERT}imbau: pretrained {BERT} models for {B}razilian {P}ortuguese},
  booktitle = {9th Brazilian Conference on Intelligent Systems, {BRACIS}, Rio Grande do Sul, Brazil, October 20-23 (to appear)},
  year      = {2020}
}

@inproceedings{fonseca2016assin,
  title={ASSIN: Avaliacao de similaridade semantica e inferencia textual},
  author={Fonseca, E and Santos, L and Criscuolo, Marcelo and Aluisio, S},
  booktitle={Computational Processing of the Portuguese Language-12th International Conference, Tomar, Portugal},
  pages={13--15},
  year={2016}
}

@inproceedings{real2020assin,
  title={The assin 2 shared task: a quick overview},
  author={Real, Livy and Fonseca, Erick and Oliveira, Hugo Goncalo},
  booktitle={International Conference on Computational Processing of the Portuguese Language},
  pages={406--412},
  year={2020},
  organization={Springer}
}
@InProceedings{huggingface:dataset:stsb_multi_mt,
title = {Machine translated multilingual STS benchmark dataset.},
author={Philip May},
year={2021},
url={https://github.com/PhilipMay/stsb-multi-mt}
}

📄 License

This model is released under the MIT license.