🚀 stjiris/bert-large-portuguese-cased-legal-mlm-nli-sts-v1 (Legal BERTimbau)
This model maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for tasks such as clustering and semantic search.
🚀 Quick Start
This is a BERT model for legal-domain Portuguese that can be used to compute sentence similarity.
Model Information

| Attribute | Details |
|-----------|---------|
| Model type | sentence-similarity |
| Training data | stjiris/portuguese-legal-sentences-v0, assin, assin2, stsb_multi_mt, stjiris/IRIS_sts |
| License | MIT |
Widget Examples
- Source sentence: "O advogado apresentou as provas ao juiz."
- Comparison sentences:
  - "O juiz leu as provas."
  - "O juiz leu o recurso."
  - "O juiz atirou uma pedra."
Model Evaluation Results
- Model name: BERTimbau
- Task: STS
- Metrics (Pearson correlation; a reproduction sketch follows below):
  - assin dataset: 0.7774097897260964
  - assin2 dataset: 0.8097518625809903
  - stsb_multi_mt (pt) dataset: 0.8358844307795662
  - IRIS STS dataset: 0.7856746037418626
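The Pearson figures above compare model-predicted similarities against gold STS annotations. The exact evaluation harness is not documented here, but as a hedged sketch (the sentence pairs and gold scores below are invented for illustration), sentence-transformers' EmbeddingSimilarityEvaluator computes this kind of correlation:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer('stjiris/bert-large-portuguese-cased-legal-mlm-nli-sts-v1')

# Hypothetical STS pairs with gold similarity scores normalized to [0, 1]
sentences1 = ["O juiz leu as provas.", "O juiz leu o recurso."]
sentences2 = ["O advogado apresentou as provas ao juiz.", "O juiz atirou uma pedra."]
gold_scores = [0.8, 0.2]

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores)
print(evaluator(model))  # Pearson/Spearman correlation between model and gold scores
```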
✨ Key Features
- Computes similarity between legal-domain Portuguese sentences.
- Maps sentences and paragraphs to a 1024-dimensional dense vector space, suitable for downstream tasks such as clustering (see the sketch below).
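As a minimal sketch of the clustering use case (the corpus and cluster count are illustrative assumptions; requires scikit-learn):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('stjiris/bert-large-portuguese-cased-legal-mlm-nli-sts-v1')

# Illustrative corpus mixing two legal topics: evidence/appeals vs. contracts
corpus = [
    "O juiz leu as provas.",
    "O advogado apresentou o recurso.",
    "O contrato foi assinado pelas partes.",
    "As partes celebraram o contrato.",
]

embeddings = model.encode(corpus)

# Group the 1024-dimensional embeddings into two clusters
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
for sentence, label in zip(corpus, labels):
    print(label, sentence)
```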
📦 Installation

```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage (Sentence-Transformers)
```python
from sentence_transformers import SentenceTransformer

sentences = ["Isto é um exemplo", "Isto é um outro exemplo"]

# Load the model and encode the sentences into 1024-dimensional vectors
model = SentenceTransformer('stjiris/bert-large-portuguese-cased-legal-mlm-nli-sts-v1')
embeddings = model.encode(sentences)
print(embeddings)
```
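Since the model targets sentence similarity, a natural next step is ranking candidates against a query, as in the widget example above. A minimal sketch using sentence-transformers' util.cos_sim:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('stjiris/bert-large-portuguese-cased-legal-mlm-nli-sts-v1')

source = "O advogado apresentou as provas ao juiz."
candidates = [
    "O juiz leu as provas.",
    "O juiz leu o recurso.",
    "O juiz atirou uma pedra.",
]

# Encode and compare: higher cosine similarity means closer in meaning
source_emb = model.encode(source, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(source_emb, cand_embs)[0]

for sentence, score in sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.4f}  {sentence}")
```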
Advanced Usage (HuggingFace Transformers)
```python
from transformers import AutoTokenizer, AutoModel
import torch

def mean_pooling(model_output, attention_mask):
    # Mean-pool token embeddings, ignoring padding via the attention mask
    token_embeddings = model_output[0]  # first element: all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['This is an example sentence', 'Each sentence is converted']

tokenizer = AutoTokenizer.from_pretrained('stjiris/bert-large-portuguese-cased-legal-mlm-nli-sts-v1')
model = AutoModel.from_pretrained('stjiris/bert-large-portuguese-cased-legal-mlm-nli-sts-v1')

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
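To turn these pooled embeddings into similarity scores, one common follow-up (continuing from sentence_embeddings in the snippet above) is L2 normalization plus a dot product:

```python
import torch.nn.functional as F

# L2-normalize so that the dot product equals cosine similarity
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
print(normalized @ normalized.T)  # pairwise cosine similarity matrix
```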
🔧 Technical Details
Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 514, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
```
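A quick sanity check of the dimensions above, as a sketch (attribute names follow the sentence-transformers API):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('stjiris/bert-large-portuguese-cased-legal-mlm-nli-sts-v1')
print(model.get_sentence_embedding_dimension())  # expected: 1024
print(model.max_seq_length)                      # expected: 514
```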
📄 License
This model is released under the MIT license.
📚 Citations and Authors
Contributors
@rufimelo99
Citation Information
```bibtex
@InProceedings{MeloSemantic,
author="Melo, Rui
and Santos, Pedro A.
and Dias, Jo{\~a}o",
editor="Moniz, Nuno
and Vale, Zita
and Cascalho, Jos{\'e}
and Silva, Catarina
and Sebasti{\~a}o, Raquel",
title="A Semantic Search System for the Supremo Tribunal de Justi{\c{c}}a",
booktitle="Progress in Artificial Intelligence",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="142--154",
abstract="Many information retrieval systems use lexical approaches to retrieve information. Such approaches have multiple limitations, and these constraints are exacerbated when tied to specific domains, such as the legal one. Large language models, such as BERT, deeply understand a language and may overcome the limitations of older methodologies, such as BM25. This work investigated and developed a prototype of a Semantic Search System to assist the Supremo Tribunal de Justi{\c{c}}a (Portuguese Supreme Court of Justice) in its decision-making process. We built a Semantic Search System that uses specially trained BERT models (Legal-BERTimbau variants) and a Hybrid Search System that incorporates both lexical and semantic techniques by combining the capabilities of BM25 and the potential of Legal-BERTimbau. In this context, we obtained a {\$}{\$}335{\backslash}{\%}{\$}{\$}335{\%}increase on the discovery metric when compared to BM25 for the first query result. This work also provides information on the most relevant techniques for training a Large Language Model adapted to Portuguese jurisprudence and introduces a new technique of Metadata Knowledge Distillation.",
isbn="978-3-031-49011-8"
}
@inproceedings{souza2020bertimbau,
author = {F{\'a}bio Souza and
Rodrigo Nogueira and
Roberto Lotufo},
title = {{BERT}imbau: pretrained {BERT} models for {B}razilian {P}ortuguese},
booktitle = {9th Brazilian Conference on Intelligent Systems, {BRACIS}, Rio Grande do Sul, Brazil, October 20-23 (to appear)},
year = {2020}
}
@inproceedings{fonseca2016assin,
title={ASSIN: Avalia{\c{c}}{\~a}o de similaridade sem{\^a}ntica e infer{\^e}ncia textual},
author={Fonseca, E and Santos, L and Criscuolo, Marcelo and Aluisio, S},
booktitle={Computational Processing of the Portuguese Language-12th International Conference, Tomar, Portugal},
pages={13--15},
year={2016}
}
@inproceedings{real2020assin,
title={The assin 2 shared task: a quick overview},
author={Real, Livy and Fonseca, Erick and Oliveira, Hugo Goncalo},
booktitle={International Conference on Computational Processing of the Portuguese Language},
pages={406--412},
year={2020},
organization={Springer}
}
@InProceedings{huggingface:dataset:stsb_multi_mt,
title = {Machine translated multilingual STS benchmark dataset.},
author={Philip May},
year={2021},
url={https://github.com/PhilipMay/stsb-multi-mt}
}
```