Legalbert-large-1.7M-2: an open-source model for language understanding tasks on English legal text

Legalbert Large 1.7M 2

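Developed by pile-of-law

A BERT large model pretrained on English legal and administrative text with the RoBERTa pretraining objective, specialized for language understanding tasks in the legal domain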

Large language model

Transformers

English #Legal text pretraining #English legal analysis #Masked language modeling

Downloads: 701

Release date: 4/29/2022

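Model overview

This is a transformers model based on the BERT large architecture, pretrained on the Pile of Law dataset (about 256GB of English legal and administrative text). It is likely to be better suited to law-related downstream tasks.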

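Model features

Legal-domain specialization

Pretrained specifically on legal and administrative text, so the training data covers legal terminology and phrasing

RoBERTa pretraining objective

Uses the masked language modeling objective without the NSP loss, as described in RoBERTa, rather than the original BERT training recipe

Large-scale training data

Trained on the roughly 256GB Pile of Law dataset, which comprises 35 legal data sources

Preprocessing tailored to legal text

Uses the LexNLP sentence segmenter, which handles legal citations that are often mistaken for sentences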

Model capabilities

Legal text understanding

Masked language modeling

Legal document analysis

Legal term identification

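Use cases

Legal text processing

Legal clause completion

Fills in masked (missing) tokens in legal documents

In the example 'An [MASK] is a request...', the top prediction is 'exception' (score 0.52), with 'appeal' ranked second (score 0.11)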

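Legal document classification

Classifies legal documents after fine-tuning on a labeled downstream task

Legal research support

Legal terms and concepts

Predicts legal terms and concepts in context; as a masked language model it fills in tokens rather than generating explanations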

🚀 Pile of Law BERT large model 2 (uncased)

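A model pretrained on English legal and administrative text using the RoBERTa pretraining objective. It was trained with the same setup as pile-of-law/legalbert-large-1.7M-1, but with a different seed.

🚀 Quick Start

Pile of Law BERT large 2 is a pretrained model specialized for English legal and administrative text. It can be used for masked language modeling or fine-tuned on downstream tasks.

✨ Key Features

Pretrained on English legal and administrative text.
Uses the BERT large model (uncased) architecture.
Usable for masked language modeling or for fine-tuning on downstream tasks.

📦 Installation

Using this model requires the transformers library, which can be installed with the following command. The examples below also need a backend such as PyTorch or TensorFlow.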

pip install transformers

💻 Usage Examples

Basic usage

>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-2')
>>> pipe("An [MASK] is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.")

[{'sequence': 'an exception is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 
  'score': 0.5218929052352905, 
  'token': 4028, 
  'token_str': 'exception'}, 
  {'sequence': 'an appeal is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 
  'score': 0.11434809118509293, 
  'token': 1151, 
  'token_str': 'appeal'}, 
  {'sequence': 'an exclusion is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 
  'score': 0.06454459577798843, 
  'token': 5345, 
  'token_str': 'exclusion'}, 
  {'sequence': 'an example is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 
  'score': 0.043593790382146835, 
  'token': 3677, 
  'token_str': 'example'}, 
  {'sequence': 'an objection is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 
  'score': 0.03758585825562477, 
  'token': 3542, 
  'token_str': 'objection'}]
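
The pipeline returns a plain list of dictionaries, so its predictions can be inspected with ordinary Python. The following is a minimal sketch, not part of the original model card: it copies the scores from the output above (the sequence fields are omitted for brevity), and the helper names rank_of and covered_mass are hypothetical, not part of transformers. It checks where an expected term such as "appeal" lands in the ranking and how much probability the listed candidates cover together.

```python
from typing import Dict, List, Optional

# Scores copied from the fill-mask output above; 'sequence' fields omitted.
predictions: List[Dict] = [
    {"score": 0.5218929052352905, "token": 4028, "token_str": "exception"},
    {"score": 0.11434809118509293, "token": 1151, "token_str": "appeal"},
    {"score": 0.06454459577798843, "token": 5345, "token_str": "exclusion"},
    {"score": 0.043593790382146835, "token": 3677, "token_str": "example"},
    {"score": 0.03758585825562477, "token": 3542, "token_str": "objection"},
]


def rank_of(preds: List[Dict], target: str) -> Optional[int]:
    """Return the 1-based rank of `target` by score, or None if it is absent."""
    ordered = sorted(preds, key=lambda p: p["score"], reverse=True)
    for rank, pred in enumerate(ordered, start=1):
        if pred["token_str"] == target:
            return rank
    return None


def covered_mass(preds: List[Dict]) -> float:
    """Sum of the scores of the listed candidates."""
    return sum(p["score"] for p in preds)


print(rank_of(predictions, "appeal"))        # 2
print(f"{covered_mass(predictions):.3f}")    # mass covered by the top five
```

A check like this can be a quick sanity test of the raw model on domain vocabulary before any fine-tuning.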

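Advanced usage

Feature extraction in PyTorch

from transformers import BertTokenizer, BertModel
# Load the tokenizer and the encoder weights from the Hugging Face Hub
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = BertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
text = "Replace me by any text you'd like."
# Tokenize the text and return PyTorch tensors
encoded_input = tokenizer(text, return_tensors='pt')
# output.last_hidden_state holds one feature vector per input token
output = model(**encoded_input)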

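Feature extraction in TensorFlow

from transformers import BertTokenizer, TFBertModel
# Load the tokenizer and the TensorFlow encoder weights from the Hugging Face Hub
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = TFBertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
text = "Replace me by any text you'd like."
# Tokenize the text and return TensorFlow tensors
encoded_input = tokenizer(text, return_tensors='tf')
# output.last_hidden_state holds one feature vector per input token
output = model(encoded_input)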

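📚 Documentation

Model description

Pile of Law BERT large 2 is a transformers model with the BERT large model (uncased) architecture, pretrained on the Pile of Law, a dataset of roughly 256GB of English legal and administrative text for language model pretraining.

Intended uses and limitations

You can use the raw model for masked language modeling or fine-tune it on a downstream task. Because the model was pretrained on a corpus of English legal and administrative text, it is likely to be better suited to legal downstream tasks.

Limitations and bias

See Appendix G of the Pile of Law paper for copyright limitations related to the use of the dataset and the model.

This model can make biased predictions. In the following example, which uses the model in a masked language modeling pipeline, the model assigns a higher score to "black" than to "white" as the race descriptor of the criminal.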

>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-2')
>>> pipe("The transcript of evidence reveals that at approximately 7:30 a. m. on January 22, 1973, the prosecutrix was awakened in her home in DeKalb County by the barking of the family dog, and as she opened her eyes she saw a [MASK] man standing beside her bed with a gun.", targets=["black", "white"])

[{'sequence': 'the transcript of evidence reveals that at approximately 7 : 30 a. m. on january 22, 1973, the prosecutrix was awakened in her home in dekalb county by the barking of the family dog, and as she opened her eyes she saw a black man standing beside her bed with a gun.', 
  'score': 0.02685137465596199, 
  'token': 4311, 
  'token_str': 'black'}, 
  {'sequence': 'the transcript of evidence reveals that at approximately 7 : 30 a. m. on january 22, 1973, the prosecutrix was awakened in her home in dekalb county by the barking of the family dog, and as she opened her eyes she saw a white man standing beside her bed with a gun.', 
  'score': 0.013632853515446186, 
  'token': 4249, 
  'token_str': 'white'}]

This bias will also affect all fine-tuned versions of this model.
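
The size of the gap in the example above can be put in numbers. The sketch below is illustrative only: it copies the two scores from the output above, and the helper names score_ratio and log_odds_gap are hypothetical. It shows that the model scores "black" roughly twice as high as "white" for this one sentence; a single prompt is not a bias benchmark, so the figure should be read as an example rather than a measurement of the model overall.

```python
import math
from typing import Dict

# Scores copied from the fill-mask output above.
scores: Dict[str, float] = {
    "black": 0.02685137465596199,
    "white": 0.013632853515446186,
}


def score_ratio(s: Dict[str, float], a: str, b: str) -> float:
    """How many times higher the score of target `a` is than that of `b`."""
    return s[a] / s[b]


def log_odds_gap(s: Dict[str, float], a: str, b: str) -> float:
    """Difference of log scores; 0.0 would mean both targets score equally."""
    return math.log(s[a]) - math.log(s[b])


ratio = score_ratio(scores, "black", "white")
gap = log_odds_gap(scores, "black", "white")
print(f"ratio={ratio:.2f}, log gap={gap:.2f}")
```

Running the same comparison over many templated sentences, and averaging the log gap, would give a steadier picture than any single example.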

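Training data

The Pile of Law BERT large model was pretrained on the Pile of Law, a dataset of roughly 256GB of English legal and administrative text for language model pretraining. The Pile of Law consists of 35 data sources, including legal analyses, court opinions and filings, government agency publications, contracts, statutes, regulations, and casebooks. The data sources are described in detail in Appendix E of the Pile of Law paper. The Pile of Law dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.

Training procedure

Preprocessing

The model vocabulary consists of 29,000 tokens from a custom WordPiece vocabulary fit to the Pile of Law with the HuggingFace WordPiece tokenizer, plus 3,000 legal terms randomly sampled from Black's Law Dictionary, for a vocabulary size of 32,000 tokens. As in BERT, an 80-10-10 split between masking, corrupting, and leaving tokens unchanged is used, with a replication rate of 20 to create different masks for each context. Sequences are generated with the LexNLP sentence segmenter, which handles sentence segmentation around legal citations (which are often mistaken for sentences). Inputs are formatted by filling sentences until they comprise 256 tokens, followed by a [SEP] token, and then filling sentences so that the entire span is under 512 tokens. If the next sentence in the series is too large, it is not added, and the remaining context length is filled with padding tokens.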
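
The input formatting described above can be sketched in a few lines. This is one possible reading, not the authors' preprocessing code: it assumes sentences are already tokenized, that the first segment takes whole sentences while it stays within 256 tokens, and that the second segment takes whole sentences while the total stays under 512. The function name pack_example is hypothetical, and special tokens other than [SEP] and padding are left out because the description does not mention them.

```python
from typing import List, Tuple

SEP, PAD = "[SEP]", "[PAD]"


def pack_example(
    sentences: List[List[str]], first_len: int = 256, max_len: int = 512
) -> Tuple[List[str], int]:
    """Pack whole tokenized sentences into one fixed-length input.

    Returns the packed tokens and the number of sentences consumed.
    Assumption: a sentence longer than the remaining room is never split;
    sentences longer than max_len would need separate handling.
    """
    tokens: List[str] = []
    used = 0
    # First segment: add whole sentences while staying within first_len.
    while used < len(sentences) and len(tokens) + len(sentences[used]) <= first_len:
        tokens.extend(sentences[used])
        used += 1
    tokens.append(SEP)
    # Second segment: add whole sentences while the total stays under max_len.
    while used < len(sentences) and len(tokens) + len(sentences[used]) < max_len:
        tokens.extend(sentences[used])
        used += 1
    # Fill the remaining context length with padding tokens.
    tokens.extend([PAD] * (max_len - len(tokens)))
    return tokens, used


# Toy run with small limits so the result is easy to read.
toy = [["the", "court"], ["held", "that"], ["see", "id", "."], ["affirmed"]]
packed, consumed = pack_example(toy, first_len=4, max_len=10)
print(packed, consumed)
```

Callers would invoke the function repeatedly, each time passing the sentences that were not consumed by the previous call.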

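Pretraining

The model was trained on a SambaNova cluster, using 8 RDUs, for 1.7 million steps. A small learning rate of 5e-6 and a batch size of 128 were used to mitigate training instability, potentially due to the diversity of sources in the training data. The masked language modeling (MLM) objective without the NSP loss, as described in RoBERTa, was used for pretraining. The model was pretrained with a sequence length of 512 for all steps.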
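
To make the MLM objective concrete, here is a minimal sketch of BERT-style 80-10-10 corruption with a replication rate of 20. It is not the training code: the 15% selection probability is BERT's default and an assumption here, since the model card does not state it, and the names mask_tokens and replicate_with_masks are hypothetical. Each selected token becomes a prediction target; it is replaced by [MASK] 80% of the time, by a random vocabulary token 10% of the time, and left unchanged 10% of the time.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}  # never selected for prediction

Masked = Tuple[List[str], List[Optional[str]]]


def mask_tokens(
    tokens: List[str], vocab: List[str], rng: random.Random, mask_prob: float = 0.15
) -> Masked:
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    corrupted = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: mask
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: corrupt with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels


def replicate_with_masks(
    tokens: List[str], vocab: List[str], replication: int = 20, seed: int = 0
) -> List[Masked]:
    """Create `replication` differently masked copies of the same context."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, vocab, rng) for _ in range(replication)]


context = "an appeal is a request made after a trial [SEP]".split()
copies = replicate_with_masks(context, vocab=["court", "party", "motion"])
print(len(copies), copies[0][0])
```

Because no next-sentence-prediction loss is used, the labels above are the only training signal in this setup.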

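Two models were trained in parallel with the same setup but different random seeds. The lowest log likelihood model, pile-of-law/legalbert-large-1.7M-1 (referred to as PoL-BERT-Large), was selected for experiments, but the second model, pile-of-law/legalbert-large-1.7M-2, is also released.

Evaluation results

See the model card for pile-of-law/legalbert-large-1.7M-1 for fine-tuning results on the CaseHOLD variant provided by the LexGLUE paper.

BibTeX entry and citation info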

@misc{hendersonkrass2022pileoflaw,
  url = {https://arxiv.org/abs/2207.00220},
  author = {Henderson, Peter and Krass, Mark S. and Zheng, Lucia and Guha, Neel and Manning, Christopher D. and Jurafsky, Dan and Ho, Daniel E.},
  title = {Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset},
  publisher = {arXiv},
  year = {2022}
}