Legal-HeBERT
Legal-HeBERT is a BERT model tailored for the Hebrew legal and legislative domains. It aims to enhance legal NLP research and facilitate the development of Hebrew legal tools. We've released two versions of Legal-HeBERT. The first one is a fine-tuned variant of HeBERT applied to legal and legislative documents. The second version builds a BERT model from scratch, following the architectural guidelines of HeBERT.
We're continuously collecting legal data, exploring different architectural designs, and creating tagged datasets and legal tasks. These efforts are aimed at evaluating and advancing Hebrew legal tools.
Features
- Two versions of the model are available: a fine-tuned version of HeBERT and a model trained from scratch.
- Intended to boost legal NLP research and tool development in Hebrew.
Installation
No dedicated installation steps are given. The usage example relies on the Hugging Face transformers library.
Documentation
Training Data
Our training datasets are as follows:
| Dataset | Size (GB) | Documents | Sentences | Words | Notes |
| --- | --- | --- | --- | --- | --- |
| The Israeli Law Book | 0.05 | 2,338 | 293,352 | 4,851,063 | |
| Judgments of the Supreme Court | 0.7 | 212,348 | 5,790,138 | 79,672,415 | |
| Custody courts | 2.46 | 169,708 | 8,555,893 | 213,050,492 | |
| Law memoranda, drafts of secondary legislation and drafts of support tests that have been distributed to the public for comment | 0.4 | 3,291 | 294,752 | 7,218,960 | |
| Supervisors of Land Registration judgments | 0.02 | 559 | 67,639 | 1,785,446 | |
| Decisions of the Labor Court - Corona | 0.001 | 146 | 3,505 | 60,195 | |
| Decisions of the Israel Lands Council | not listed | 118 | 11,283 | 162,692 | Aggregate file |
| Judgments of the Disciplinary Tribunal and the Israel Police Appeals Tribunal | 0.02 | 54 | 83,724 | 1,743,419 | Aggregate files |
| Disciplinary Appeals Committee in the Ministry of Health | 0.004 | 252 | 21,010 | 429,807 | 465 scanned files could not be parsed |
| Attorney General's Positions | 0.008 | 281 | 32,724 | 813,877 | |
| Legal-Opinion of the Attorney General | 0.002 | 44 | 7,132 | 188,053 | |
| Total | 3.665 | 389,139 | 15,161,152 | 309,976,419 | |
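The Total row can be checked by summing the per-corpus figures. The following is a minimal Python sketch with the numbers copied from the training data listed above. The `Corpus` tuple, its field names, the shortened corpus labels, and `corpus_totals` are illustrative and not part of the project's code. The Israel Lands Council corpus has no size listed, so it is counted as 0 GB here, which is an assumption.

```python
from typing import NamedTuple


class Corpus(NamedTuple):
    name: str
    size_gb: float  # 0.0 where no size is listed (assumption)
    documents: int
    sentences: int
    words: int


# Figures copied from the training data listing; labels are shortened.
CORPORA = [
    Corpus("The Israeli Law Book", 0.05, 2_338, 293_352, 4_851_063),
    Corpus("Judgments of the Supreme Court", 0.7, 212_348, 5_790_138, 79_672_415),
    Corpus("Custody courts", 2.46, 169_708, 8_555_893, 213_050_492),
    Corpus("Law memoranda and drafts", 0.4, 3_291, 294_752, 7_218_960),
    Corpus("Supervisors of Land Registration", 0.02, 559, 67_639, 1_785_446),
    Corpus("Labor Court decisions - Corona", 0.001, 146, 3_505, 60_195),
    Corpus("Israel Lands Council decisions", 0.0, 118, 11_283, 162_692),
    Corpus("Disciplinary and Police Appeals Tribunals", 0.02, 54, 83_724, 1_743_419),
    Corpus("Disciplinary Appeals Committee (Health)", 0.004, 252, 21_010, 429_807),
    Corpus("Attorney General's Positions", 0.008, 281, 32_724, 813_877),
    Corpus("Legal-Opinion of the Attorney General", 0.002, 44, 7_132, 188_053),
]


def corpus_totals(corpora):
    """Sum size, documents, sentences, and words over all corpora."""
    return Corpus(
        name="Total",
        size_gb=round(sum(c.size_gb for c in corpora), 3),
        documents=sum(c.documents for c in corpora),
        sentences=sum(c.sentences for c in corpora),
        words=sum(c.words for c in corpora),
    )


total = corpus_totals(CORPORA)
print(total)
```

The sums match the reported totals: 3.665 GB, 389,139 documents, 15,161,152 sentences, and 309,976,419 words.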
We thank Yair Gardin for pointing us to the governance data, Elhanan Schwarts for collecting and parsing the Israeli Law Book, and Jonathan Schler for collecting the judgments of the Supreme Court.
Technical Details
Training process
- Vocabulary size: 50,000 tokens
- 4 epochs (about 1M steps)
- lr = 5e-5
- mlm_probability = 0.15
- batch size = 32 per GPU
- NVIDIA GeForce RTX 2080 Ti + NVIDIA GeForce RTX 3090 (1 week of training)
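As a rough consistency check on the figure of 4 epochs and about 1M steps, the step count can be estimated from the corpus size. The sketch below assumes one training example per sentence (15,161,152 sentences in the training data), both GPUs running data-parallel with 32 examples each, and no gradient accumulation. None of these assumptions is confirmed by the source, and `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples, per_gpu_batch, num_gpus, epochs):
    """Estimate total optimizer steps, assuming no gradient accumulation."""
    effective_batch = per_gpu_batch * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs


# Total sentence count reported for the training data.
TOTAL_SENTENCES = 15_161_152

# Assumption: one example per sentence, two GPUs, 32 examples per GPU.
steps = optimizer_steps(TOTAL_SENTENCES, per_gpu_batch=32, num_gpus=2, epochs=4)
print(f"{steps:,} steps")  # 947,572 steps
```

Under these assumptions the estimate lands close to one million steps, which is consistent with the reported figure.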
Additional training settings:
Usage Examples
Basic Usage
from transformers import AutoModel, AutoTokenizer, pipeline

model_name = 'avichr/Legal-heBERT_ft'  # HeBERT fine-tuned on legal text
model_name = 'avichr/Legal-heBERT'  # trained from scratch; keep only the line for the model you want

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

fill_mask = pipeline(
    "fill-mask",
    model=model_name,
)
fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר.")  # "Corona took the [MASK] and we have nothing left."
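For a single input with one `[MASK]`, the transformers fill-mask pipeline returns a list of dictionaries with the keys `score`, `token`, `token_str`, and `sequence`. The sketch below shows one way to reduce that list to the best candidates. `summarize_predictions` is a hypothetical helper, and the sample list uses made-up placeholder values shaped like the pipeline output, not real model predictions.

```python
def summarize_predictions(predictions, top_k=3):
    """Return (token, score) pairs for the top_k highest-scoring candidates."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in ranked[:top_k]]


# Placeholder data with invented values, shaped like fill-mask output.
sample_output = [
    {"score": 0.12, "token": 101, "token_str": "tokenA", "sequence": "..."},
    {"score": 0.50, "token": 102, "token_str": "tokenB", "sequence": "..."},
    {"score": 0.03, "token": 103, "token_str": "tokenC", "sequence": "..."},
]

for token, score in summarize_predictions(sample_output, top_k=2):
    print(token, score)
```

In practice the list returned by the `fill_mask(...)` call would be passed in place of `sample_output`.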
License
No license information is provided.
Stay tuned!
We're still working on our models and datasets. We'll update this page as we make progress. We're open to collaborations.
Citation
If you use this model, please cite us as follows:
Chriqui, Avihay, Yahav, Inbal and Bar-Siman-Tov, Ittai, Legal HeBERT: A BERT-based NLP Model for Hebrew Legal, Judicial and Legislative Texts (June 27, 2022). Available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4147127
@article{chriqui2021hebert,
title={Legal HeBERT: A BERT-based NLP Model for Hebrew Legal, Judicial and Legislative Texts},
author={Chriqui, Avihay and Yahav, Inbal and Bar-Siman-Tov, Ittai},
journal={SSRN preprint:4147127},
year={2022}
}
Contact us
Avichay Chriqui, The Coller AI Lab
Inbal Yahav, The Coller AI Lab
Ittai Bar-Siman-Tov, the BIU Innovation Lab for Law, Data-Science and Digital Ethics
Thank you, תודה, شكرا