Legal-HeBERT
Legal-HeBERT is a BERT model tailored for the Hebrew legal and legislative domains. It aims to enhance legal NLP research and tool development in Hebrew.
Quick Start
Legal-HeBERT comes in two versions. The first is HeBERT fine-tuned on legal and legislative documents. The second is a BERT model trained from scratch following HeBERT's architecture guidelines. We are continuously collecting legal data, exploring different architectural designs, and building tagged datasets and legal tasks for evaluating and developing Hebrew legal tools.
Features
- Domain-specific: designed specifically for the Hebrew legal and legislative domains.
- Two versions: a fine-tuned version and a model trained from scratch.
Training Data
Our training datasets are as follows:

| Dataset | Size (GB) | Documents | Sentences | Words | Notes |
|---|---|---|---|---|---|
| The Israeli Law Book | 0.05 | 2,338 | 293,352 | 4,851,063 | |
| Judgments of the Supreme Court | 0.7 | 212,348 | 5,790,138 | 79,672,415 | |
| Custody courts | 2.46 | 169,708 | 8,555,893 | 213,050,492 | |
| Law memoranda, drafts of secondary legislation and drafts of support tests that have been distributed to the public for comment | 0.4 | 3,291 | 294,752 | 7,218,960 | |
| Supervisors of Land Registration judgments | 0.02 | 559 | 67,639 | 1,785,446 | |
| Decisions of the Labor Court - Corona | 0.001 | 146 | 3,505 | 60,195 | |
| Decisions of the Israel Lands Council | not listed | 118 | 11,283 | 162,692 | Aggregate file |
| Judgments of the Disciplinary Tribunal and the Israel Police Appeals Tribunal | 0.02 | 54 | 83,724 | 1,743,419 | Aggregate files |
| Disciplinary Appeals Committee in the Ministry of Health | 0.004 | 252 | 21,010 | 429,807 | 465 scanned files could not be parsed |
| Attorney General's Positions | 0.008 | 281 | 32,724 | 813,877 | |
| Legal Opinion of the Attorney General | 0.002 | 44 | 7,132 | 188,053 | |
| Total | 3.665 | 389,139 | 15,161,152 | 309,976,419 | |
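As a quick sanity check, the reported totals can be recomputed from the per-dataset figures. The sketch below is only an illustration: the shortened dataset labels and the `rows` structure are ours, and the one dataset with no size listed is skipped when summing sizes.

```python
# (label, size_gb, documents, sentences, words); labels are shortened for readability.
# size_gb is None for the one dataset whose size is not listed.
rows = [
    ("Israeli Law Book", 0.05, 2338, 293352, 4851063),
    ("Supreme Court judgments", 0.7, 212348, 5790138, 79672415),
    ("Custody courts", 2.46, 169708, 8555893, 213050492),
    ("Law memoranda and drafts", 0.4, 3291, 294752, 7218960),
    ("Land Registration judgments", 0.02, 559, 67639, 1785446),
    ("Labor Court - Corona", 0.001, 146, 3505, 60195),
    ("Israel Lands Council", None, 118, 11283, 162692),
    ("Disciplinary / Police Appeals Tribunals", 0.02, 54, 83724, 1743419),
    ("Ministry of Health appeals committee", 0.004, 252, 21010, 429807),
    ("Attorney General's Positions", 0.008, 281, 32724, 813877),
    ("Attorney General legal opinions", 0.002, 44, 7132, 188053),
]

totals = {
    # Round to the precision used in the table to absorb float error.
    "size_gb": round(sum(r[1] for r in rows if r[1] is not None), 3),
    "documents": sum(r[2] for r in rows),
    "sentences": sum(r[3] for r in rows),
    "words": sum(r[4] for r in rows),
}
print(totals)
```

The recomputed values agree with the Total figures reported above (3.665 GB, 389,139 documents, 15,161,152 sentences, 309,976,419 words).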
We thank Yair Gardin for referring us to the governance data, Elhanan Schwarts for collecting and parsing the Israeli Law Book, and Jonathan Schler for collecting the judgments of the Supreme Court.
Training Process
- Vocabulary size: 50,000 tokens
- Epochs: 4 (approximately 1M steps)
- Learning rate: lr = 5e-5
- MLM probability: mlm_probability = 0.15
- Batch size: 32 per GPU
- Hardware: NVIDIA GeForce RTX 2080 Ti + NVIDIA GeForce RTX 3090 (1 week of training)
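The "approximately 1M steps" figure can be roughly reproduced from the other settings. The sketch below is a back-of-the-envelope estimate, not the authors' training script. It assumes (this is not stated in the source) that both GPUs ran in data-parallel and that each sentence of the corpus counts as one training example.

```python
import math

# Figures taken from this page
total_sentences = 15_161_152  # total sentences in the training data
per_gpu_batch = 32
epochs = 4

# Assumptions, not stated in the source:
# two GPUs in data-parallel, one sentence per training example.
num_gpus = 2

effective_batch = per_gpu_batch * num_gpus
steps_per_epoch = math.ceil(total_sentences / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch)  # 64
print(steps_per_epoch)  # 236893
print(total_steps)      # 947572
```

Under these assumptions the estimate lands at about 0.95M optimizer steps, which is consistent with the stated figure.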
Additional training settings:
Usage Examples
Basic Usage
The models are available on the Hugging Face Hub and can be fine-tuned for any downstream task:
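from transformers import AutoModel, AutoTokenizer, pipeline

# Pick one checkpoint:
# model_name = 'avichr/Legal-heBERT_ft'  # HeBERT fine-tuned on legal text
model_name = 'avichr/Legal-heBERT'  # BERT trained from scratch on legal text

# Load the tokenizer and the encoder for downstream fine-tuning
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Masked-token prediction with the same checkpoint
fill_mask = pipeline(
    "fill-mask",
    model=model_name,
)
# Hebrew input: "The coronavirus took the [MASK] and we have nothing left."
fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר.")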
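For an input with a single `[MASK]`, the `fill-mask` pipeline returns a list of dictionaries with the keys `score`, `token`, `token_str` and `sequence`. The helper below is a hypothetical convenience function (`top_predictions` is our name, not part of `transformers`) for ranking those candidates. The sample data is a placeholder in the right shape, not real model output.

```python
from typing import Dict, List, Tuple


def top_predictions(results: List[Dict], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (token_str, score) pairs from fill-mask output."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], round(r["score"], 4)) for r in ranked[:k]]


# Placeholder records shaped like fill-mask output; NOT real predictions.
sample = [
    {"score": 0.12, "token": 101, "token_str": "tok_b", "sequence": "..."},
    {"score": 0.40, "token": 102, "token_str": "tok_a", "sequence": "..."},
    {"score": 0.05, "token": 103, "token_str": "tok_c", "sequence": "..."},
]

print(top_predictions(sample, k=2))
```

In practice the list returned by the `fill_mask(...)` call would be passed in place of `sample`.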
Stay tuned!
We are still working on the models and the datasets, and we will update this page as we make progress. We are open to collaborations.
Citation
If you use this model, please cite us as follows:
Chriqui, Avihay, Yahav, Inbal and Bar-Siman-Tov, Ittai, Legal HeBERT: A BERT-based NLP Model for Hebrew Legal, Judicial and Legislative Texts (June 27, 2022). Available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4147127
@article{chriqui2021hebert,
  title={Legal HeBERT: A BERT-based NLP Model for Hebrew Legal, Judicial and Legislative Texts},
  author={Chriqui, Avihay and Yahav, Inbal and Bar-Siman-Tov, Ittai},
  journal={SSRN preprint:4147127},
  year={2022}
}
Contact us
- Avichay Chriqui, The Coller AI Lab
- Inbal Yahav, The Coller AI Lab
- [Ittai Bar-Siman-Tov](mailto:Ittai.Bar-Siman-Tov@biu.ac.il), the BIU Innovation Lab for Law, Data-Science and Digital Ethics
Thank you, תודה, شكرا