# Open Australian Legal LLM
The Open Australian Legal LLM is the largest open-source language model trained on Australian law. It can be used for a variety of natural language processing tasks in the Australian legal domain, such as text generation, text completion and question answering.
## Quick Start
The Open Australian Legal LLM is well suited to fine-tuning on a wide range of downstream natural language processing tasks in the Australian legal domain. To get started, you can generate text with the model through the `transformers` pipeline API:
```python
>>> from transformers import pipeline, set_seed

>>> set_seed(42)
>>> generator = pipeline('text-generation', model='umarbutler/open-australian-legal-llm')
>>> response = generator('Section 51 of the Constitution provides', max_length=55)
>>> print(response[0]['generated_text'])
```
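The model can also be loaded directly with the `transformers` auto classes, which is the usual starting point for fine-tuning. A minimal sketch (the training data and training loop are omitted and would be supplied by you):

```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> # Load the model and its tokenizer for fine-tuning or lower-level use.
>>> tokenizer = AutoTokenizer.from_pretrained('umarbutler/open-australian-legal-llm')
>>> model = AutoModelForCausalLM.from_pretrained('umarbutler/open-australian-legal-llm')
```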
## Features
- Large-scale: with over 1.5 billion parameters, the model has a large capacity for handling complex legal language.
- Rich training data: trained on roughly 70,000 laws, regulations and decisions from six Australian jurisdictions in the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus).
- Versatile: suitable for a range of natural language processing tasks in the Australian legal domain, including text generation, text completion and question answering.
- Accessible: issued under the [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0.html) for wide accessibility.
## Installation
The model requires no installation of its own; it only needs the Hugging Face `transformers` library and a backend such as PyTorch, which can be installed with `pip install torch transformers`.
## Documentation
### Creation
The model was created with the following steps:
- Data cleaning: applied cleaning procedures to the 218,340 laws, regulations and decisions in version 4.2.0 of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus); after cleaning and removing short and duplicate texts, 218,207 documents remained.
- Tokenizer training: pretrained a [GPT2](https://huggingface.co/gpt2-xl)-like tokenizer on the cleaned documents.
- Data splitting: split the documents into 512-token-long blocks (the tokenizer-training and block-splitting steps are sketched in code after this list).
- Model training: used [GPT2-XL](https://huggingface.co/gpt2-xl) as the base model and trained it for the first 100,290 steps with the following hyperparameters (a matching `Trainer` configuration is sketched after this list):
| Hyperparameter | Value |
| --- | --- |
| Sequence length | 512 |
| Epochs | 1 |
| Optimiser | AdamW |
| Learning rate | 1e-4 |
| Learning rate scheduler | Linear with warmup |
| Batch size | 6 |
| Weight decay | 0.01 |
| Warmup ratio | 0.06 |
- Resumed training: after a crash at around 120,050 steps, training was resumed on a new instance for 133,711 steps with adjusted hyperparameters:
| Hyperparameter | Value |
| --- | --- |
| Sequence length | 512 |
| Epochs | 1 |
| Optimiser | AdamW |
| Learning rate | 4.255e-5 |
| Learning rate scheduler | Linear |
| Batch size | 3 |
| Weight decay | 0.01 |
| Warmup ratio | 0.00 |
- Result: achieved a validation loss of 2.04 after one epoch of training.
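The card does not include the preprocessing code, but the tokenizer-training and block-splitting steps above map naturally onto the `transformers` and `datasets` APIs. The following is a minimal sketch only: the split and column names of the corpus, the variable names and the vocabulary size are assumptions, not taken from the author's actual pipeline.

```python
from itertools import chain

from datasets import load_dataset
from transformers import AutoTokenizer

BLOCK_SIZE = 512  # Matches the 512-token-long blocks described above.

# Load the corpus (the split and column names here are assumptions).
corpus = load_dataset('umarbutler/open-australian-legal-corpus', split='corpus')

# Pretrain a GPT2-like tokenizer on the (already cleaned) documents.
base = AutoTokenizer.from_pretrained('gpt2-xl')
tokenizer = base.train_new_from_iterator(
    (doc['text'] for doc in corpus),
    vocab_size=base.vocab_size,
)

def to_blocks(batch: dict) -> dict:
    """Concatenate tokenised documents and split them into 512-token blocks."""
    ids = list(chain.from_iterable(tokenizer(batch['text'])['input_ids']))
    blocks = [ids[i:i + BLOCK_SIZE] for i in range(0, len(ids), BLOCK_SIZE)]
    # Keep only full-length blocks.
    return {'input_ids': [block for block in blocks if len(block) == BLOCK_SIZE]}

blocks = corpus.map(to_blocks, batched=True, remove_columns=corpus.column_names)
```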
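For reference, the first stage's hyperparameters correspond to a `transformers` `Trainer` configuration along the following lines. Again, this is a sketch under assumptions: the output directory, pad-token choice, collator setup and embedding resize are illustrative, not the author's actual training script.

```python
from transformers import (AutoModelForCausalLM, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# Start from GPT2-XL and align its embeddings with the new tokenizer.
model = AutoModelForCausalLM.from_pretrained('gpt2-xl')
model.resize_token_embeddings(len(tokenizer))
tokenizer.pad_token = tokenizer.eos_token  # GPT2 tokenizers have no pad token by default.

# Mirrors the first hyperparameter table above.
training_args = TrainingArguments(
    output_dir='checkpoints',          # Illustrative path.
    num_train_epochs=1,
    per_device_train_batch_size=6,
    learning_rate=1e-4,
    lr_scheduler_type='linear',
    warmup_ratio=0.06,                 # Linear schedule with warmup.
    weight_decay=0.01,
    optim='adamw_torch',               # AdamW optimiser.
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=blocks,              # The 512-token blocks from the sketch above.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```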
### Benchmarks
Tested against version 2.0.0 of the [Open Australian Legal QA](https://huggingface.co/datasets/umarbutler/open-australian-legal-qa) dataset, the model achieved a perplexity of 8.01, outperforming the other known language models for Australian law:

| Model | Parameters | Perplexity |
| --- | --- | --- |
| Open Australian Legal LLM | 1.5B | 8.01 |
| [Open Australian Legal Phi 1.5](https://huggingface.co/umarbutler/open-australian-legal-phi-1_5) | 1.3B | 8.69 |
| [Open Australian Legal GPT2](https://huggingface.co/umarbutler/open-australian-legal-gpt2) | 124M | 16.37 |
| [Open Australian Legal DistilGPT2](https://huggingface.co/umarbutler/open-australian-legal-distilgpt2) | 88.2M | 23.9 |
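Perplexity here corresponds to the exponential of the model's mean cross-entropy loss over the benchmark text. A minimal sketch of such an evaluation follows, with the caveats that the dataset's field and split names are assumptions and that the unweighted per-example average is a simplification (a token-weighted average would be a refinement):

```python
import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('umarbutler/open-australian-legal-llm')
model = AutoModelForCausalLM.from_pretrained('umarbutler/open-australian-legal-llm')
model.eval()

# The split and field names here are assumptions, not the benchmark's exact schema.
dataset = load_dataset('umarbutler/open-australian-legal-qa', split='train')

losses = []
with torch.no_grad():
    for example in dataset:
        ids = tokenizer(example['text'], return_tensors='pt',
                        truncation=True, max_length=512).input_ids
        # With labels=input_ids, the model returns the mean cross-entropy loss.
        losses.append(model(ids, labels=ids).loss.item())

# Unweighted average across examples, exponentiated to give perplexity.
print(f'Perplexity: {math.exp(sum(losses) / len(losses)):.2f}')
```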
### Limitations
- Bias: the model can be expected to exhibit biases similar to those of [GPT2-XL](https://huggingface.co/gpt2-xl).
- Language bias: it may be biased towards the language of laws, regulations and decisions, as well as towards Commonwealth and New South Wales law.
- Knowledge limitation: it may lack knowledge of Victorian, Northern Territory and Australian Capital Territory law due to licensing restrictions on the training data.
## Licence
The model is issued under the [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0.html) to ensure wide accessibility.
## Citation
If you've relied on the model for your work, please cite:
```bibtex
@misc{butler-2023-open-australian-legal-llm,
    author = {Butler, Umar},
    year = {2023},
    title = {Open Australian Legal LLM},
    publisher = {Hugging Face},
    version = {1.0.0},
    url = {https://huggingface.co/umarbutler/open-australian-legal-llm}
}
```
## Acknowledgements
- The author acknowledges the Traditional Custodians of Country throughout Australia and pays respect to their Elders past and present.
- He thanks the sources of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) for providing their data under open licences.
- He acknowledges the developers of the Python libraries used in training, as well as the makers of [GPT2](https://huggingface.co/gpt2-xl).
- Finally, he is grateful for the support of his wife.