# Open Australian Legal LLM
The Open Australian Legal LLM is the largest open-source language model trained on Australian law. It can be used for a variety of natural language processing tasks in the Australian legal domain, such as text generation, text completion and question answering.
## Quick Start
The Open Australian Legal LLM is well suited to fine-tuning on a wide range of downstream natural language processing tasks in the Australian legal domain. To get started, you can generate text with the model through the `transformers` pipeline API:
```python
>>> from transformers import pipeline, set_seed

>>> set_seed(42)
>>> generator = pipeline('text-generation', model='umarbutler/open-australian-legal-llm')
>>> response = generator('Section 51 of the Constitution provides', max_length=55)
>>> print(response[0]['generated_text'])
```
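The model can also be loaded directly with the `transformers` auto classes, which is the usual starting point for fine-tuning. A minimal sketch (the training data and training loop are omitted and would be supplied by you):

```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> # Load the model and its tokenizer for fine-tuning or lower-level use.
>>> tokenizer = AutoTokenizer.from_pretrained('umarbutler/open-australian-legal-llm')
>>> model = AutoModelForCausalLM.from_pretrained('umarbutler/open-australian-legal-llm')
```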
## Features
- Large-scale: with over 1.5 billion parameters, the model has a large capacity for handling complex legal language.
- Rich training data: trained on roughly 70,000 laws, regulations and decisions from six Australian jurisdictions in the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus).
- Versatile: suitable for a range of natural language processing tasks in the Australian legal domain, including text generation, text completion and question answering.
- Accessible: issued under the [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0.html) for wide accessibility.
## Installation
The model requires no installation of its own; it only needs the Hugging Face `transformers` library and a backend such as PyTorch, which can be installed with `pip install torch transformers`.
## Documentation
### Creation
The model was created with the following steps:
- Data cleaning: applied cleaning procedures to the 218,340 laws, regulations and decisions in version 4.2.0 of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus); after cleaning and removing short and duplicate texts, 218,207 documents remained.
- Tokenizer training: pretrained a [GPT2](https://huggingface.co/gpt2-xl)-like tokenizer on the cleaned documents.
- Data splitting: split the documents into 512-token-long blocks (the tokenizer-training and block-splitting steps are sketched in code after this list).
- Model training: used [GPT2-XL](https://huggingface.co/gpt2-xl) as the base model and trained it for the first 100,290 steps with the following hyperparameters (a matching `Trainer` configuration is sketched after this list):
| Hyperparameter | Value |
| --- | --- |
| Sequence length | 512 |
| Epochs | 1 |
| Optimiser | AdamW |
| Learning rate | 1e-4 |
| Learning rate scheduler | Linear with warmup |
| Batch size | 6 |
| Weight decay | 0.01 |
| Warmup ratio | 0.06 |
- Resumed training: after a crash at around 120,050 steps, training was resumed on a new instance for 133,711 steps with adjusted hyperparameters:
| Hyperparameter | Value |
| --- | --- |
| Sequence length | 512 |
| Epochs | 1 |
| Optimiser | AdamW |
| Learning rate | 4.255e-5 |
| Learning rate scheduler | Linear |
| Batch size | 3 |
| Weight decay | 0.01 |
| Warmup ratio | 0.00 |
- Result: achieved a validation loss of 2.04 after one epoch of training.
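The card does not include the preprocessing code, but the tokenizer-training and block-splitting steps above map naturally onto the `transformers` and `datasets` APIs. The following is a minimal sketch only: the split and column names of the corpus, the variable names and the vocabulary size are assumptions, not taken from the author's actual pipeline.

```python
from itertools import chain

from datasets import load_dataset
from transformers import AutoTokenizer

BLOCK_SIZE = 512  # Matches the 512-token-long blocks described above.

# Load the corpus (the split and column names here are assumptions).
corpus = load_dataset('umarbutler/open-australian-legal-corpus', split='corpus')

# Pretrain a GPT2-like tokenizer on the (already cleaned) documents.
base = AutoTokenizer.from_pretrained('gpt2-xl')
tokenizer = base.train_new_from_iterator(
    (doc['text'] for doc in corpus),
    vocab_size=base.vocab_size,
)

def to_blocks(batch: dict) -> dict:
    """Concatenate tokenised documents and split them into 512-token blocks."""
    ids = list(chain.from_iterable(tokenizer(batch['text'])['input_ids']))
    blocks = [ids[i:i + BLOCK_SIZE] for i in range(0, len(ids), BLOCK_SIZE)]
    # Keep only full-length blocks.
    return {'input_ids': [block for block in blocks if len(block) == BLOCK_SIZE]}

blocks = corpus.map(to_blocks, batched=True, remove_columns=corpus.column_names)
```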
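For reference, the first stage's hyperparameters correspond to a `transformers` `Trainer` configuration along the following lines. Again, this is a sketch under assumptions: the output directory, pad-token choice, collator setup and embedding resize are illustrative, not the author's actual training script.

```python
from transformers import (AutoModelForCausalLM, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# Start from GPT2-XL and align its embeddings with the new tokenizer.
model = AutoModelForCausalLM.from_pretrained('gpt2-xl')
model.resize_token_embeddings(len(tokenizer))
tokenizer.pad_token = tokenizer.eos_token  # GPT2 tokenizers have no pad token by default.

# Mirrors the first hyperparameter table above.
training_args = TrainingArguments(
    output_dir='checkpoints',          # Illustrative path.
    num_train_epochs=1,
    per_device_train_batch_size=6,
    learning_rate=1e-4,
    lr_scheduler_type='linear',
    warmup_ratio=0.06,                 # Linear schedule with warmup.
    weight_decay=0.01,
    optim='adamw_torch',               # AdamW optimiser.
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=blocks,              # The 512-token blocks from the sketch above.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```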
### Benchmarks
Tested against version 2.0.0 of the [Open Australian Legal QA](https://huggingface.co/datasets/umarbutler/open-australian-legal-qa) dataset, the model achieved a perplexity of 8.01, outperforming the other known language models for Australian law:

| Model | Parameters | Perplexity |
| --- | --- | --- |
| Open Australian Legal LLM | 1.5B | 8.01 |
| [Open Australian Legal Phi 1.5](https://huggingface.co/umarbutler/open-australian-legal-phi-1_5) | 1.3B | 8.69 |
| [Open Australian Legal GPT2](https://huggingface.co/umarbutler/open-australian-legal-gpt2) | 124M | 16.37 |
| [Open Australian Legal DistilGPT2](https://huggingface.co/umarbutler/open-australian-legal-distilgpt2) | 88.2M | 23.9 |
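Perplexity here corresponds to the exponential of the model's mean cross-entropy loss over the benchmark text. A minimal sketch of such an evaluation follows, with the caveats that the dataset's field and split names are assumptions and that the unweighted per-example average is a simplification (a token-weighted average would be a refinement):

```python
import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('umarbutler/open-australian-legal-llm')
model = AutoModelForCausalLM.from_pretrained('umarbutler/open-australian-legal-llm')
model.eval()

# The split and field names here are assumptions, not the benchmark's exact schema.
dataset = load_dataset('umarbutler/open-australian-legal-qa', split='train')

losses = []
with torch.no_grad():
    for example in dataset:
        ids = tokenizer(example['text'], return_tensors='pt',
                        truncation=True, max_length=512).input_ids
        # With labels=input_ids, the model returns the mean cross-entropy loss.
        losses.append(model(ids, labels=ids).loss.item())

# Unweighted average across examples, exponentiated to give perplexity.
print(f'Perplexity: {math.exp(sum(losses) / len(losses)):.2f}')
```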
### Limitations
- Bias: the model can be expected to exhibit biases similar to those of [GPT2-XL](https://huggingface.co/gpt2-xl).
- Language bias: it may be biased towards the language of laws, regulations and decisions, as well as towards Commonwealth and New South Wales law.
- Knowledge limitation: it may lack knowledge of Victorian, Northern Territory and Australian Capital Territory law due to licensing restrictions on the training data.
## Licence
The model is issued under the [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0.html) to ensure wide accessibility.
## Citation
If you've relied on the model for your work, please cite:
```bibtex
@misc{butler-2023-open-australian-legal-llm,
    author = {Butler, Umar},
    year = {2023},
    title = {Open Australian Legal LLM},
    publisher = {Hugging Face},
    version = {1.0.0},
    url = {https://huggingface.co/umarbutler/open-australian-legal-llm}
}
```
## Acknowledgements
- The author acknowledges the Traditional Custodians of Country throughout Australia and pays respect to their Elders past and present.
- He thanks the sources of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) for providing their data under open licences.
- He acknowledges the developers of the Python libraries used in training, as well as the makers of [GPT2](https://huggingface.co/gpt2-xl).
- Finally, he is grateful for the support of his wife.