InLegalBERT Open-Source Legal AI Model - Free for Natural Language Processing Tasks in the Indian Legal Domain

InLegalBERT

Developed by law-ai

InLegalBERT is a Transformer model pre-trained on Indian legal texts, specializing in natural language processing tasks for the legal domain.

Large Language Model

Transformers

English

Open Source License: MIT

#Indian Legal Texts #Legal Provision Recognition #Court Judgment Prediction

Downloads 753.50k

Release Time : 9/11/2022

Model Overview

This model is a BERT model further pre-trained on Indian legal texts, specifically optimized for the Indian legal context, suitable for tasks such as legal text analysis, classification, and prediction.

Model Features

Optimized for Indian Legal Domain

Trained on 5.4 million Indian legal documents, making it particularly suitable for processing Indian legal texts.

Improved from LegalBERT

Initialized from the LEGAL-BERT-SC model and further trained for 300,000 steps on Indian legal data. The authors report that it outperforms LegalBERT on the three Indian legal tasks they evaluated.

Multi-task Support

Pre-trained on the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives.

Model Capabilities

Legal text classification

Legal provision recognition

Legal document semantic segmentation

Court judgment prediction

Legal text embedding generation

Use Cases

Legal Text Analysis

Legal Provision Recognition

Identify relevant legal provisions based on court case facts.

Reported to outperform LegalBERT and the other baselines tested on the ILSI dataset.

Document Semantic Segmentation

Segment legal documents into functional parts such as facts and arguments.

Reported to outperform LegalBERT and the other baselines tested on the ISS dataset.

Court Judgment Prediction

Predict whether a court case's claim will be accepted or dismissed.

Reported to outperform LegalBERT and the other baselines tested on the ILDC dataset.

🚀 InLegalBERT

Model and tokenizer files for the InLegalBERT model, which is pre-trained on Indian legal text.

🚀 Quick Start

InLegalBERT is a pre-trained model for the legal domain, specifically trained on Indian legal text. It can be used for various legal NLP tasks such as legal statute identification, semantic segmentation, and court judgment prediction.

✨ Features

Based on Indian Legal Text: Trained on a large corpus of Indian legal documents from 1950 to 2019, covering all legal domains.
Fine-tuned for Legal Tasks: Performs well on multiple legal tasks including legal statute identification, semantic segmentation, and court judgment prediction.
Comparison Advantage: Outperforms LegalBERT and other baselines on these tasks.

📦 Installation

The original README gives no separate installation steps. The model and tokenizer load through the Hugging Face transformers library with the following Python code:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("law-ai/InLegalBERT")
model = AutoModel.from_pretrained("law-ai/InLegalBERT")

💻 Usage Examples

Basic Usage

Using the model to get embeddings/representations for a piece of text

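from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the pre-trained encoder from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("law-ai/InLegalBERT")
model = AutoModel.from_pretrained("law-ai/InLegalBERT")

text = "Replace this string with yours"
# Tokenize into PyTorch tensors (input IDs, attention mask, and so on)
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)
# One vector per token: shape (1, sequence_length, hidden_size)
last_hidden_state = output.last_hidden_state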
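The snippet above yields one vector per token. A single vector per text is often more convenient for tasks such as document matching, and one common way to get it is to average the token vectors while ignoring padding. The sketch below assumes that mean pooling is an acceptable choice; the original README does not prescribe a pooling strategy. The helper names `mean_pool` and `embed_texts` and the 512-token truncation limit are illustrative assumptions.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padding positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden dimension
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # guard against an all-padding row
    return summed / counts


def embed_texts(texts, model_name="law-ai/InLegalBERT"):
    """Return one pooled vector per input text (downloads the model on first use)."""
    # Imported lazily so that mean_pool can be used without transformers installed
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout so repeated calls give the same embeddings
    # max_length=512 is an assumption based on the usual BERT position limit
    encoded = tokenizer(
        texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
    )
    with torch.no_grad():  # inference only, no gradients needed
        output = model(**encoded)
    return mean_pool(output.last_hidden_state, encoded["attention_mask"])
```

Calling `embed_texts(["first text", "second text"])` would return a tensor with one row per text and one column per hidden dimension. Taking the `[CLS]` vector (`last_hidden_state[:, 0]`) is a common alternative to mean pooling.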

📚 Documentation

Training Data

For building the pre-training corpus of Indian legal text, we collected a large corpus of case documents from the Indian Supreme Court and many High Courts of India. The court cases in our dataset range from 1950 to 2019, and belong to all legal domains, such as Civil, Criminal, Constitutional, and so on. In total, our dataset contains around 5.4 million Indian legal documents (all in the English language). The raw text corpus size is around 27 GB.

Training Setup

This model is initialized with the [LEGAL-BERT-SC model](https://huggingface.co/nlpaueb/legal-bert-base-uncased) from the paper [LEGAL-BERT: The Muppets straight out of Law School](https://aclanthology.org/2020.findings-emnlp.261/). In our work, we refer to this model as LegalBERT, and our re-trained model as InLegalBERT. We further train this model on our data for 300K steps on the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks.
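To make the MLM objective concrete, here is a minimal sketch of how training inputs are typically corrupted. It follows the recipe from the original BERT paper: about 15% of tokens are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. The README does not state which masking rates were used for InLegalBERT, so these numbers are assumptions, and the function name `mask_for_mlm` and the toy vocabulary are hypothetical.

```python
import random


def mask_for_mlm(tokens, vocab, rng, select_prob=0.15, mask_token="[MASK]"):
    """Return (corrupted_tokens, labels); a label is None where no prediction is required."""
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            # Token not selected: keep it and do not score it
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(mask_token)         # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(token)              # 10%: leave unchanged
    return corrupted, labels


sentence = "the appellant filed a writ petition before the high court".split()
toy_vocab = sorted(set(sentence))  # hypothetical stand-in for the real vocabulary
corrupted, labels = mask_for_mlm(sentence, toy_vocab, random.Random(0))
print(corrupted)
print(labels)
```

The loss is computed only at positions where the label is not `None`. NSP is a separate binary objective over sentence pairs and is not covered by this sketch.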

Model Overview

This model uses the same tokenizer as [LegalBERT](https://huggingface.co/nlpaueb/legal-bert-base-uncased). This model has the same configuration as the [bert-base-uncased model](https://huggingface.co/bert-base-uncased): 12 hidden layers, 768 hidden dimensionality, 12 attention heads, ~110M parameters.
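The ~110M figure can be checked with a short calculation from the configuration. The layer count and hidden size come from the text above. The vocabulary size (30,522), feed-forward size (3,072), and maximum position count (512) are assumed to be the standard bert-base-uncased values and are not stated in the README. The helper `bert_param_count` is illustrative.

```python
def bert_param_count(vocab_size, hidden, layers, intermediate, max_positions, type_vocab=2):
    """Count weights and biases of a BERT encoder with pooler (no task-specific head)."""
    layer_norm = 2 * hidden  # scale and bias
    # Token, position, and segment embeddings, followed by one LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, followed by one LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> intermediate -> hidden), followed by one LayerNorm
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_param_count(
    vocab_size=30522,    # assumption: standard bert-base-uncased vocabulary size
    hidden=768,
    layers=12,
    intermediate=3072,   # assumption: standard 4x hidden feed-forward size
    max_positions=512,   # assumption: standard BERT position limit
)
print(f"{total:,}")  # 109,482,240
```

The number of attention heads does not appear in the formula because the heads split the 768-dimensional projections rather than adding parameters.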

Fine-tuning Results

We have fine-tuned all pre-trained models on 3 legal tasks with Indian datasets:

Legal Statute Identification (ILSI Dataset) [Multi-label Text Classification]: Identifying relevant statutes (law articles) based on the facts of a court case
Semantic Segmentation (ISS Dataset) [Sentence Tagging]: Segmenting the document into 7 functional parts (semantic segments) such as Facts, Arguments, etc.
Court Judgment Prediction (ILDC Dataset) [Binary Text Classification]: Predicting whether the claims/petitions of a court case will be accepted/rejected

InLegalBERT beats LegalBERT as well as all other baselines/variants we have used, across all three tasks. For details, see our paper.
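The multi-label and binary task formats listed above differ mainly in how a classifier's raw scores (logits) become predictions. The sketch below illustrates that difference under stated assumptions: the logits, the statute names, the 0.5 threshold, and both helper functions are hypothetical and are not taken from the authors' fine-tuning code.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode_multi_label(logits, labels, threshold=0.5):
    """Multi-label: each label is scored independently, so several can apply at once."""
    return [label for label, logit in zip(labels, logits) if sigmoid(logit) >= threshold]


def decode_binary(logits, labels=("rejected", "accepted")):
    """Binary: exactly one of two labels is chosen, the one with the larger logit."""
    return labels[0] if logits[0] >= logits[1] else labels[1]


statutes = ["Statute A", "Statute B", "Statute C"]  # hypothetical label set
print(decode_multi_label([2.1, -0.7, 0.3], statutes))  # ['Statute A', 'Statute C']
print(decode_binary([-1.2, 0.8]))                      # accepted
```

Sentence tagging for semantic segmentation works like the single-choice case, applied once per sentence with 7 candidate labels instead of 2.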

About Us

We are a group of researchers from the Department of Computer Science and Technology, Indian Institute of Technology, Kharagpur. Our research interests are primarily ML and NLP applications for the legal domain, with a special focus on the challenges and opportunities for the Indian legal scenario. We have worked on, and are currently working on, several legal tasks such as:

named entity recognition, summarization of legal documents
semantic segmentation of legal documents
legal statute identification from facts, court judgment prediction
legal document matching

You can find our publicly available code and datasets [here](https://github.com/Law-AI).

Citation

@inproceedings{paul-2022-pretraining,
  url = {https://arxiv.org/abs/2209.06049},
  author = {Paul, Shounak and Mandal, Arpan and Goyal, Pawan and Ghosh, Saptarshi},
  title = {Pre-trained Language Models for the Legal Domain: A Case Study on Indian Law},
  booktitle = {Proceedings of 19th International Conference on Artificial Intelligence and Law - ICAIL 2023},
  year = {2023},
}

📄 License

This project is licensed under the MIT license.
