# 📝 Longformer Encoder-Decoder (LED) fine-tuned on ILC
This project provides a version of the `led-base-16384` model fine-tuned on the ILC dataset. It targets long-document summarization, leveraging the Longformer Encoder-Decoder architecture.
## 🚀 Quick Start

### Prerequisites

Make sure you have the necessary libraries installed (see Installation below), then set up the model and tokenizer:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Run on GPU when available
device = "cuda" if torch.cuda.is_available() else "cpu"

checkpoint = "d0r1h/led-base-ilc"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, return_dict_in_generate=True).to(device)
```
### Basic Usage

Here is a basic example of using the model for summarization:
```python
case = "......."  # the document text to summarize

input_ids = tokenizer(case, return_tensors="pt").input_ids.to(device)

# LED expects a global attention mask; global attention on the first
# token is sufficient for summarization
global_attention_mask = torch.zeros_like(input_ids)
global_attention_mask[:, 0] = 1

sequences = model.generate(input_ids,
                           global_attention_mask=global_attention_mask).sequences
summary = tokenizer.batch_decode(sequences, skip_special_tokens=True)
print(summary)
```
## ✨ Features
- **Fine-tuned on ILC**: The model is fine-tuned on the ILC dataset, making it well suited to summarizing documents from that domain.
- **Long-document handling**: Built on the Longformer architecture, it can process documents of up to 16K tokens (see the tokenization sketch below).
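
To make use of the full 16K-token window on very long inputs, you can cap tokenization at the model's maximum length. A minimal sketch, reusing `tokenizer` and `case` from the Quick Start (the truncation settings are standard `transformers` options, not specific to this model):

```python
# Truncate overly long documents at led-base-16384's 16K-token window
inputs = tokenizer(case, return_tensors="pt", truncation=True, max_length=16384)
input_ids = inputs.input_ids.to(device)
```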
## 💻 Usage Examples

### Basic Usage

The basic setup and summarization call are identical to the Quick Start above.
### Advanced Usage

You can further customize summarization by adjusting the parameters of the `generate` method, for example changing the length of the generated summary or using a different decoding strategy:
```python
# Cap the generated summary at 200 tokens
sequences = model.generate(input_ids,
                           global_attention_mask=global_attention_mask,
                           max_length=200).sequences
summary = tokenizer.batch_decode(sequences, skip_special_tokens=True)
print(summary)
```
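
As one example of an alternative decoding strategy, here is a beam-search variant; the parameter values below are illustrative rather than tuned for this model:

```python
# Beam search with an n-gram repetition penalty (illustrative values)
sequences = model.generate(input_ids,
                           global_attention_mask=global_attention_mask,
                           max_length=200,
                           num_beams=4,
                           no_repeat_ngram_size=3,
                           early_stopping=True).sequences
summary = tokenizer.batch_decode(sequences, skip_special_tokens=True)
print(summary)
```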
## 📚 Documentation
As described in *Longformer: The Long-Document Transformer* by Iz Beltagy, Matthew E. Peters, and Arman Cohan, `led-base-16384` was initialized from `bart-base`, since both models share the exact same architecture. To be able to process 16K tokens, `bart-base`'s position embedding matrix (1,024 positions) was simply copied 16 times.
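
A minimal sketch of that initialization trick in isolation; the tensor below stands in for `bart-base`'s learned position embeddings, and the variable names are illustrative:

```python
import torch

# Stand-in for bart-base's 1,024 learned position embeddings (dim 768)
bart_pos_emb = torch.randn(1024, 768)

# Tile the matrix 16 times along the position axis -> 16,384 positions
led_pos_emb = bart_pos_emb.repeat(16, 1)
print(led_pos_emb.shape)  # torch.Size([16384, 768])
```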
## 🔧 Technical Details

The model is based on the Longformer Encoder-Decoder (LED) architecture. Fine-tuning on the ILC dataset helps it capture the specific patterns and characteristics of documents in that dataset, yielding better summarization performance on similar material.
## 📄 License

This project is licensed under the Apache-2.0 license.
## 📦 Installation

The model is loaded through the `transformers` library. Install the required packages via `pip`:

```bash
pip install transformers torch
```
## 📊 Evaluation results

When the model is used for summarizing ILC documents (10 samples), it achieves the following results:

| Model    | rouge1-f | rouge1-p | rouge2-f | rouge2-p | rougeL-f | rougeL-p |
|----------|----------|----------|----------|----------|----------|----------|
| led-ilc  | 42       | 47       | 22       | 24       | 39       | 44       |
| led-base | 3        | 39       | 1        | 21       | 3        | 37       |
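
A minimal sketch of how such ROUGE scores can be computed with the Hugging Face `evaluate` library; the prediction and reference texts below are placeholders, not the actual ILC samples:

```python
import evaluate

rouge = evaluate.load("rouge")

# Placeholder texts; in practice these would be the generated summaries
# and the ILC reference summaries for the evaluation samples
predictions = ["the court allowed the appeal"]
references = ["the appeal was allowed by the court"]

results = rouge.compute(predictions=predictions, references=references)
print(results)  # rouge1 / rouge2 / rougeL F-measures
```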
This notebook shows how LED can be used effectively for downstream tasks such as summarization.