# 📝 Longformer Encoder-Decoder (LED) fine-tuned on ILC
This project provides a version of the `led-base-16384` model fine-tuned on the ILC dataset. It targets long-document summarization, leveraging the Longformer Encoder-Decoder architecture.
## 🚀 Quick Start

### Prerequisites

Make sure you have the necessary libraries installed (see Installation below), then set up the model and tokenizer:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Run on GPU when available
device = "cuda" if torch.cuda.is_available() else "cpu"

checkpoint = "d0r1h/led-base-ilc"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, return_dict_in_generate=True).to(device)
```
### Basic Usage

Here is a basic example of using the model for summarization:
```python
case = "......."  # the document text to summarize

input_ids = tokenizer(case, return_tensors="pt").input_ids.to(device)

# LED expects a global attention mask; global attention on the first
# token is sufficient for summarization
global_attention_mask = torch.zeros_like(input_ids)
global_attention_mask[:, 0] = 1

sequences = model.generate(input_ids,
                           global_attention_mask=global_attention_mask).sequences
summary = tokenizer.batch_decode(sequences, skip_special_tokens=True)
print(summary)
```
## ✨ Features
- **Fine-tuned on ILC**: The model is fine-tuned on the ILC dataset, making it well suited to summarizing documents from that domain.
- **Long-document handling**: Built on the Longformer architecture, it can process documents of up to 16K tokens (see the tokenization sketch below).
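
To make use of the full 16K-token window on very long inputs, you can cap tokenization at the model's maximum length. A minimal sketch, reusing `tokenizer` and `case` from the Quick Start (the truncation settings are standard `transformers` options, not specific to this model):

```python
# Truncate overly long documents at led-base-16384's 16K-token window
inputs = tokenizer(case, return_tensors="pt", truncation=True, max_length=16384)
input_ids = inputs.input_ids.to(device)
```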
## 💻 Usage Examples

### Basic Usage

The basic setup and summarization call are identical to the Quick Start above.
### Advanced Usage

You can further customize summarization by adjusting the parameters of the `generate` method, for example changing the length of the generated summary or using a different decoding strategy:
```python
# Cap the generated summary at 200 tokens
sequences = model.generate(input_ids,
                           global_attention_mask=global_attention_mask,
                           max_length=200).sequences
summary = tokenizer.batch_decode(sequences, skip_special_tokens=True)
print(summary)
```
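
As one example of an alternative decoding strategy, here is a beam-search variant; the parameter values below are illustrative rather than tuned for this model:

```python
# Beam search with an n-gram repetition penalty (illustrative values)
sequences = model.generate(input_ids,
                           global_attention_mask=global_attention_mask,
                           max_length=200,
                           num_beams=4,
                           no_repeat_ngram_size=3,
                           early_stopping=True).sequences
summary = tokenizer.batch_decode(sequences, skip_special_tokens=True)
print(summary)
```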
## 📚 Documentation
As described in *Longformer: The Long-Document Transformer* by Iz Beltagy, Matthew E. Peters, and Arman Cohan, `led-base-16384` was initialized from `bart-base`, since both models share the exact same architecture. To be able to process 16K tokens, `bart-base`'s position embedding matrix (1,024 positions) was simply copied 16 times.
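
A minimal sketch of that initialization trick in isolation; the tensor below stands in for `bart-base`'s learned position embeddings, and the variable names are illustrative:

```python
import torch

# Stand-in for bart-base's 1,024 learned position embeddings (dim 768)
bart_pos_emb = torch.randn(1024, 768)

# Tile the matrix 16 times along the position axis -> 16,384 positions
led_pos_emb = bart_pos_emb.repeat(16, 1)
print(led_pos_emb.shape)  # torch.Size([16384, 768])
```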
## 🔧 Technical Details

The model is based on the Longformer Encoder-Decoder (LED) architecture. Fine-tuning on the ILC dataset helps it capture the specific patterns and characteristics of documents in that dataset, yielding better summarization performance on similar material.
## 📄 License

This project is licensed under the Apache-2.0 license.
## 📦 Installation

The model is loaded through the `transformers` library. Install the required packages via `pip`:

```bash
pip install transformers torch
```
## 📊 Evaluation results

When the model is used for summarizing ILC documents (10 samples), it achieves the following results:

| Model    | rouge1-f | rouge1-p | rouge2-f | rouge2-p | rougeL-f | rougeL-p |
|----------|----------|----------|----------|----------|----------|----------|
| led-ilc  | 42       | 47       | 22       | 24       | 39       | 44       |
| led-base | 3        | 39       | 1        | 21       | 3        | 37       |
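
A minimal sketch of how such ROUGE scores can be computed with the Hugging Face `evaluate` library; the prediction and reference texts below are placeholders, not the actual ILC samples:

```python
import evaluate

rouge = evaluate.load("rouge")

# Placeholder texts; in practice these would be the generated summaries
# and the ILC reference summaries for the evaluation samples
predictions = ["the court allowed the appeal"]
references = ["the appeal was allowed by the court"]

results = rouge.compute(predictions=predictions, references=references)
print(results)  # rouge1 / rouge2 / rougeL F-measures
```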
This notebook shows how LED can be used effectively for downstream tasks such as summarization.