# 📄 Model Card: LEGIT-BART Series
The LEGIT-BART series consists of pre-trained transformer models for Italian legal text processing. They build on the BART-IT architecture and are further pre-trained on Italian legal corpora, offering an extended context length and the ability to handle long legal documents.
## 🚀 Quick Start

Here's a basic example of how to use the LEGIT-BART-LSG-16384 model for text summarization:
```python
from transformers import BartForConditionalGeneration, AutoTokenizer

model_name = "morenolq/LEGIT-BART-LSG-16384"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Example input: an Italian legal snippet ("Art. 1234: the contract is deemed
# concluded when..."); <mask> is the tokenizer's mask token.
input_text = "<mask> 1234: Il contratto si intende concluso quando..."
inputs = tokenizer(input_text, return_tensors="pt", max_length=16384, truncation=True)

# Generate an abstractive summary with beam search.
summary_ids = model.generate(inputs.input_ids, max_length=150, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("📝 Summary:", summary)
```
⨠Features
- Extended Context Length: Utilizes Local-Sparse-Global (LSG) Attention to support up to 16,384 tokens, enabling the processing of long legal documents.
- Trained on Legal Documents: The models are pre-trained on a diverse range of legal texts, including statutes, case law, and contracts.
- Flexible Adaptation: While not fine-tuned for specific tasks, the models can be adapted to a variety of downstream legal NLP tasks; a hedged fine-tuning sketch follows this list.
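
Since the checkpoints are general pre-trained models, task adaptation follows the standard Hugging Face seq2seq fine-tuning recipe. Below is a minimal sketch for a summarization-style use case; the repo ID `morenolq/LEGIT-BART` is inferred from the series naming, and the toy dataset, hyperparameters, and output directory are illustrative placeholders, not values from the LEGIT-BART paper.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    BartForConditionalGeneration,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "morenolq/LEGIT-BART"  # assumed repo ID, following the series naming
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Toy document/summary pair standing in for a real legal summarization corpus.
train_data = Dataset.from_dict({
    "document": ["Art. 1321 c.c.: Il contratto è l'accordo di due o più parti per "
                 "costituire, regolare o estinguere un rapporto giuridico patrimoniale."],
    "summary": ["Definizione di contratto."],
})

def preprocess(batch):
    # Tokenize inputs and targets; truncation lengths are placeholders.
    model_inputs = tokenizer(batch["document"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = train_data.map(preprocess, batched=True, remove_columns=train_data.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="legit-bart-summarizer",  # placeholder output path
    num_train_epochs=1,
    per_device_train_batch_size=1,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),  # dynamic padding
)
trainer.train()
```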
## 📦 Available Models
| Model | Description | Link |
|---|---|---|
| LEGIT-BART | Continued pre-training of `morenolq/bart-it` on Italian legal texts | 🔗 Link |
| LEGIT-BART-LSG-4096 | Continued pre-training of `morenolq/bart-it`, supporting 4,096 tokens | 🔗 Link |
| LEGIT-BART-LSG-16384 | Continued pre-training of `morenolq/bart-it`, supporting 16,384 tokens | 🔗 Link |
| LEGIT-SCRATCH-BART | Trained from scratch on Italian legal texts | 🔗 Link |
| LEGIT-SCRATCH-BART-LSG-4096 | Trained from scratch with LSG attention, supporting 4,096 tokens | 🔗 Link |
| LEGIT-SCRATCH-BART-LSG-16384 | Trained from scratch with LSG attention, supporting 16,384 tokens | 🔗 Link |
| BART-IT-LSG-4096 | `morenolq/bart-it` with LSG attention, supporting 4,096 tokens (no legal adaptation) | 🔗 Link |
| BART-IT-LSG-16384 | `morenolq/bart-it` with LSG attention, supporting 16,384 tokens (no legal adaptation) | 🔗 Link |
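
Choosing among the legal-adapted variants is mostly a question of matching input length to the supported context. A hypothetical helper illustrating that decision; the repo IDs other than `morenolq/LEGIT-BART-LSG-16384` are inferred from the quick-start naming pattern, and the 1,024-token base context is the standard BART limit, assumed here rather than stated in this card:

```python
def pick_legit_bart(num_tokens: int) -> str:
    """Map an input length (in tokens) to a LEGIT-BART checkpoint from the table above."""
    if num_tokens <= 1024:  # assumption: base BART context window
        return "morenolq/LEGIT-BART"
    if num_tokens <= 4096:
        return "morenolq/LEGIT-BART-LSG-4096"
    return "morenolq/LEGIT-BART-LSG-16384"
```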
## 🔧 Technical Details

### Architecture
- Base Model: `morenolq/bart-it`
- Transformer Encoder-Decoder: Employs a standard encoder-decoder architecture for sequence-to-sequence tasks.
- LSG Attention: The LSG variants implement Local-Sparse-Global attention to handle long documents efficiently.
- Specific Tokenizers: The models trained from scratch use tokenizers built for the legal domain, although they may underperform the continued pre-training variants; see the comparison sketch after this list.
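
One practical consequence of the separate tokenizers is that the same legal passage segments differently under the from-scratch vocabulary and the BART-IT vocabulary. A small sketch, assuming the from-scratch checkpoint follows the `morenolq/LEGIT-SCRATCH-BART` naming pattern and exposes its tokenizer via `AutoTokenizer`:

```python
from transformers import AutoTokenizer

# Assumption: repo ID inferred from the naming pattern in the table above.
scratch_tok = AutoTokenizer.from_pretrained("morenolq/LEGIT-SCRATCH-BART")
base_tok = AutoTokenizer.from_pretrained("morenolq/bart-it")

text = "Il contratto si intende concluso quando..."
print("from-scratch tokens:", len(scratch_tok(text).input_ids))
print("bart-it tokens:     ", len(base_tok(text).input_ids))
```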
### Training Data

The models are pre-trained on a diverse collection of Italian legal texts, including statutes, case law, and contracts.
## 📚 Documentation

The LEGIT-BART models are presented in the following paper:
```bibtex
@article{benedetto2025legitbart,
  title     = {LegItBART: a summarization model for Italian legal documents},
  author    = {Benedetto, Irene and La Quatra, Moreno and Cagliero, Luca},
  year      = {2025},
  journal   = {Artificial Intelligence and Law},
  publisher = {Springer},
  pages     = {1--31},
  doi       = {10.1007/s10506-025-09436-y},
  url       = {https://doi.org/10.1007/s10506-025-09436-y}
}
```
## 📜 License
This project is licensed under the MIT License.
## ⚠️ Important Note
The models are not fine-tuned for specific tasks and may require further adaptation before use in downstream legal NLP applications. Legal texts can also contain biases, so care should be taken to ensure fair and ethical use. These models are not a substitute for professional legal advice.