led-base-ilc: a free, open-source model for efficient legal document summarization

Led Base Ilc

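Developed by d0r1h

A Longformer Encoder-Decoder model fine-tuned on the ILC dataset, built specifically for legal document summarization.

Text Generation

Transformers

License: Apache-2.0 #long-document-summarization #legal-document-processing #high-ROUGE-score

Downloads: 28

Released: 5/5/2022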

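Model Overview

This model is a fine-tuned version of led-base-16384 on the ILC dataset. It is suited to long-document summarization, particularly of long legal texts.

Model Features

Long-document handling

Handles documents of up to 16K tokens, which suits long texts such as legal filings.

Legal-domain tuning

Fine-tuned on the ILC legal dataset for better handling of legal text.

Efficient attention

Uses Longformer's sparse attention pattern to process long texts more efficiently.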
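The sparse attention pattern named above can be illustrated with a small sketch. This is not the Longformer implementation; it only shows which positions each token may attend to when a sliding local window is combined with a few global tokens. The function name `attended_positions`, the window size, and the sequence length are assumptions chosen for illustration.

```python
def attended_positions(seq_len, window, global_idx):
    """Return, for each position, the sorted positions it may attend to.

    Illustrative only: a token sees neighbours within `window // 2` on each
    side; tokens in `global_idx` see every position and are seen by all.
    """
    half = window // 2
    global_set = set(global_idx)
    pattern = []
    for i in range(seq_len):
        if i in global_set:
            allowed = set(range(seq_len))
        else:
            lo, hi = max(0, i - half), min(seq_len, i + half + 1)
            allowed = set(range(lo, hi)) | global_set
        pattern.append(sorted(allowed))
    return pattern


# Toy sizes (assumed): 16 tokens, window of 4, global attention on token 0.
pattern = attended_positions(seq_len=16, window=4, global_idx=[0])
sparse_pairs = sum(len(row) for row in pattern)
full_pairs = 16 * 16
print(sparse_pairs, "of", full_pairs, "query-key pairs are computed")
```

With a fixed window, the number of computed pairs grows roughly linearly with sequence length rather than quadratically, which is what makes 16K-token inputs practical.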

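Model Capabilities

Legal document summarization

Long-text understanding

Legal terminology recognition

Use Cases

Legal document processing

Court case summarization

Automatically generates concise summaries of court case documents.

ROUGE scores are significantly higher than those of the base model.

Legal document analysis

Extracts key information from lengthy legal documents.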
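For readers unfamiliar with the ROUGE scores mentioned above, here is a minimal sketch of ROUGE-1 F1 (unigram overlap). It is a simplification: real evaluations usually rely on a dedicated package, apply stemming, and also report ROUGE-2 and ROUGE-L. The function name `rouge1_f1` and the two example sentences are invented for illustration and do not come from the ILC dataset.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, for illustration only.
reference = "the court dismissed the appeal"
candidate = "the appeal was dismissed"
score = rouge1_f1(candidate, reference)
print(round(score, 3))
```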

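🚀 Longformer Encoder-Decoder (LED) fine-tuned on the ILC dataset

This model is a fine-tuned version of led-base-16384 on the ILC dataset. It summarizes long documents and offers a practical option for legal text.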

🚀 Quick Start

As described in Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, and Arman Cohan, led-base-16384 was initialized from bart-base, since the two models share exactly the same architecture. To handle 16K tokens, the position embedding matrix of bart-base was simply copied 16 times.
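The copying step can be pictured with a toy example. The sketch below assumes the position-embedding table is a list of rows (one per position) and repeats it end to end; bart-base has 1024 positions, so 16 copies give 1024 × 16 = 16384. The helper `tile_position_embeddings` and the 4 × 2 table are hypothetical and much smaller than the real matrix.

```python
def tile_position_embeddings(pos_emb, copies):
    """Repeat a position-embedding table `copies` times along the position axis."""
    return [list(row) for _ in range(copies) for row in pos_emb]


# Toy table (assumed): 4 positions x 2 dimensions.
short_table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
long_table = tile_position_embeddings(short_table, copies=16)

print(len(long_table))                    # number of positions after tiling
print(long_table[4] == short_table[0])    # position 4 reuses the row of position 0
```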

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Use the GPU when available; the fallback device string must be lowercase "cpu".
device = "cuda" if torch.cuda.is_available() else "cpu"

checkpoint = "d0r1h/led-base-ilc"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, return_dict_in_generate=True).to(device)

case = "......."  # the legal document to summarize
input_ids = tokenizer(case, return_tensors="pt").input_ids.to(device)

# Give the first token global attention; all other tokens keep local attention.
global_attention_mask = torch.zeros_like(input_ids)
global_attention_mask[:, 0] = 1

sequences = model.generate(input_ids,
                           global_attention_mask=global_attention_mask).sequences
summary = tokenizer.batch_decode(sequences,
                                 skip_special_tokens=True)
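The snippet above does not cap the input length. The sketch below mirrors its preparation step on plain Python lists, assuming a 16384-token limit (taken from the led-base-16384 name). `prepare_inputs` and `MAX_TOKENS` are hypothetical names; with the real tokenizer, passing `truncation=True` and `max_length=...` performs the truncation.

```python
MAX_TOKENS = 16384  # assumed input limit, taken from the led-base-16384 name


def prepare_inputs(token_ids, max_tokens=MAX_TOKENS):
    """Truncate token ids and build a matching global attention mask.

    Only the first token gets global attention (mask value 1);
    every other token stays on local attention (mask value 0).
    """
    ids = list(token_ids[:max_tokens])
    mask = [0] * len(ids)
    if mask:
        mask[0] = 1
    return ids, mask


# A stand-in for a 20000-token document (ids are placeholders).
ids, mask = prepare_inputs(range(20000))
print(len(ids), sum(mask))
```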
