bart-base-cnn Open-Source Text Summarization Model - Generates Concise, High-Quality Summaries


Bart Base Cnn

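Developed by ainize

This model is bart-base fine-tuned on the CNN/DailyMail summarization dataset and is intended for text summarization.

Text Generation

Transformers

Language: English | License: Apache-2.0 | #NewsSummarization #BARTFineTuning #ROUGEOptimization

Downloads: 749

Released: 3/2/2022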

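Model Overview

A sequence-to-sequence model built on the BART architecture and fine-tuned specifically for news summarization. It extracts the key information from long text and produces a concise summary.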

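Model Features

Bidirectional Encoder with Autoregressive Decoder

Combines a BERT-style bidirectional encoder with a GPT-style autoregressive decoder, so the model can both understand and generate text.

Novel Pretraining Tasks

Pretrained with a text in-filling scheme and a sentence-shuffling task, which strengthens text understanding.

Efficient Summary Generation

Fine-tuned on the CNN/DailyMail dataset. The original BART paper reports gains of up to 6 ROUGE points for BART on abstractive generation tasks.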

Model Capabilities

News summarization

Long-text compression

Key information extraction

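Use Cases

News Media

News brief generation

Automatically generates key-point summaries from long news reports

Produces concise summaries that read like human writing

Content Analysis

Document summarization

Generates executive summaries of technical documents or reports

Produces a condensed version that preserves the key information of the original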

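🚀 BART Base Model Fine-Tuned on the CNN/DailyMail Dataset

This model was obtained by fine-tuning the bart-base model on the CNN/DailyMail summarization dataset using Ainize Teachable-NLP.

The BART model was proposed by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer on 29 October 2019. According to the abstract:

BART uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT).

The pretraining tasks are randomly shuffling the order of the original sentences and a novel in-filling scheme, in which spans of text are replaced with a single mask token.

BART is particularly effective when fine-tuned for text generation, but it also works well for comprehension tasks. On GLUE and SQuAD it matches the performance of RoBERTa with comparable training resources, and it achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.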
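The two pretraining corruptions described in the abstract (sentence shuffling and text in-filling) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `permute_sentences` and `infill_span`, the `<mask>` string, and the naive split on periods are assumptions made for illustration, and the span position and length are passed in explicitly rather than sampled.

```python
import random

MASK = "<mask>"  # placeholder; the real mask token depends on the tokenizer


def permute_sentences(text, seed=0):
    """Shuffle sentence order, splitting naively on periods."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."


def infill_span(tokens, start, length):
    """Replace tokens[start:start+length] with ONE mask token."""
    return tokens[:start] + [MASK] + tokens[start + length:]


doc = "The cat sat. It was warm. Then it slept."
shuffled = permute_sentences(doc)

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted = infill_span(tokens, start=1, length=3)

print(shuffled)
print(corrupted)
```

Because a whole span collapses into a single mask token, the corrupted sequence is shorter than the original, and the model has to infer how many tokens are missing as well as what they are. During pretraining the model is trained to reconstruct the original text from such corrupted input.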

The authors' code can be found here: https://github.com/pytorch/fairseq/tree/master/examples/bart


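✨ Key Features

Built on a standard seq2seq/machine translation architecture that pairs a bidirectional encoder with a left-to-right decoder.
Pretrained by randomly shuffling sentence order and by a novel text in-filling scheme.
Performs well on both text generation and comprehension tasks, with new state-of-the-art results on several tasks.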
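The difference between a bidirectional encoder and a left-to-right decoder comes down to which positions each token is allowed to attend to. The sketch below builds the two self-attention visibility patterns as plain Python lists. The helper names `bidirectional_mask` and `causal_mask` are hypothetical, and the sketch leaves out the decoder's cross-attention to the encoder output, which a real BART model also has.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


enc = bidirectional_mask(4)
dec = causal_mask(4)

print("encoder (bidirectional):")
for row in enc:
    print(row)

print("decoder (left-to-right):")
for row in dec:
    print(row)
```

The encoder mask is all ones, so each input token sees the full article. The decoder mask is lower-triangular, so each summary token is predicted only from the tokens generated before it.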


💻 Usage Example

Basic Usage

from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration

# Load the model and tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("ainize/bart-base-cnn")
model = BartForConditionalGeneration.from_pretrained("ainize/bart-base-cnn")

# Encode Input Text
input_text = '(CNN) -- South Korea launched an investigation Tuesday into reports of toxic chemicals being dumped at a former U.S. military base, the Defense Ministry said. The tests follow allegations of American soldiers burying chemicals on Korean soil. The first tests are being carried out by a joint military, government and civilian task force at the site of what was Camp Mercer, west of Seoul. "Soil and underground water will be taken in the areas where toxic chemicals were allegedly buried," said the statement from the South Korean Defense Ministry. Once testing is finished, the government will decide on how to test more than 80 other sites -- all former bases. The alarm was raised this month when a U.S. veteran alleged barrels of the toxic herbicide Agent Orange were buried at an American base in South Korea in the late 1970s. Two of his fellow soldiers corroborated his story about Camp Carroll, about 185 miles (300 kilometers) southeast of the capital, Seoul. "We\'ve been working very closely with the Korean government since we had the initial claims," said Lt. Gen. John Johnson, who is heading the Camp Carroll Task Force. "If we get evidence that there is a risk to health, we are going to fix it." A joint U.S.- South Korean investigation is being conducted at Camp Carroll to test the validity of allegations. The U.S. military sprayed Agent Orange from planes onto jungles in Vietnam to kill vegetation in an effort to expose guerrilla fighters. Exposure to the chemical has been blamed for a wide variety of ailments, including certain forms of cancer and nerve disorders. It has also been linked to birth defects, according to the Department of Veterans Affairs. Journalist Yoonjung Seo contributed to this report.'

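# BART accepts at most 1024 input tokens, so truncate longer articles
input_ids = tokenizer.encode(input_text, return_tensors="pt", truncation=True, max_length=1024)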

# Generate Summary Text Ids
summary_text_ids = model.generate(
    input_ids=input_ids,
    bos_token_id=model.config.bos_token_id,
    eos_token_id=model.config.eos_token_id,
    length_penalty=2.0,
    max_length=142,
    min_length=56,
    num_beams=4,
)

# Decoding Text
print(tokenizer.decode(summary_text_ids[0], skip_special_tokens=True))
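The `generate` call above uses beam search (`num_beams=4`), bounds the summary length (`min_length=56`, `max_length=142`), and sets `length_penalty=2.0`. The sketch below illustrates the effect of the length penalty. It assumes that a finished beam is scored as its summed token log-probability divided by `length ** length_penalty`, which is how Hugging Face Transformers' beam search normalizes scores as far as we know; the exact definition of length may vary between library versions. The two beams and their numbers are made up for illustration.

```python
def beam_score(sum_logprob, length, length_penalty):
    """Length-normalized beam score: summed log-prob / length**penalty."""
    return sum_logprob / (length ** length_penalty)


# Two hypothetical finished beams (made-up numbers, illustration only)
short = {"sum_logprob": -4.0, "length": 10}
long = {"sum_logprob": -6.0, "length": 20}

for penalty in (0.0, 1.0, 2.0):
    s = beam_score(short["sum_logprob"], short["length"], penalty)
    l = beam_score(long["sum_logprob"], long["length"], penalty)
    winner = "short" if s > l else "long"
    print(f"length_penalty={penalty}: short={s:.4f} long={l:.4f} -> {winner}")
```

Log-probabilities are negative, so dividing by a larger power of the length moves the score of a long hypothesis closer to zero. With these made-up numbers the shorter beam wins at `length_penalty=0.0`, while the longer beam wins at `1.0` and `2.0`, which is why values above 1 tend to favor longer summaries.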