bart-base-cnn: an open-source text summarization model for generating high-quality summaries

Bart Base Cnn

Developed by ainize

This is a bart-base model fine-tuned on the CNN/DailyMail summarization dataset. It is suited to text summarization tasks.

Text Generation

Transformers

Language: English

License: Apache-2.0

Tags: #NewsSummarization #BARTFineTuning #ROUGEOptimization

Downloads: 749

Released: 3/2/2022

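Model Overview

A sequence-to-sequence model based on the BART architecture and fine-tuned specifically for news summarization. It condenses long texts into concise summaries that capture their key information.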

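Model Features

Bidirectional encoder architecture

Combines a BERT-style bidirectional encoder with a GPT-style autoregressive decoder, so it handles both comprehension and generation

Novel pretraining tasks

Pretrained with a text in-filling scheme and a sentence-shuffling task, which strengthens text comprehension

Efficient summary generation

Fine-tuned on the CNN/DailyMail dataset. The original BART paper reports gains of up to 6 ROUGE on summarization and related generation tasks; that figure comes from the paper, not from an evaluation of this checkpoint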
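
For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a generated summary and a reference summary. The following is a minimal sketch in plain Python, assuming whitespace tokenization and lowercase matching. The function name `rouge1_f1` is illustrative and not part of this model or of any library; published scores are normally computed with a dedicated ROUGE package that also applies stemming and other normalization.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap with clipped counts."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up sentences, used only to exercise the function.
    reference = "south korea tests soil at former us base"
    candidate = "south korea tests soil near old base"
    print(f"ROUGE-1 F1: {rouge1_f1(candidate, reference):.3f}")
```

This sketch returns a value between 0 and 1. ROUGE results are usually reported multiplied by 100, so a gain of 6 ROUGE corresponds to 0.06 on this scale.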

Model Capabilities

News summarization

Long-text compression

Key information extraction

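Use Cases

News Media

News briefing generation

Automatically generates key-point summaries from long news reports

Produces concise summaries that read like human writing

Content Analysis

Document summarization

Generates executive summaries of technical documents or reports

Produces a condensed version that retains the key information of the source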

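🚀 BART base model fine-tuned on the CNN/DailyMail dataset

This model was obtained by fine-tuning the bart-base model on the CNN/DailyMail summarization dataset using Ainize Teachable-NLP.

The BART model was proposed in "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension" by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 October 2019. According to the abstract:

BART uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT).

The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme, in which spans of text are replaced with a single mask token.

BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.

The authors' code can be found here: https://github.com/pytorch/fairseq/tree/master/examples/bart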
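
Independently of the authors' fairseq code, the two noising steps named in the abstract can be illustrated with a toy sketch. It assumes whitespace tokens, sentences split on full stops, and a caller-chosen span position and length; actual BART pretraining works on subword tokens and samples spans at random. The names `permute_sentences`, `infill_span` and `MASK` are hypothetical and exist only for this illustration.

```python
import random
from typing import List

MASK = "<mask>"


def permute_sentences(text: str, rng: random.Random) -> str:
    """Split on full stops and shuffle the sentence order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."


def infill_span(tokens: List[str], start: int, length: int) -> List[str]:
    """Replace tokens[start:start + length] with a single mask token."""
    return tokens[:start] + [MASK] + tokens[start + length:]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the shuffle is repeatable
    doc = "The base was closed. Tests began on Tuesday. Results are pending."
    print(permute_sentences(doc, rng))

    tokens = "Soil and underground water will be taken".split()
    # Masks the three tokens "and underground water" with one <mask>.
    print(" ".join(infill_span(tokens, start=1, length=3)))
```

Because a whole span collapses into one mask token, the model has to infer how many tokens are missing as well as what they are. During pretraining it is trained to reconstruct the original text from the corrupted input.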

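🚀 Quick Start

✨ Key Features

Built on a standard seq2seq/machine translation architecture that combines a bidirectional encoder with a left-to-right decoder.
Pretrained by randomly shuffling sentence order and applying a novel in-filling scheme.
Performs well on both text generation and comprehension tasks, with new state-of-the-art results reported on several tasks.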
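
The difference between a bidirectional encoder and a left-to-right decoder comes down to which positions each token may attend to. The sketch below shows the two self-attention mask patterns as nested Python lists, under the assumption that 1 means "may attend" and 0 means "blocked". It leaves out padding masks and the decoder's cross-attention to the encoder, and the function names are illustrative rather than taken from any library.

```python
from typing import List


def bidirectional_mask(n: int) -> List[List[int]]:
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n: int) -> List[List[int]]:
    """Decoder-style mask: position i may attend only to positions <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    for name, mask in (("encoder", bidirectional_mask(4)),
                       ("decoder", causal_mask(4))):
        print(name)
        for row in mask:
            print(" ".join(str(v) for v in row))
```

The encoder mask is all ones, while the decoder mask is lower triangular, which is what allows the decoder to generate a summary one token at a time.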

💻 Usage Example

Basic usage

from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration

# Load the tokenizer and the model
tokenizer = PreTrainedTokenizerFast.from_pretrained("ainize/bart-base-cnn")
model = BartForConditionalGeneration.from_pretrained("ainize/bart-base-cnn")

# Encode Input Text
input_text = '(CNN) -- South Korea launched an investigation Tuesday into reports of toxic chemicals being dumped at a former U.S. military base, the Defense Ministry said. The tests follow allegations of American soldiers burying chemicals on Korean soil. The first tests are being carried out by a joint military, government and civilian task force at the site of what was Camp Mercer, west of Seoul. "Soil and underground water will be taken in the areas where toxic chemicals were allegedly buried," said the statement from the South Korean Defense Ministry. Once testing is finished, the government will decide on how to test more than 80 other sites -- all former bases. The alarm was raised this month when a U.S. veteran alleged barrels of the toxic herbicide Agent Orange were buried at an American base in South Korea in the late 1970s. Two of his fellow soldiers corroborated his story about Camp Carroll, about 185 miles (300 kilometers) southeast of the capital, Seoul. "We\'ve been working very closely with the Korean government since we had the initial claims," said Lt. Gen. John Johnson, who is heading the Camp Carroll Task Force. "If we get evidence that there is a risk to health, we are going to fix it." A joint U.S.- South Korean investigation is being conducted at Camp Carroll to test the validity of allegations. The U.S. military sprayed Agent Orange from planes onto jungles in Vietnam to kill vegetation in an effort to expose guerrilla fighters. Exposure to the chemical has been blamed for a wide variety of ailments, including certain forms of cancer and nerve disorders. It has also been linked to birth defects, according to the Department of Veterans Affairs. Journalist Yoonjung Seo contributed to this report.'

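# Truncate to BART's 1024-token input limit so that longer articles do not overflow the position embeddings
input_ids = tokenizer.encode(input_text, return_tensors="pt", truncation=True, max_length=1024)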

# Generate Summary Text Ids
summary_text_ids = model.generate(
    input_ids=input_ids,
    bos_token_id=model.config.bos_token_id,
    eos_token_id=model.config.eos_token_id,
    length_penalty=2.0,
    max_length=142,
    min_length=56,
    num_beams=4,
)

# Decoding Text
print(tokenizer.decode(summary_text_ids[0], skip_special_tokens=True))
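
The `length_penalty=2.0` setting above affects how finished beams are ranked. The sketch below assumes the normalization used by beam search in transformers, where a hypothesis's cumulative log-probability is divided by its length raised to `length_penalty`. The helper `beam_score` and the two hypotheses are made up for illustration and are not library code.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized beam score: sum of log-probs / length ** penalty."""
    return sum_logprobs / (length ** length_penalty)


if __name__ == "__main__":
    # Two made-up hypotheses: (cumulative log-probability, length in tokens).
    short_hyp = (-12.0, 20)
    long_hyp = (-30.0, 60)
    for penalty in (0.0, 1.0, 2.0):
        short_s = beam_score(*short_hyp, penalty)
        long_s = beam_score(*long_hyp, penalty)
        winner = "long" if long_s > short_s else "short"
        print(f"length_penalty={penalty}: short={short_s:.4f} "
              f"long={long_s:.4f} -> {winner}")
```

Log-probabilities are negative, so dividing by a larger power of the length moves the score of a long hypothesis closer to zero. A positive `length_penalty` therefore favors longer summaries, within the bounds set by `min_length` and `max_length`.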