finetuned_bart open-source text summarization model - fine-tuned on CNN/DailyMail to distill the key points of a text

Finetuned Bart

Developed by Mousumi

A sequence-to-sequence model based on the BART architecture, fine-tuned on the CNN/DailyMail dataset for text summarization.

Large language model

Transformers

#News summarization #Fine-tuned BART model

Downloads: 19

Release date: 3/2/2022

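Model overview

This is a sequence-to-sequence model based on the BART architecture. It was fine-tuned on the CNN/DailyMail dataset and is intended mainly for text summarization: it condenses long text into a concise summary.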

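Model features

Sequence-to-sequence modeling

Takes an input sequence and generates an output sequence, which suits tasks such as text summarization.

Bidirectional encoder

Combines a bidirectional encoder with an autoregressive decoder: the encoder reads the whole input in both directions, and the decoder generates the summary one token at a time.

Fine-tuning

Fine-tuned on the CNN/DailyMail dataset specifically for text summarization.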
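
To make the difference between the bidirectional encoder and the autoregressive decoder described above more concrete, the following is a minimal pure-Python sketch of the two self-attention patterns. It is an illustration only, not the model's actual implementation, and the function names `bidirectional_mask` and `causal_mask` are hypothetical. A 1 in row i, column j means position i may attend to position j.

```python
from typing import List


def bidirectional_mask(n: int) -> List[List[int]]:
    """Encoder-style pattern: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n: int) -> List[List[int]]:
    """Decoder-style pattern: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    print("encoder (bidirectional):")
    for row in bidirectional_mask(3):
        print(row)
    print("decoder (causal):")
    for row in causal_mask(3):
        print(row)
```

In a full encoder-decoder model the decoder additionally attends to the encoder's output, which this sketch leaves out.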

Model capabilities

Text summarization

Sequence generation

Text compression

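Use cases

News summarization

News article summarization

Condenses long news articles into concise summaries while retaining the key information.

Produces short news summaries suited to quick skimming.

Content generation

Text rewriting

Rewrites long text into a more concise version while preserving the core content.

Produces a concise, information-dense version of the text.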

🚀 Fine-tuned BART model

This BART model was fine-tuned on the CNN/DailyMail dataset using 10,000 samples.

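🚀 Quick start

This model is a BART model fine-tuned on the CNN/DailyMail dataset and can be used for text summarization. The following example code shows how to use it: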

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch


src_text = [" PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow.", "In the end, it played out like a movie. A tense, heartbreaking story, and then a surprise twist at the end. As eight of Mary Jane Veloso's fellow death row inmates -- mostly foreigners, like her -- were put to death by firing squad early Wednesday in a wooded grove on the Indonesian island of Nusa Kambangan, the Filipina maid and mother of two was spared, at least for now. Her family was returning from what they thought was their final visit to the prison on so-called \"execution island\" when a Philippine TV crew flagged their bus down to tell them of the decision to postpone her execution. Her ecstatic mother, Celia Veloso, told CNN: \"We are so happy, so happy. I thought I had lost my daughter already but God is so good. Thank you to everyone who helped us."]

torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = AutoTokenizer.from_pretrained("Mousumi/finetuned_bart")

model = AutoModelForSeq2SeqLM.from_pretrained("Mousumi/finetuned_bart").to(torch_device)

result = []

# Summarize each source text in turn.
for text in src_text:
    # Tokenize the input directly; as_target_tokenizer() is meant for labels, not inputs.
    batch = tokenizer([text], return_tensors='pt', padding=True, truncation=True).to(torch_device)
    # Generate the summary token IDs, then decode them back to text.
    summary_ids = model.generate(**batch)
    summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)
    result.append(summary[0])

print(result)
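
Because the example above tokenizes with `truncation=True`, any input longer than the tokenizer's maximum length is cut off and the remainder is never summarized. One possible workaround, which is not part of the original model card, is to split a long article into chunks and summarize each chunk separately. The sketch below assumes whitespace-separated words are an acceptable rough proxy for tokens; the helper names `split_into_chunks` and `summarize_long_text`, the `summarize_fn` callable, and the default of 400 words are all hypothetical choices for illustration.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int = 400) -> List[str]:
    """Split text into consecutive chunks of at most max_words words.

    Words are a rough stand-in for tokens (an assumption); the real token
    count depends on the tokenizer.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long_text(
    text: str,
    summarize_fn: Callable[[str], str],
    max_words: int = 400,
) -> str:
    """Summarize each chunk with summarize_fn and join the partial summaries."""
    chunks = split_into_chunks(text, max_words)
    return " ".join(summarize_fn(chunk) for chunk in chunks)
```

Here `summarize_fn` could wrap the tokenize, generate, and decode steps from the example above for a single string. Summarizing chunks independently loses context that spans chunk boundaries, so the joined result may read less smoothly than a summary of a text that fits in one pass.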