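🚀 BART base model fine-tuned on the CNN/DailyMail dataset
This model was obtained by fine-tuning the bart-base model on the CNN/DailyMail summarization dataset using Ainize Teachable-NLP.
The BART model was proposed by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 October 2019. According to the abstract:
BART uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT).
The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme, in which spans of text are replaced with a single mask token.
BART is particularly effective when fine-tuned for text generation, but it also works well for comprehension tasks. On GLUE and SQuAD it matches the performance of RoBERTa with comparable training resources, and it achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.
The authors' code is available here: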
https://github.com/pytorch/fairseq/tree/master/examples/bart
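ROUGE, the metric quoted in the abstract, measures n-gram overlap between a generated summary and a reference summary. The following is a minimal sketch of ROUGE-1 F1 only, assuming lowercased whitespace tokenization; it is not the official ROUGE implementation, and the function name `rouge1_f1` and the two example sentences are made up for illustration.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary.

    Simplified sketch: lowercases and splits on whitespace, with no
    stemming or punctuation handling.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token is matched at most as often as it
    # appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative strings, not taken from the dataset.
reference = "south korea tests soil at former us base"
candidate = "south korea tests soil near a former base"
score = rouge1_f1(reference, candidate)
print(f"ROUGE-1 F1: {score:.3f}")
```

Here 6 of the 8 tokens on each side overlap, so precision, recall and F1 are all 0.75. Published ROUGE numbers are normally computed with an established ROUGE package rather than a hand-rolled function like this one.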
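🚀 Quick Start
✨ Key Features
- Built on a standard seq2seq/machine translation architecture that combines a bidirectional encoder with a left-to-right decoder.
- Pretrained by randomly shuffling sentence order and by a novel in-filling scheme.
- Performs well on both text generation and comprehension tasks, with new state-of-the-art results on several of them.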
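The two pretraining corruptions named above can be illustrated with a small sketch. This is a simplified toy version and not the authors' fairseq code: it works on whitespace tokens, masks one span at a fixed position, and uses the hypothetical helper names `permute_sentences` and `infill_span`. In the BART paper, span lengths are sampled from a Poisson distribution, which this sketch omits.

```python
import random

MASK = "<mask>"


def permute_sentences(sentences, rng):
    """Sentence permutation: return the sentences in a random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def infill_span(tokens, start, length):
    """Text infilling: replace tokens[start:start+length] with ONE mask token.

    The model is not told how many tokens the mask hides, so it has to
    predict the span length as well as its content.
    """
    return tokens[:start] + [MASK] + tokens[start + length:]


rng = random.Random(0)  # fixed seed so the shuffle is reproducible
sentences = [
    "The tests follow allegations.",
    "Soil will be sampled.",
    "Results come later.",
]
shuffled = permute_sentences(sentences, rng)

tokens = "Soil and underground water will be taken".split()
corrupted = infill_span(tokens, start=1, length=3)

print(shuffled)
print(corrupted)
```

During pretraining the model receives the corrupted text as encoder input and is trained to reconstruct the original text with the decoder.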
💻 Usage Examples
Basic usage
from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration
tokenizer = PreTrainedTokenizerFast.from_pretrained("ainize/bart-base-cnn")
model = BartForConditionalGeneration.from_pretrained("ainize/bart-base-cnn")
input_text = '(CNN) -- South Korea launched an investigation Tuesday into reports of toxic chemicals being dumped at a former U.S. military base, the Defense Ministry said. The tests follow allegations of American soldiers burying chemicals on Korean soil. The first tests are being carried out by a joint military, government and civilian task force at the site of what was Camp Mercer, west of Seoul. "Soil and underground water will be taken in the areas where toxic chemicals were allegedly buried," said the statement from the South Korean Defense Ministry. Once testing is finished, the government will decide on how to test more than 80 other sites -- all former bases. The alarm was raised this month when a U.S. veteran alleged barrels of the toxic herbicide Agent Orange were buried at an American base in South Korea in the late 1970s. Two of his fellow soldiers corroborated his story about Camp Carroll, about 185 miles (300 kilometers) southeast of the capital, Seoul. "We\'ve been working very closely with the Korean government since we had the initial claims," said Lt. Gen. John Johnson, who is heading the Camp Carroll Task Force. "If we get evidence that there is a risk to health, we are going to fix it." A joint U.S.- South Korean investigation is being conducted at Camp Carroll to test the validity of allegations. The U.S. military sprayed Agent Orange from planes onto jungles in Vietnam to kill vegetation in an effort to expose guerrilla fighters. Exposure to the chemical has been blamed for a wide variety of ailments, including certain forms of cancer and nerve disorders. It has also been linked to birth defects, according to the Department of Veterans Affairs. Journalist Yoonjung Seo contributed to this report.'
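# Tokenize the article and return a PyTorch tensor of token ids.
input_ids = tokenizer.encode(input_text, return_tensors="pt")
# Generate a summary with beam search (4 beams, 56 to 142 tokens).
summary_text_ids = model.generate(
    input_ids=input_ids,
    bos_token_id=model.config.bos_token_id,
    eos_token_id=model.config.eos_token_id,
    length_penalty=2.0,
    max_length=142,
    min_length=56,
    num_beams=4,
)
# Decode the best beam back to text, dropping special tokens.
print(tokenizer.decode(summary_text_ids[0], skip_special_tokens=True))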
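The `length_penalty=2.0` argument above affects how finished beams are ranked. As a rough sketch, beam search in transformers scores a hypothesis as its summed log-probability divided by `length ** length_penalty`, so values above 0 favor longer outputs. The code below reproduces only that arithmetic; the log-probabilities and lengths are made-up numbers, the helper names are hypothetical, and the exact length used internally may differ between library versions.

```python
def length_normalized_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Beam ranking score: summed log-probability over length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical finished beams: (summed log-probability, length in tokens).
candidates = {
    "short": (-6.0, 10),
    "long": (-9.0, 20),
}


def preferred(length_penalty: float) -> str:
    """Return the name of the candidate with the highest normalized score."""
    return max(
        candidates,
        key=lambda name: length_normalized_score(*candidates[name], length_penalty),
    )


for lp in (0.0, 1.0, 2.0):
    print(f"length_penalty={lp}: prefers {preferred(lp)}")
```

With no normalization (`length_penalty=0.0`) the shorter beam wins because its raw log-probability is higher; with 1.0 or 2.0 the longer beam wins, which suits summaries that must reach `min_length`.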
API
You can try this model through ainize.
📄 License
This project is licensed under the Apache-2.0 license.