🚀 BART base model fine-tuned on the CNN Dailymail dataset
This model was obtained by fine-tuning the bart-base model on the CNN/Dailymail summarization dataset using Ainize Teachable-NLP.
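The BART model was proposed by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 October 2019. According to the abstract:
BART uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT).
The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme, in which spans of text are replaced with a single mask token.
BART is particularly effective when fine-tuned for text generation, but it also works well for comprehension tasks. On GLUE and SQuAD it matches the performance of RoBERTa with comparable training resources, and it achieves new state-of-the-art results on a range of abstractive dialogue, question answering and summarization tasks, with gains of up to 6 ROUGE.
The authors' code is available here: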
https://github.com/pytorch/fairseq/tree/master/examples/bart
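ROUGE, the metric quoted above, measures n-gram overlap between a generated summary and a reference summary. The block below is a minimal sketch of the unigram variant (ROUGE-1), assuming lowercase whitespace tokenization; the function name `rouge1` is hypothetical, and published scores come from dedicated ROUGE tooling with its own preprocessing, so the numbers here are not directly comparable.

```python
from collections import Counter


def rouge1(reference, candidate):
    """Return (precision, recall, f1) for unigram overlap.

    Simplified illustration: tokens are lowercase whitespace-separated
    words, with no stemming or punctuation handling.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: each word counts at most as often as it appears
    # in both the reference and the candidate.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = rouge1("the cat sat on the mat", "the cat sat")
print(precision, recall)  # 1.0 0.5
```

In this example every candidate word appears in the reference (precision 1.0), but the candidate covers only three of the six reference words (recall 0.5).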
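🚀 Quick start
✨ Key features
- Built on a standard seq2seq/machine translation architecture that combines a bidirectional encoder with a left-to-right decoder.
- Pretrained by randomly shuffling sentence order and by a novel in-filling scheme.
- Performs well on both text generation and comprehension tasks, with new state-of-the-art results on several of them.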
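The two pretraining corruptions named above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the helper names `permute_sentences` and `infill_span` are hypothetical, tokens are whitespace-separated words rather than subword IDs, and the span position is chosen by hand instead of being sampled.

```python
import random

MASK = "<mask>"  # mask token string, assumed here for illustration


def permute_sentences(sentences, rng):
    """Sentence permutation: return the sentences in a random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def infill_span(tokens, start, length, mask=MASK):
    """Text infilling: replace tokens[start:start + length] with ONE mask token."""
    return tokens[:start] + [mask] + tokens[start + length:]


rng = random.Random(0)  # seeded so the shuffle is reproducible
document = ["The cat sat.", "It was warm.", "Then it slept."]
print(permute_sentences(document, rng))

tokens = "The cat sat on the warm mat".split()
print(infill_span(tokens, start=2, length=3))
# ['The', 'cat', '<mask>', 'warm', 'mat']
```

Because a whole span collapses into a single mask token, the model has to infer how many tokens are missing as well as what they are.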
💻 Usage example
Basic usage
from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration

# Load the fine-tuned tokenizer and model
tokenizer = PreTrainedTokenizerFast.from_pretrained("ainize/bart-base-cnn")
model = BartForConditionalGeneration.from_pretrained("ainize/bart-base-cnn")
input_text = '(CNN) -- South Korea launched an investigation Tuesday into reports of toxic chemicals being dumped at a former U.S. military base, the Defense Ministry said. The tests follow allegations of American soldiers burying chemicals on Korean soil. The first tests are being carried out by a joint military, government and civilian task force at the site of what was Camp Mercer, west of Seoul. "Soil and underground water will be taken in the areas where toxic chemicals were allegedly buried," said the statement from the South Korean Defense Ministry. Once testing is finished, the government will decide on how to test more than 80 other sites -- all former bases. The alarm was raised this month when a U.S. veteran alleged barrels of the toxic herbicide Agent Orange were buried at an American base in South Korea in the late 1970s. Two of his fellow soldiers corroborated his story about Camp Carroll, about 185 miles (300 kilometers) southeast of the capital, Seoul. "We\'ve been working very closely with the Korean government since we had the initial claims," said Lt. Gen. John Johnson, who is heading the Camp Carroll Task Force. "If we get evidence that there is a risk to health, we are going to fix it." A joint U.S.- South Korean investigation is being conducted at Camp Carroll to test the validity of allegations. The U.S. military sprayed Agent Orange from planes onto jungles in Vietnam to kill vegetation in an effort to expose guerrilla fighters. Exposure to the chemical has been blamed for a wide variety of ailments, including certain forms of cancer and nerve disorders. It has also been linked to birth defects, according to the Department of Veterans Affairs. Journalist Yoonjung Seo contributed to this report.'
# Tokenize the article into a batch containing one sequence of input IDs
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate the summary with beam search
summary_text_ids = model.generate(
    input_ids=input_ids,
    bos_token_id=model.config.bos_token_id,
    eos_token_id=model.config.eos_token_id,
    length_penalty=2.0,
    max_length=142,
    min_length=56,
    num_beams=4,
)

# Decode the generated IDs back into text
print(tokenizer.decode(summary_text_ids[0], skip_special_tokens=True))
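For articles longer than the one above, the input may need to be truncated before generation. The helper below is a minimal sketch that wraps the same steps in a function; the name `summarize` and the `max_input_tokens=1024` default are assumptions made for illustration (check the limit against `model.config.max_position_embeddings`). It expects the `tokenizer` and `model` objects loaded in the basic usage example.

```python
def summarize(text, tokenizer, model, max_input_tokens=1024):
    """Summarize one article, truncating over-long inputs.

    `summarize` is a hypothetical helper, and `max_input_tokens` is an
    assumed default rather than a value taken from the model card.
    """
    # Calling the tokenizer returns input IDs and an attention mask;
    # truncation=True cuts the sequence to at most max_input_tokens.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens,
    )
    # Same beam-search settings as the basic usage example.
    summary_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        length_penalty=2.0,
        max_length=142,
        min_length=56,
        num_beams=4,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)


# Usage, with the objects from the basic usage example:
# print(summarize(input_text, tokenizer, model))
```

Truncation simply discards text beyond the limit, so anything that appears only late in a long article cannot reach the summary.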
API
You can try this model through ainize.
📄 License
This project is licensed under the Apache-2.0 license.