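🚀 BART base model fine-tuned on CNN Dailymail
This model is a bart-base model fine-tuned on the CNN/Dailymail summarization dataset using Ainize Teachable-NLP.
The Bart model was proposed by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer on 29 Oct 2019. According to the abstract,
Bart uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT).
The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme, where spans of text are replaced with a single mask token.
BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.
The authors' code can be found here: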
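The two pretraining corruptions named in the abstract, sentence shuffling and span in-filling, can be illustrated with a small toy sketch. This is not the authors' implementation: the function names `permute_sentences` and `infill_spans`, the `MASK` string, the `mask_prob` and `max_span` parameters, and the uniform span-length sampling are all assumptions made for illustration, and the sketch works on whitespace-split words rather than real subword tokens.

```python
import random

MASK = "<mask>"  # placeholder symbol for illustration, not read from the real tokenizer


def permute_sentences(sentences, rng):
    """Return a copy of the sentences in random order (sentence shuffling)."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def infill_spans(tokens, rng, mask_prob=0.15, max_span=3):
    """Replace random spans of tokens with ONE mask token per span.

    Simplification: span lengths are drawn uniformly from 1..max_span.
    """
    out = []
    i = 0
    while i < len(tokens):
        if rng.random() < mask_prob:
            span = rng.randint(1, max_span)  # how many tokens this mask hides
            out.append(MASK)                 # the whole span collapses to one mask
            i += span
        else:
            out.append(tokens[i])
            i += 1
    return out


rng = random.Random(0)  # fixed seed so the example is repeatable
sentences = ["The cat sat .", "It was warm .", "Then it slept ."]
shuffled = permute_sentences(sentences, rng)
tokens = " ".join(shuffled).split()
noisy = infill_spans(tokens, rng)
print(" ".join(noisy))
```

Because each masked span collapses to a single mask token, the corrupted sequence is never longer than the original, and the model has to infer how many tokens are missing as well as what they are.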
https://github.com/pytorch/fairseq/tree/master/examples/bart
🚀 Quick Start
💻 Usage Examples
Basic Usage
from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration

# Load the tokenizer and the fine-tuned model from the Hugging Face Hub.
tokenizer = PreTrainedTokenizerFast.from_pretrained("ainize/bart-base-cnn")
model = BartForConditionalGeneration.from_pretrained("ainize/bart-base-cnn")
input_text = '(CNN) -- South Korea launched an investigation Tuesday into reports of toxic chemicals being dumped at a former U.S. military base, the Defense Ministry said. The tests follow allegations of American soldiers burying chemicals on Korean soil. The first tests are being carried out by a joint military, government and civilian task force at the site of what was Camp Mercer, west of Seoul. "Soil and underground water will be taken in the areas where toxic chemicals were allegedly buried," said the statement from the South Korean Defense Ministry. Once testing is finished, the government will decide on how to test more than 80 other sites -- all former bases. The alarm was raised this month when a U.S. veteran alleged barrels of the toxic herbicide Agent Orange were buried at an American base in South Korea in the late 1970s. Two of his fellow soldiers corroborated his story about Camp Carroll, about 185 miles (300 kilometers) southeast of the capital, Seoul. "We\'ve been working very closely with the Korean government since we had the initial claims," said Lt. Gen. John Johnson, who is heading the Camp Carroll Task Force. "If we get evidence that there is a risk to health, we are going to fix it." A joint U.S.- South Korean investigation is being conducted at Camp Carroll to test the validity of allegations. The U.S. military sprayed Agent Orange from planes onto jungles in Vietnam to kill vegetation in an effort to expose guerrilla fighters. Exposure to the chemical has been blamed for a wide variety of ailments, including certain forms of cancer and nerve disorders. It has also been linked to birth defects, according to the Department of Veterans Affairs. Journalist Yoonjung Seo contributed to this report.'
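# Tokenize the article into a batch containing one sequence of token IDs.
input_ids = tokenizer.encode(input_text, return_tensors="pt")
# Generate the summary with beam search.
summary_text_ids = model.generate(
    input_ids=input_ids,
    bos_token_id=model.config.bos_token_id,
    eos_token_id=model.config.eos_token_id,
    length_penalty=2.0,  # values above 0 favor longer summaries
    max_length=142,      # upper bound on the summary length, in tokens
    min_length=56,       # lower bound on the summary length, in tokens
    num_beams=4,         # number of beams kept during the search
)
# Decode the best sequence back to text, dropping special tokens.
print(tokenizer.decode(summary_text_ids[0], skip_special_tokens=True))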
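The effect of `length_penalty=2.0` in the call above can be seen with a little arithmetic. According to the Transformers documentation, the penalty is applied as an exponent to the sequence length, and the sequence score (a negative log-likelihood sum) is divided by the result, so values above 0 promote longer outputs. The sketch below is a simplified illustration of that rule, not the library's code: `beam_score` is a hypothetical helper, and the log-probabilities and lengths are made-up numbers.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized score: summed token log-probs / length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical finished candidates: (summed log-probability, length in tokens).
short_candidate = (-10.0, 20)
long_candidate = (-40.0, 60)

for penalty in (0.0, 1.0, 2.0):
    short_score = beam_score(*short_candidate, penalty)
    long_score = beam_score(*long_candidate, penalty)
    winner = "long" if long_score > short_score else "short"
    print(f"length_penalty={penalty}: short={short_score:.4f} long={long_score:.4f} -> {winner}")
```

With these made-up numbers the shorter candidate ranks higher at penalties 0.0 and 1.0, and the longer one ranks higher at 2.0, which is consistent with the card pairing a high penalty with `min_length=56` to get multi-sentence summaries.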
API
You can try this model through ainize.
📄 License
This project is released under the Apache 2.0 license.