bart-large-mnli-yahoo-answers: open-source model, free to use for zero-shot topic classification on Yahoo Answers

Bart Large Mnli Yahoo Answers

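Developed by joeddav

A zero-shot classification model fine-tuned from BART-large-MNLI and optimized for Yahoo Answers topic classification

Text classification, English #zero-shot classification #topic identification #optimized for Yahoo Answers

Downloads: 190.85k

Published: 3/2/2022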

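Model overview

This model is optimized for the Yahoo Answers topic classification task. It predicts whether a given topic label applies to a text sequence and supports zero-shot classification.

Model features

Zero-shot classification

Classifies text against new labels without requiring training data for those labels

Optimized for topic classification

Fine-tuned specifically on the Yahoo Answers topic classification task

Template matching

Uses the hypothesis template "This text is about {}.", the same template used during fine-tuning, to improve classification accuracy

Model capabilities

Text classification

Zero-shot learning

Topic identification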

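Use cases

Content classification

Q&A topic classification

Classifies Yahoo Answers content by topic

F1 of 0.72 on seen labels and 0.68 on unseen labels

Social media content analysis

Identifies the topic category of social media posts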

🚀 bart-large-mnli-yahoo-answers

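This model takes facebook/bart-large-mnli and fine-tunes it on Yahoo Answers topic classification. It can be used to predict whether a topic label can be assigned to a given sequence, whether or not the label has been seen before.

You can try an interactive demo of this zero-shot technique with this model, as well as with the non-fine-tuned facebook/bart-large-mnli, here.

🚀 Quick start

This model was fine-tuned on topic classification and performs best at zero-shot topic classification. Use hypothesis_template="This text is about {}.", since this is the template used during fine-tuning.

For settings other than topic classification, you can use any model pre-trained on MNLI, such as facebook/bart-large-mnli or roberta-large-mnli, with the same code as shown below.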

💻 Usage examples

Basic usage

Call the model through the zero-shot-classification pipeline:

from transformers import pipeline
nlp = pipeline("zero-shot-classification", model="joeddav/bart-large-mnli-yahoo-answers")

sequence_to_classify = "Who are you voting for in 2020?"
candidate_labels = ["Europe", "public health", "politics", "elections"]
hypothesis_template = "This text is about {}."
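# multi_label=True scores each candidate label independently
# (older transformers releases named this argument multi_class)
nlp(sequence_to_classify, candidate_labels, multi_label=True, hypothesis_template=hypothesis_template)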
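To make the difference between independent and joint label scoring concrete, here is a minimal sketch of how per-label scores can be derived from NLI logits. The logit values are made up for illustration, and the helper names (`softmax`, `multi_label_scores`, `single_label_scores`) are hypothetical. The logit order [contradiction, neutral, entailment] follows the manual PyTorch example in this card.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def multi_label_scores(logits_per_label):
    """Score each label independently: softmax over (contradiction, entailment)."""
    return [softmax([row[0], row[2]])[1] for row in logits_per_label]

def single_label_scores(logits_per_label):
    """Score labels jointly: softmax of the entailment logits across all labels."""
    return softmax([row[2] for row in logits_per_label])

candidate_labels = ["Europe", "public health", "politics", "elections"]
hypothesis_template = "This text is about {}."
# One hypothesis per candidate label, built from the template.
hypotheses = [hypothesis_template.format(label) for label in candidate_labels]

# Made-up NLI logits, one row per hypothesis,
# ordered as [contradiction, neutral, entailment].
fake_logits = [
    [3.0, 0.5, -2.0],
    [2.5, 0.2, -1.5],
    [-2.0, 0.1, 3.0],
    [-2.5, 0.3, 3.5],
]

independent = multi_label_scores(fake_logits)
joint = single_label_scores(fake_logits)
for hypothesis, p_multi, p_single in zip(hypotheses, independent, joint):
    print(f"{hypothesis!r}: independent={p_multi:.3f}, joint={p_single:.3f}")
```

With independent scoring, several labels can receive a high score at once (here both "politics" and "elections"), whereas joint scoring forces the scores to sum to 1 across the candidate labels.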

Advanced usage

Call the model manually with PyTorch:

# pose sequence as a NLI premise and label as a hypothesis
import torch
from transformers import BartForSequenceClassification, BartTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
nli_model = BartForSequenceClassification.from_pretrained('joeddav/bart-large-mnli-yahoo-answers').to(device)
tokenizer = BartTokenizer.from_pretrained('joeddav/bart-large-mnli-yahoo-answers')

# example inputs; replace with your own sequence and candidate label
sequence = "Who are you voting for in 2020?"
label = "politics"
premise = sequence
hypothesis = f'This text is about {label}.'

# run through model pre-trained on MNLI
x = tokenizer.encode(premise, hypothesis, return_tensors='pt',
                     max_length=tokenizer.model_max_length,
                     truncation='only_first')
with torch.no_grad():
    logits = nli_model(x.to(device))[0]

# we throw away "neutral" (dim 1) and take the probability of
# "entailment" (2) as the probability of the label being true
entail_contradiction_logits = logits[:,[0,2]]
probs = entail_contradiction_logits.softmax(dim=1)
prob_label_is_true = probs[:,1]

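🔧 Technical details

The model is a pre-trained MNLI classifier further fine-tuned on Yahoo Answers topic classification in the manner originally described in Yin et al. 2019 and this blog post. That is, each sequence is fed to the pre-trained NLI model as the premise and each candidate label as the hypothesis, formatted like so: This text is about {class name}. For each example in the training set, one true-label hypothesis and one randomly selected false-label hypothesis are fed to the model, which must predict which labels are valid and which are false.

Since this method studies the ability to classify unseen labels after training on a different set of labels, the model is trained on only 5 of the 10 labels in Yahoo Answers. These are "Society & Culture", "Health", "Computers & Internet", "Business & Finance", and "Family & Relationships".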
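The pair construction described above can be sketched as follows. This is an illustration under stated assumptions, not the author's training code: the helper `build_nli_pairs`, the sample question, the binary targets (1 for the true label, 0 for the false one), and the choice to draw the false label from the five seen labels are all hypothetical.

```python
import random

HYPOTHESIS_TEMPLATE = "This text is about {}."

# The five labels the card lists as seen during fine-tuning.
SEEN_LABELS = [
    "Society & Culture",
    "Health",
    "Computers & Internet",
    "Business & Finance",
    "Family & Relationships",
]

def build_nli_pairs(text, true_label, labels, rng):
    """Return two (premise, hypothesis, target) triples for one training example.

    Target 1 marks the true-label hypothesis; target 0 marks a false-label
    hypothesis chosen at random from the remaining labels (an assumption).
    """
    false_label = rng.choice([label for label in labels if label != true_label])
    return [
        (text, HYPOTHESIS_TEMPLATE.format(true_label), 1),
        (text, HYPOTHESIS_TEMPLATE.format(false_label), 0),
    ]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
pairs = build_nli_pairs(
    "How do I reset my router password?",  # made-up example question
    "Computers & Internet",
    SEEN_LABELS,
    rng,
)
for premise, hypothesis, target in pairs:
    print(target, "|", premise, "|", hypothesis)
```

How the two targets map onto the NLI classes (for example, entailment versus contradiction) is not stated in this card, so the sketch leaves them as plain 0/1 flags.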

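📚 Detailed documentation

Evaluation results

This model was evaluated with the label-weighted F1 of the seen and unseen labels. That is, for each example the model must predict one of the 10 corpus labels. F1 is reported separately for the labels seen during training and for the labels unseen during training. We found F1 scores of 0.68 and 0.72 for the unseen and seen labels, respectively. To adjust for in-distribution versus out-of-distribution labels, we subtract a fixed amount of 30% from the normalized probabilities of the seen labels, as described in Yin et al. 2019 and our blog post.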
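The seen-label adjustment can be illustrated with a short sketch. It assumes that "normalized" means dividing each label's probability by the sum over all candidate labels and that the prediction is the label with the highest adjusted score. The helper names and the probability values are hypothetical, and "Sports" and "Politics & Government" stand in as illustrative unseen labels.

```python
# The five labels the card lists as seen during fine-tuning.
SEEN_LABELS = {
    "Society & Culture",
    "Health",
    "Computers & Internet",
    "Business & Finance",
    "Family & Relationships",
}
SEEN_PENALTY = 0.30  # fixed amount subtracted from seen labels, per the card

def adjust_for_seen_labels(probs, seen_labels, penalty=SEEN_PENALTY):
    """Normalize probabilities to sum to 1, then subtract the penalty from seen labels."""
    total = sum(probs.values())
    normalized = {label: p / total for label, p in probs.items()}
    return {
        label: p - penalty if label in seen_labels else p
        for label, p in normalized.items()
    }

def predict(probs, seen_labels):
    """Return the label with the highest adjusted score."""
    adjusted = adjust_for_seen_labels(probs, seen_labels)
    return max(adjusted, key=adjusted.get)

# Made-up per-label probabilities for a single example.
raw = {
    "Health": 0.9,
    "Sports": 0.7,
    "Business & Finance": 0.2,
    "Politics & Government": 0.2,
}
adjusted = adjust_for_seen_labels(raw, SEEN_LABELS)
print("unadjusted prediction:", max(raw, key=raw.get))
print("adjusted prediction:", predict(raw, SEEN_LABELS))
```

In this made-up case the penalty flips the prediction from a seen label to an unseen one, which is the bias correction the adjustment is meant to provide.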