xlm-roberta-large-it-mnli open-source model - Italian zero-shot text classification with multilingual support

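Xlm Roberta Large It Mnli

Developed by Jiva

An Italian zero-shot classification model fine-tuned from xlm-roberta-large, with support for multilingual text classification

Text classification

Transformers

Open-source license: MIT #Italian zero-shot classification #multilingual NLI #fine-tuned on machine-translated data

Downloads: 937

Released: 3/2/2022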

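Model overview

The model is fine-tuned on an Italian subset machine-translated from the MNLI corpus. It is intended for zero-shot classification of Italian text and can also be used to classify text in other languages.

Model features

Multilingual support

Built on XLM-RoBERTa-large, which was pre-trained on 100 languages, so it can also classify text in languages other than Italian

Zero-shot classification

Classifies text into new categories without domain-specific training

Multi-label classification

Can assign several relevant labels to the same text

Model capabilities

Italian text classification

Cross-lingual text classification

Multi-label classification

Natural language inference

Use cases

Text classification

Historical text classification

Classify history-related text and identify its topic

Distinguishes categories such as war and history

Geographic information classification

Classify geography-related text

Identifies geography-related content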

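🚀 XLM-roBERTa-large-it-mnli

This model is based on xlm-roberta-large and fine-tuned on a subset of NLI data taken from a machine-translated version of the MNLI corpus. It is mainly intended for zero-shot text classification and can classify text in Italian as well as other languages.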

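🚀 Quick start

Loading the model with the zero-shot classification pipeline

from transformers import pipeline
# device=0 selects the first GPU; use device=-1 to run on CPU
classifier = pipeline("zero-shot-classification",
                      model="Jiva/xlm-roberta-large-it-mnli", device=0, use_fast=True, multi_label=True)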

Classification example

# We will classify the following Wikipedia entry about Sardinia
sequence_to_classify = "La Sardegna è una regione italiana a statuto speciale di 1 592 730 abitanti con capoluogo Cagliari, la cui denominazione bilingue utilizzata nella comunicazione ufficiale è Regione Autonoma della Sardegna / Regione Autònoma de Sardigna."
# The candidate labels can be given in Italian
candidate_labels = ["geografia", "politica", "macchine", "cibo", "moda"]
classifier(sequence_to_classify, candidate_labels)
# {'labels': ['geografia', 'moda', 'politica', 'macchine', 'cibo'],
# 'scores': [0.38871392607688904, 0.22633370757102966, 0.19398456811904907, 0.13735772669315338, 0.13708525896072388]}
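The scores above need not sum to 1 (here they add up to roughly 1.08) because the pipeline was created with multi_label=True, which scores each label independently. The following is a minimal sketch of the two scoring modes, assuming the usual NLI output order of contradiction (0), neutral (1), entailment (2). The helper names and the toy logits are made up for illustration and are not taken from the model card.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def score_labels(nli_logits, multi_label):
    """Score candidate labels from per-label NLI logits.

    nli_logits holds one [contradiction, neutral, entailment] triple per
    candidate label (index order is an assumption, see the lead-in).
    """
    if multi_label:
        # Each label on its own: entailment vs. contradiction, neutral dropped.
        return [softmax([c, e])[1] for c, _, e in nli_logits]
    # Single-label: entailment logits compete across all candidate labels.
    return softmax([e for _, _, e in nli_logits])

# Toy logits for three hypothetical labels
toy_logits = [[-1.0, 0.0, 2.0], [0.5, 0.0, 0.5], [2.0, 0.0, -1.0]]
independent = score_labels(toy_logits, multi_label=True)
exclusive = score_labels(toy_logits, multi_label=False)
print(independent, sum(independent))
print(exclusive, sum(exclusive))
```

With multi_label=False the scores form a distribution over the labels, which suits tasks where exactly one label applies.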

Specifying a hypothesis template

sequence_to_classify = "La Sardegna è una regione italiana a statuto speciale di 1 592 730 abitanti con capoluogo Cagliari, la cui denominazione bilingue utilizzata nella comunicazione ufficiale è Regione Autonoma della Sardegna / Regione Autònoma de Sardigna."
candidate_labels = ["geografia", "politica", "macchine", "cibo", "moda"]
# Italian for "this is about {}"; the pipeline fills {} with each candidate label
hypothesis_template = "si parla di {}"
classifier(sequence_to_classify, candidate_labels, hypothesis_template=hypothesis_template)
# Scores as given in the source (the matching label order is not shown there):
# 'scores': [0.6068345904350281, 0.34715887904167175, 0.32433947920799255, 0.3068877160549164, 0.18744681775569916]}
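To make the role of the template concrete, the sketch below shows how a sequence and its candidate labels can be turned into premise/hypothesis pairs for an NLI model. The function name build_nli_pairs and the shortened example sentence are hypothetical; this illustrates the idea and is not the pipeline's actual code.

```python
def build_nli_pairs(sequence, candidate_labels, hypothesis_template="si parla di {}"):
    """Pair the sequence (premise) with one hypothesis per candidate label."""
    return [(sequence, hypothesis_template.format(label)) for label in candidate_labels]

# Shortened, illustrative premise and two of the candidate labels
pairs = build_nli_pairs("La Sardegna è una regione italiana.", ["geografia", "politica"])
for premise, hypothesis in pairs:
    print(f"{premise!r} -> {hypothesis!r}")
```

Each pair is then scored by the NLI model, and the entailment score becomes the score of the corresponding label.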

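Using PyTorch manually

# Pose the sequence as the NLI premise and the label as the hypothesis
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
nli_model = AutoModelForSequenceClassification.from_pretrained('Jiva/xlm-roberta-large-it-mnli').to(device)
tokenizer = AutoTokenizer.from_pretrained('Jiva/xlm-roberta-large-it-mnli')
premise = sequence_to_classify  # the text defined in the classification example
label = 'geografia'  # one candidate label
hypothesis = f'si parla di {label}.'
# Run the pair through the model fine-tuned on the translated MNLI data
x = tokenizer.encode(premise, hypothesis, return_tensors='pt',
                     truncation='only_first')
with torch.no_grad():
    logits = nli_model(x.to(device))[0]
# Drop "neutral" (dim 1) and take the probability of "entailment" (2) as the
# probability that the label is true. The index order is assumed here;
# confirm it with nli_model.config.id2label.
entail_contradiction_logits = logits[:, [0, 2]]
probs = entail_contradiction_logits.softmax(dim=1)
prob_label_is_true = probs[:, 1]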
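The manual example hard-codes indices 0 and 2 for contradiction and entailment, and the source does not state the label mapping of this checkpoint. As a hedge, the indices can be looked up by name in the model config's id2label dictionary. The helper below is a hypothetical sketch; the example mapping is an assumption for illustration, not the checkpoint's confirmed mapping.

```python
def nli_label_ids(id2label):
    """Return (contradiction_id, entailment_id) by matching label names."""
    contradiction_id = None
    entailment_id = None
    for idx, name in id2label.items():
        lowered = str(name).lower()
        if lowered.startswith("entail"):
            entailment_id = int(idx)
        elif lowered.startswith("contra"):
            contradiction_id = int(idx)
    if contradiction_id is None or entailment_id is None:
        raise ValueError("id2label has no recognizable entailment/contradiction labels")
    return contradiction_id, entailment_id

# Assumed mapping for illustration; in practice pass nli_model.config.id2label
example_id2label = {0: "contradiction", 1: "neutral", 2: "entailment"}
ids = nli_label_ids(example_id2label)
print(ids)
```

The returned pair can replace the literal [0, 2] when slicing the logits, which avoids silently reading the wrong column if a checkpoint orders its labels differently.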

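✨ Key features

Multilingual support: the underlying xlm-roberta-large model was pre-trained on 100 different languages, so the model also shows some effectiveness on zero-shot text classification in languages other than Italian.
Zero-shot classification: the model can classify text without a large labeled dataset for the specific task.
Fine-tuning: the model is fine-tuned on a subset of NLI data from a machine-translated version of the MNLI corpus, which improves its performance on Italian tasks.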

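📦 Installation

The source does not give specific installation steps. Follow the installation instructions for the Hugging Face Transformers library; the examples on this page also need PyTorch as the backend.

pip install transformers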

💻 Usage examples

Basic usage

from transformers import pipeline
# device=0 selects the first GPU; use device=-1 to run on CPU
classifier = pipeline("zero-shot-classification",
                      model="Jiva/xlm-roberta-large-it-mnli", device=0, use_fast=True, multi_label=True)
sequence_to_classify = "La Sardegna è una regione italiana a statuto speciale di 1 592 730 abitanti con capoluogo Cagliari, la cui denominazione bilingue utilizzata nella comunicazione ufficiale è Regione Autonoma della Sardegna / Regione Autònoma de Sardigna."
candidate_labels = ["geografia", "politica", "macchine", "cibo", "moda"]
# The result is a dict with 'labels' sorted by descending score and matching 'scores'
result = classifier(sequence_to_classify, candidate_labels)
print(result)
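Because multi-label scores are independent, a common follow-up step is to keep every label whose score clears a cutoff. The sketch below applies this to the output reported in the quick start. The function name labels_above and the 0.2 threshold are arbitrary choices for illustration, not recommendations from the model card.

```python
def labels_above(result, threshold=0.2):
    """Return the labels whose score is at least `threshold`."""
    return [
        label
        for label, score in zip(result["labels"], result["scores"])
        if score >= threshold
    ]

# Output reported in the quick start for the Sardinia example
sample_result = {
    "labels": ["geografia", "moda", "politica", "macchine", "cibo"],
    "scores": [0.38871392607688904, 0.22633370757102966, 0.19398456811904907,
               0.13735772669315338, 0.13708525896072388],
}
selected = labels_above(sample_result)
print(selected)
```

A suitable threshold depends on the data and the hypothesis template, so it is worth tuning on a few labeled examples.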

Advanced usage

# Specify a hypothesis template
sequence_to_classify = "La Sardegna è una regione italiana a statuto speciale di 1 592 730 abitanti con capoluogo Cagliari, la cui denominazione bilingue utilizzata nella comunicazione ufficiale è Regione Autonoma della Sardegna / Regione Autònoma de Sardigna."
candidate_labels = ["geografia", "politica", "macchine", "cibo", "moda"]
# The pipeline fills {} with each candidate label to build the NLI hypothesis
hypothesis_template = "si parla di {}"
result = classifier(sequence_to_classify, candidate_labels, hypothesis_template=hypothesis_template)
print(result)

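📚 Detailed documentation

Model description

The model takes xlm-roberta-large as its base and is fine-tuned on a subset of NLI data extracted from a machine-translated version of the MNLI corpus. It is intended for zero-shot text classification, for example with Hugging Face's ZeroShotClassificationPipeline.

Intended use

This model is intended for zero-shot classification of Italian text. Because the base model was pre-trained on 100 different languages, the model has also shown some effectiveness in languages other than Italian. For the full list of pre-training languages, see Appendix A of the XLM-RoBERTa paper. For English-only classification tasks, bart-large-mnli or a distilled BART MNLI model is recommended instead.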

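🔧 Technical details

Version 0.1

The model has now been retrained on the full training set. Roughly 1,000 sentence pairs were removed from the dataset because the translation model had mistranslated them.

Parameter or metric	Value
Learning rate	4e-6
Optimizer	AdamW
Batch size	80
MCC	0.77
Training loss	0.34
Evaluation loss	0.40
Stopping step	9754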
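MCC in the table refers to the Matthews correlation coefficient. The source does not say how it was computed, so the following is only a sketch of the standard multiclass formula applied to toy labels. The function name and the data are made up, and the toy result is unrelated to the reported 0.77.

```python
import math
from collections import Counter

def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient from label sequences."""
    s = len(y_true)                                       # total samples
    c = sum(1 for t, p in zip(y_true, y_pred) if t == p)  # correct predictions
    t_counts = Counter(y_true)                            # true occurrences per class
    p_counts = Counter(y_pred)                            # predictions per class
    classes = set(t_counts) | set(p_counts)
    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in classes)
    cov_true = s * s - sum(t_counts[k] ** 2 for k in classes)
    cov_pred = s * s - sum(p_counts[k] ** 2 for k in classes)
    denominator = math.sqrt(cov_true * cov_pred)
    # Conventionally 0.0 when one side is constant (zero denominator)
    return numerator / denominator if denominator else 0.0

# Toy three-class labels (0 = contradiction, 1 = neutral, 2 = entailment)
toy_true = [0, 1, 2, 0, 1, 2]
toy_pred = [0, 1, 2, 0, 2, 2]
toy_mcc = multiclass_mcc(toy_true, toy_pred)
print(round(toy_mcc, 3))
```

Unlike plain accuracy, MCC stays near 0 for a classifier that always predicts the same class, which makes it a stricter summary for three-way NLI.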

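Version 0.0

The base model was pre-trained on a dataset covering 100 languages, as described in the original paper. It was then fine-tuned for the NLI task on an Italian translation of the MNLI dataset (at that point only 85% of the training set had been used). The text was translated with Helsinki-NLP/opus-mt-en-it, with a maximum output sequence length of 120. The model was trained for 1 epoch with a learning rate of 4e-6 and a batch size of 80, and reached 82% accuracy on the remaining 15% of the training set.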
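The 85%/15% arrangement described for version 0.0 amounts to a hold-out evaluation. The source does not say how the split was drawn, so the sketch below assumes a seeded random shuffle. The function names, the seed, and the toy data are all hypothetical.

```python
import random

def holdout_split(examples, train_fraction=0.85, seed=0):
    """Shuffle a copy of the examples and split it into (train, held-out)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(gold, predicted):
    """Fraction of positions where the prediction equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# Toy stand-in for the translated sentence pairs
toy_examples = list(range(100))
train_part, heldout_part = holdout_split(toy_examples)
print(len(train_part), len(heldout_part))
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

Fixing the seed keeps the held-out portion identical across runs, so accuracy figures from different training runs remain comparable.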