Bloomz-3b-nli Open-Source Model: Free Natural Language Inference on Semantic Relations in English and French

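Developed by cmarkea

A natural language inference model fine-tuned from Bloomz-3b-chat-dpo that judges the semantic relation between sentences in English and French.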

Tags: large language model, Transformers, multilingual, license: OpenRAIL, zero-shot classification, multilingual inference, semantic relation recognition

Released: 11/28/2023

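Model Overview

The model targets natural language inference: it determines the logical relation between two sentences (entailment, contradiction, or neutral) and can also be used for zero-shot classification. It was trained in a language-agnostic way and accepts any combination of English and French inputs.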

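Model Features

Mixed bilingual inference

Accepts any combination of English and French inputs and stays accurate in cross-lingual settings (82.12% accuracy in the benchmark with a French hypothesis and an English premise)

Zero-shot classification

Classifies arbitrary text against a set of candidate labels without task-specific training, which suits open-domain use

Long-text understanding

Handles the semantic analysis of long, structurally complex texts better than conventional NLI models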

Model Capabilities

Natural language inference

Cross-lingual text classification

Semantic relation judgment

Zero-shot learning

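Use Cases

Sentiment analysis

Movie review sentiment classification

Labels movie reviews as positive or negative

Reaches 89.06% accuracy on the Allociné dataset

Content classification

Multilingual news classification

Assigns topics (politics, technology, sports, etc.) to mixed English and French news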

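🚀 Bloomz-3b-NLI Model

Bloomz-3b-NLI is trained on the natural language inference (NLI) task and is fine-tuned from the Bloomz-3b-chat-dpo foundation model. It is trained in a language-agnostic way, handles English and French text, and reaches 89.06% accuracy on zero-shot sentiment classification of Allociné reviews.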

🚀 Quick Start

The following example uses the transformers library to run zero-shot classification with Bloomz-3b-NLI:

from transformers import pipeline

# Load the zero-shot classification pipeline backed by the NLI model.
classifier = pipeline(
    task="zero-shot-classification",
    model="cmarkea/bloomz-3b-nli"
)

# French text with French candidate labels and hypothesis template.
result = classifier(
    sequences="Le style très cinéphile de Quentin Tarantino "
    "se reconnaît entre autres par sa narration postmoderne "
    "et non linéaire, ses dialogues travaillés souvent "
    "émaillés de références à la culture populaire, et ses "
    "scènes hautement esthétiques mais d'une violence "
    "extrême, inspirées de films d'exploitation, d'arts "
    "martiaux ou de western spaghetti.",
    candidate_labels="cinéma, technologie, littérature, politique",
    hypothesis_template="Ce texte parle de {}."
)

print(result)
# Reported output:
# {"labels": ["cinéma",
#             "littérature",
#             "technologie",
#             "politique"],
#  "scores": [0.8745610117912292,
#             0.10403601825237274,
#             0.014962797053158283,
#             0.0064402492716908455]}

# Robustness in a cross-lingual setting: English text, French labels and template.
result = classifier(
    sequences="Quentin Tarantino's very cinephile style is "
    "recognized, among other things, by his postmodern and "
    "non-linear narration, his elaborate dialogues often "
    "peppered with references to popular culture, and his "
    "highly aesthetic but extremely violent scenes, inspired by "
    "exploitation films, martial arts or spaghetti western.",
    candidate_labels="cinéma, technologie, littérature, politique",
    hypothesis_template="Ce texte parle de {}."
)

print(result)
# Reported output:
# {"labels": ["cinéma",
#             "littérature",
#             "technologie",
#             "politique"],
#  "scores": [0.9314399361610413,
#             0.04960821941494942,
#             0.013468802906572819,
#             0.005483036395162344]}

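✨ Key Features

Language agnosticism: the hypothesis and the premise are each drawn at random in English or French, so each of the four language combinations occurs with 25% probability.
Zero-shot classification: the model can classify any text without task-specific training.
Complex text handling: compared with models such as BERT, RoBERTa, or CamemBERT, it can model and extract information from more complex and longer text structures.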

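📚 Detailed Documentation

Model Introduction

Bloomz-3b-NLI is fine-tuned from the Bloomz-3b-chat-dpo foundation model for the natural language inference (NLI) task. NLI consists of determining the semantic relation between a hypothesis and a set of premises, usually expressed as pairs of sentences.

Language-Agnostic Approach

The hypothesis and the premise are each drawn at random in English or French, so each of the four language combinations occurs with 25% probability.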
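
As a rough illustration of this sampling scheme, the sketch below draws the premise language and the hypothesis language independently and uniformly, which gives each of the four combinations a 25% probability. The function name `sample_language_pair`, the record layout, and the example sentences are assumptions made for illustration; this is not the authors' training code.

```python
import random
from collections import Counter

LANGUAGES = ("en", "fr")

def sample_language_pair(example, rng):
    """Pick the premise and hypothesis languages independently and uniformly.

    `example` is a hypothetical parallel record that stores both translations
    of the premise and of the hypothesis.
    """
    premise_lang = rng.choice(LANGUAGES)
    hypothesis_lang = rng.choice(LANGUAGES)
    return {
        "premise": example["premise"][premise_lang],
        "hypothesis": example["hypothesis"][hypothesis_lang],
        "label": example["label"],
        "langs": (premise_lang, hypothesis_lang),
    }

# Made-up parallel example, only used to exercise the sampler.
example = {
    "premise": {
        "en": "A man is playing a guitar on stage.",
        "fr": "Un homme joue de la guitare sur scène.",
    },
    "hypothesis": {
        "en": "Someone is performing music.",
        "fr": "Quelqu'un joue de la musique.",
    },
    "label": "entailment",
}

rng = random.Random(0)
n_draws = 10_000
counts = Counter(
    sample_language_pair(example, rng)["langs"] for _ in range(n_draws)
)
# Each (premise language, hypothesis language) pair should be close to 0.25.
for pair, n in sorted(counts.items()):
    print(pair, round(n / n_draws, 3))
```

Drawing the two languages independently is one simple way to obtain four equally likely combinations; sampling one of the four pairs directly would be equivalent.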

Performance Evaluation

Natural Language Inference Task

Class	Precision (%)	F1 Score (%)	Support
Overall	81.96	81.07	5,010
Contradiction	81.80	84.04	1,670
Entailment	84.82	81.96	1,670
Neutral	76.85	77.20	1,670
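
Since the three classes have the same support, the overall F1 score can be checked against the mean of the per-class F1 scores. The short calculation below uses only the numbers from the table above and assumes that the overall row is such an average.

```python
# Per-class F1 scores (%) and supports copied from the table above.
per_class_f1 = {"contradiction": 84.04, "entailment": 81.96, "neutral": 77.20}
support = {"contradiction": 1670, "entailment": 1670, "neutral": 1670}

total_support = sum(support.values())

# Unweighted (macro) average over the three classes.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Support-weighted average; identical here because the classes are balanced.
weighted_f1 = sum(per_class_f1[c] * support[c] for c in per_class_f1) / total_support

print(total_support)          # 5010
print(round(macro_f1, 2))     # 81.07
print(round(weighted_f1, 2))  # 81.07
```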

Benchmarks

Hypothesis and premise both in French:

| Model | Accuracy (%) | MCC (x100) |
| ---- | ---- | ---- |
| cmarkea/distilcamembert-base-nli | 77.45 | 66.24 |
| BaptisteDoyen/camembert-base-xnli | 81.72 | 72.67 |
| MoritzLaurer/mDeBERTa-v3-base-mnli-xnli | 83.43 | 75.15 |
| cmarkea/bloomz-560m-nli | 68.70 | 53.57 |
| cmarkea/bloomz-3b-nli | 81.08 | 71.66 |
| cmarkea/bloomz-7b1-mt-nli | 83.13 | 74.89 |

Hypothesis in French, premise in English (cross-lingual setting):

| Model | Accuracy (%) | MCC (x100) |
| ---- | ---- | ---- |
| cmarkea/distilcamembert-base-nli | 16.89 | -26.82 |
| BaptisteDoyen/camembert-base-xnli | 74.59 | 61.97 |
| MoritzLaurer/mDeBERTa-v3-base-mnli-xnli | 85.15 | 77.74 |
| cmarkea/bloomz-560m-nli | 68.84 | 53.55 |
| cmarkea/bloomz-3b-nli | 82.12 | 73.22 |
| cmarkea/bloomz-7b1-mt-nli | 85.43 | 78.25 |
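
For readers less familiar with the second metric, here is a minimal pure-Python sketch of accuracy and of the multiclass Matthews correlation coefficient (MCC), scaled by 100 as in the benchmark columns. The helper names and the toy labels are illustrative only; this is not the authors' evaluation script, and in practice a library implementation such as scikit-learn's `matthews_corrcoef` would normally be preferred.

```python
from collections import Counter
from math import sqrt

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient, in [-1, 1]."""
    s = len(y_true)                                      # number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))      # correct predictions
    true_counts = Counter(y_true)                        # t_k: gold count per class
    pred_counts = Counter(y_pred)                        # p_k: predicted count per class
    labels = set(true_counts) | set(pred_counts)
    cross = sum(true_counts[k] * pred_counts[k] for k in labels)
    denom = sqrt(
        (s * s - sum(n * n for n in pred_counts.values()))
        * (s * s - sum(n * n for n in true_counts.values()))
    )
    # By convention, return 0 when the coefficient is undefined.
    return (c * s - cross) / denom if denom else 0.0

# Toy gold labels and predictions, invented for the example.
y_true = ["entailment", "neutral", "contradiction",
          "neutral", "entailment", "contradiction"]
y_pred = ["entailment", "neutral", "contradiction",
          "entailment", "entailment", "neutral"]

print(f"accuracy (%): {100 * accuracy(y_true, y_pred):.2f}")
print(f"MCC (x100):   {100 * mcc(y_true, y_pred):.2f}")
```

Unlike accuracy, MCC is 0 for a chance-level or constant predictor and can be negative, which is why a value such as -26.82 can appear next to a low accuracy.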

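Zero-Shot Classification Task

The zero-shot classification task can be summarized as:

$$P(hypothesis=i\in\mathcal{C}|premise)=\frac{e^{P(premise=entailment\vert hypothesis=i)}}{\sum_{j\in\mathcal{C}}e^{P(premise=entailment\vert hypothesis=j)}}$$

Here i denotes a hypothesis built from a template (for example, "This text is about {}.") and one of the #C candidate labels ("cinema", "politics", etc.), #C being the size of the label set C. The hypothesis set is therefore {"This text is about cinema.", "This text is about politics.", ...}. Each hypothesis is scored against the premise, that is, the sentence to classify.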
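
The formula can be sketched in a few lines of Python: build one hypothesis per candidate label from the template, then normalize the exponentiated entailment scores over the label set. The entailment probabilities below are hypothetical placeholders standing in for the NLI model's outputs, and the helper names are illustrative.

```python
from math import exp

def build_hypotheses(template, labels):
    """Fill the template once per candidate label."""
    return [template.format(label) for label in labels]

def zero_shot_distribution(entailment_scores):
    """Softmax over the per-hypothesis entailment scores, as in the formula."""
    weights = [exp(score) for score in entailment_scores]
    total = sum(weights)
    return [w / total for w in weights]

labels = ["cinema", "politics", "technology"]
hypotheses = build_hypotheses("This text is about {}.", labels)

# Hypothetical P(entailment) for each hypothesis against a single premise.
entailment_probs = [0.95, 0.10, 0.05]
scores = zero_shot_distribution(entailment_probs)

for hypothesis, score in zip(hypotheses, scores):
    print(f"{hypothesis!r}: {score:.3f}")

best_label = labels[scores.index(max(scores))]
print(best_label)  # cinema
```

Because probabilities lie in [0, 1], applying the exponential to them gives a fairly flat distribution. The transformers zero-shot pipeline applies the same normalization to the entailment logits instead, which produces sharper scores.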

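Zero-Shot Classification Performance

The model is evaluated on sentiment analysis with reviews from Allociné, a French movie review website. The dataset contains 20,000 reviews labeled with two classes, positive and negative. The hypothesis template is "Ce commentaire est {}." and the candidate classes are "positif" and "negatif".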

Model	Accuracy (%)	MCC (x100)
cmarkea/distilcamembert-base-nli	80.59	63.71
BaptisteDoyen/camembert-base-xnli	86.37	73.74
MoritzLaurer/mDeBERTa-v3-base-mnli-xnli	84.97	70.05
cmarkea/bloomz-560m-nli	71.13	46.30
cmarkea/bloomz-3b-nli	89.06	78.10
cmarkea/bloomz-7b1-mt-nli	95.12	90.27
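
An evaluation of this kind could be set up roughly as follows. This is a sketch under stated assumptions rather than the authors' script: `evaluate`, `top_label`, `load_classifier`, and `keyword_stub` are hypothetical helpers, the two reviews are invented, and the stub classifier only stands in for the real pipeline so that the loop can be run without downloading the model.

```python
CANDIDATE_LABELS = ["positif", "negatif"]
HYPOTHESIS_TEMPLATE = "Ce commentaire est {}."

def top_label(result):
    """Zero-shot pipeline results list labels by decreasing score."""
    return result["labels"][0]

def evaluate(classify, reviews, gold_labels):
    """Accuracy of a zero-shot `classify` callable on labeled reviews."""
    correct = 0
    for text, gold in zip(reviews, gold_labels):
        result = classify(
            text,
            candidate_labels=CANDIDATE_LABELS,
            hypothesis_template=HYPOTHESIS_TEMPLATE,
        )
        correct += top_label(result) == gold
    return correct / len(reviews)

def load_classifier():
    """Build the real pipeline (downloads the model on first use)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import pipeline
    return pipeline(task="zero-shot-classification",
                    model="cmarkea/bloomz-3b-nli")

def keyword_stub(text, candidate_labels, hypothesis_template):
    """Toy stand-in that mimics the shape of a pipeline result."""
    best = "positif" if "excellent" in text.lower() else "negatif"
    ordered = [best] + [label for label in candidate_labels if label != best]
    return {"sequence": text, "labels": ordered, "scores": [0.9, 0.1]}

# Invented reviews, used only for a dry run of the evaluation loop.
reviews = ["Un film excellent, à voir absolument.",
           "Scénario creux et interminable."]
gold = ["positif", "negatif"]

print(evaluate(keyword_stub, reviews, gold))
```

To score the actual model, `keyword_stub` would be replaced with the object returned by `load_classifier()`, and the toy reviews with the Allociné test set.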

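🔧 Technical Details

Natural Language Inference Task

The goal is to predict textual entailment: does sentence A entail, contradict, or stand neutral with respect to sentence B? This is a classification task in which, given two sentences, the model predicts one of three labels. If sentence A is called the premise and sentence B the hypothesis, the model estimates the following probability:

$$P(premise=c\in\{contradiction, entailment, neutral\}\vert hypothesis)$$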
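
To make the three-way formulation concrete, the sketch below turns the three class scores of a sentence-pair classifier into a probability distribution and picks the most likely relation. The logits and the label order are hypothetical; the actual index-to-label mapping should be read from the model configuration (`id2label`).

```python
from math import exp

# Assumed label order, for illustration only.
NLI_LABELS = ("contradiction", "entailment", "neutral")

def nli_distribution(logits):
    """Numerically stable softmax over the three NLI class logits."""
    shift = max(logits)
    exps = [exp(x - shift) for x in logits]
    total = sum(exps)
    return dict(zip(NLI_LABELS, (e / total for e in exps)))

# Hypothetical logits for one (premise, hypothesis) pair.
probs = nli_distribution([-1.2, 2.3, 0.4])
prediction = max(probs, key=probs.get)

print(prediction)  # entailment
```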

📄 License

This model is released under the bigscience-bloom-rail-1.0 license.

📖 Citation

@online{DeBloomzNLI,
  AUTHOR = {Cyrile Delestre},
  URL = {https://huggingface.co/cmarkea/bloomz-3b-nli},
  YEAR = {2024},
  KEYWORDS = {NLP ; Transformers ; LLM ; Bloomz},
}

📋 Model Information

Property	Details
Model type	Bloomz-3b-NLI
Training data	xnli
Base model	cmarkea/bloomz-3b-dpo-chat
Supported languages	French, English
Task type	Zero-shot classification
License	bigscience-bloom-rail-1.0