distilcamembert-base-nli: an open-source French NLI model with a lightweight design and roughly half the inference time of CamemBERT

Distilcamembert Base Nli

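Developed by cmarkea

A lightweight model fine-tuned from DistilCamemBERT for French natural language inference. Its inference time is roughly half that of the original CamemBERT.

Text classification

Transformers

License: MIT #FrenchNLI #ZeroShotClassification #LightweightInference

Downloads: 6,327

Released: 3/2/2022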

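Model overview

The model performs natural language inference (NLI) in French: given two sentences, it decides whether they stand in an entailment, contradiction, or neutral relation. Distillation shrinks the model, cutting inference time substantially while keeping accuracy reasonably high.

Model features

Efficient inference

Inference takes about half the time of the original CamemBERT model, which makes it suitable for production deployment.

Zero-shot classification

Classifies text without fine-tuning, with custom labels and hypothesis templates.

Multiple use cases

Performs well on tasks such as movie-review sentiment analysis and news classification.

Model capabilities

Natural language inference

Text classification

Zero-shot learning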

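Use cases

Text analysis

Movie-review sentiment analysis

Uses zero-shot classification to determine the sentiment (positive/negative) of movie reviews.

Reaches 80.59% accuracy on the allocine dataset.

News classification

Assigns a topic (economy/politics/sport/science) to news summaries.

Reaches 79.30% accuracy on the mlsum dataset.

Semantic understanding

Textual entailment

Determines the logical relation (entailment/contradiction/neutral) between two sentences.

Reaches a 77.45% F1 score on the XNLI test set.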

🚀 DistilCamemBERT-NLI

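DistilCamemBERT-NLI is a DistilCamemBERT model fine-tuned for natural language inference (NLI) in French, a task also known as recognizing textual entailment (RTE). It is trained on the XNLI dataset and determines whether a premise entails a hypothesis, contradicts it, or is neutral toward it.

🚀 Quick Start

Load the pipeline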

from transformers import pipeline

classifier = pipeline(
    task='zero-shot-classification',
    model="cmarkea/distilcamembert-base-nli",
    tokenizer="cmarkea/distilcamembert-base-nli"
)

Example

result = classifier(
    sequences="Le style très cinéphile de Quentin Tarantino "
    "se reconnaît entre autres par sa narration postmoderne "
    "et non linéaire, ses dialogues travaillés souvent "
    "émaillés de références à la culture populaire, et ses "
    "scènes hautement esthétiques mais d'une violence "
    "extrême, inspirées de films d'exploitation, d'arts "
    "martiaux ou de western spaghetti.",
    candidate_labels="cinéma, technologie, littérature, politique",
    hypothesis_template="Ce texte parle de {}."
)

result
{"labels": ["cinéma",
            "littérature",
            "technologie",
            "politique"],
 "scores": [0.7164115309715271,
            0.12878799438476562,
            0.1092301607131958,
            0.0455702543258667]}
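
The output above lists labels in descending order of score. As a minimal sketch of how such a result might be consumed, the snippet below picks the top label and rejects it when its score falls under a threshold. The helper name `top_label` and the 0.5 threshold are illustrative assumptions, and the dictionary is copied (with rounded scores) from the output above so that no model download is needed.

```python
from typing import Optional

# Result copied from the example output above, scores rounded to 4 decimals.
result = {
    "labels": ["cinéma", "littérature", "technologie", "politique"],
    "scores": [0.7164, 0.1288, 0.1092, 0.0456],
}


def top_label(result: dict, threshold: float = 0.5) -> Optional[str]:
    """Return the best-scoring label, or None if its score is below threshold.

    `top_label` and `threshold` are hypothetical names chosen for this sketch.
    """
    best_score, best_label = max(zip(result["scores"], result["labels"]))
    return best_label if best_score >= threshold else None


print(top_label(result))
print(top_label(result, threshold=0.9))
```

Returning `None` for low-confidence predictions lets the caller route uncertain texts to a fallback, such as manual review.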

Optimum + ONNX

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

HUB_MODEL = "cmarkea/distilcamembert-base-nli"

# Load the tokenizer and the ONNX model, then wrap them in a zero-shot pipeline
tokenizer = AutoTokenizer.from_pretrained(HUB_MODEL)
model = ORTModelForSequenceClassification.from_pretrained(HUB_MODEL)
onnx_classifier = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)

# Quantized ONNX model
quantized_model = ORTModelForSequenceClassification.from_pretrained(
    HUB_MODEL, file_name="model_quantized.onnx"
)

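✨ Key Features

Efficient inference: thanks to DistilCamemBERT, inference time is halved at the same power consumption.
Zero-shot classification: classifies text without any training.

📦 Installation

Using the model requires the transformers library:

pip install transformers

The ONNX-optimized version also requires the optimum library:

pip install "optimum[onnxruntime]"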

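📚 Detailed Documentation

Dataset

The XNLI dataset from FLUE contains 392,702 premise-hypothesis pairs for training and 5,010 pairs for testing. The goal is to predict textual entailment (does sentence A entail, contradict, or stand neutral to sentence B?), which is a classification task: given two sentences, predict one of three labels. Sentence A is called the premise and sentence B the hypothesis, and the modeling objective is: $$P(premise=c\in\{contradiction, entailment, neutral\}\vert hypothesis)$$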
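
As a rough sketch of this three-way classification, the snippet below turns three raw scores (logits) into a probability distribution with a softmax and picks the most likely label. The logit values and the label order are made-up assumptions for illustration only; the actual label order should be read from the model's configuration.

```python
import math

# Assumed label order, for illustration only.
LABELS = ["contradiction", "entailment", "neutral"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pair(logits):
    """Map three logits to (predicted label, per-label probabilities)."""
    probs = softmax(logits)
    best = max(range(len(LABELS)), key=probs.__getitem__)
    return LABELS[best], dict(zip(LABELS, probs))


# Hypothetical logits for one premise/hypothesis pair.
label, probs = classify_pair([-1.2, 2.3, 0.4])
print(label, probs)
```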

Evaluation results

Class	Precision (%)	F1 score (%)	Support
Overall	77.70	77.45	5,010
Contradiction	78.00	79.54	1,670
Entailment	82.90	78.87	1,670
Neutral	72.18	74.04	1,670
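
Per-class metrics of this kind, such as precision, recall, and F1, are derived from the gold and predicted labels. The sketch below computes them in plain Python on a tiny made-up sample; the labels and predictions are invented for illustration and are not XNLI data.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall, and F1 for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up gold labels and predictions (not taken from XNLI).
y_true = ["entailment", "entailment", "neutral",
          "contradiction", "neutral", "contradiction"]
y_pred = ["entailment", "neutral", "neutral",
          "contradiction", "entailment", "contradiction"]

for label in ("contradiction", "entailment", "neutral"):
    print(label, precision_recall_f1(y_true, y_pred, label))
```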

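Benchmark

The DistilCamemBERT model is compared with two other models that handle French. The first, BaptisteDoyen/camembert-base-xnli, is based on CamemBERT; the second, MoritzLaurer/mDeBERTa-v3-base-mnli-xnli, is based on mDeBERTav3. Performance is compared using accuracy and MCC (Matthews correlation coefficient). Mean inference time was measured on an AMD Ryzen 5 4500U @ 2.3GHz with 6 cores.

Model	Time (ms)	Accuracy (%)	MCC (x100)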
cmarkea/distilcamembert-base-nli	51.35	77.45	66.24
BaptisteDoyen/camembert-base-xnli	105.0	81.72	72.67
MoritzLaurer/mDeBERTa-v3-base-mnli-xnli	299.18	83.43	75.15
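
For readers unfamiliar with the MCC column, the sketch below computes the multiclass Matthews correlation coefficient in plain Python, using the standard generalization of the binary formula to several classes. The sample labels are made up for illustration; the function name `multiclass_mcc` is an assumption of this sketch, not part of any library.

```python
import math
from collections import Counter


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient, in [-1, 1]."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)   # how often each class truly occurs
    pred_counts = Counter(y_pred)   # how often each class is predicted
    labels = set(true_counts) | set(pred_counts)
    cov_tp = correct * n - sum(true_counts[k] * pred_counts[k] for k in labels)
    cov_tt = n * n - sum(c * c for c in true_counts.values())
    cov_pp = n * n - sum(c * c for c in pred_counts.values())
    denom = math.sqrt(cov_tt * cov_pp)
    return cov_tp / denom if denom else 0.0


# Made-up gold labels and predictions (not taken from XNLI).
y_true = ["entailment", "entailment", "neutral",
          "contradiction", "neutral", "contradiction"]
y_pred = ["entailment", "neutral", "neutral",
          "contradiction", "entailment", "contradiction"]

mcc = multiclass_mcc(y_true, y_pred)
print(round(100 * mcc, 2))  # scaled by 100, like the MCC column
```

Unlike accuracy, MCC accounts for how predictions are distributed over the classes, so a model that always predicts one class scores 0 rather than the majority-class rate.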

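Zero-shot classification

The main advantage of such a model is that it yields a zero-shot classifier, which classifies text without any training. The task can be summarized as: $$P(hypothesis=i\in\mathcal{C}\vert premise)=\frac{e^{P(premise=entailment\vert hypothesis=i)}}{\sum_{j\in\mathcal{C}}e^{P(premise=entailment\vert hypothesis=j)}}$$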
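
The formula can be read as a softmax over the per-label entailment probabilities. The sketch below implements it exactly as written, on hypothetical entailment probabilities; the values and the function name `zero_shot_scores` are assumptions for illustration, and library implementations may normalize raw entailment scores rather than probabilities.

```python
import math


def zero_shot_scores(entailment_probs):
    """Softmax over per-label entailment probabilities, as in the formula."""
    exps = {label: math.exp(p) for label, p in entailment_probs.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical values of P(premise = entailment | hypothesis = label).
entailment_probs = {
    "cinéma": 0.95,
    "littérature": 0.30,
    "technologie": 0.10,
    "politique": 0.05,
}

scores = zero_shot_scores(entailment_probs)
best = max(scores, key=scores.get)
print(best, scores)
```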

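Allocine dataset

The allocine dataset, normally used to train sentiment analysis models, serves here as a zero-shot benchmark. It has two classes: "positive" and "negative" movie reviews. The hypothesis template is "Ce commentaire est {}." and the candidate labels are "positif" and "négatif".

Model	Time (ms)	Accuracy (%)	MCC (x100)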
cmarkea/distilcamembert-base-nli	195.54	80.59	63.71
BaptisteDoyen/camembert-base-xnli	378.39	86.37	73.74
MoritzLaurer/mDeBERTa-v3-base-mnli-xnli	520.58	84.97	70.05
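
To make the role of the hypothesis template concrete, the sketch below expands a template into one premise/hypothesis pair per candidate label, which is the input an NLI model scores in zero-shot classification. The sample review, the French labels "positif" and "négatif", and the helper name `build_nli_pairs` are assumptions for illustration.

```python
def build_nli_pairs(text, candidate_labels, hypothesis_template):
    """Pair the text (premise) with one filled-in hypothesis per label."""
    return [(text, hypothesis_template.format(label))
            for label in candidate_labels]


pairs = build_nli_pairs(
    "Un film magnifique, je le recommande.",  # made-up review
    ["positif", "négatif"],                   # assumed French labels
    "Ce commentaire est {}.",
)

for premise, hypothesis in pairs:
    print(premise, "->", hypothesis)
```

Each pair is scored for entailment, and the label whose hypothesis is most strongly entailed wins.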

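MLSum dataset

The mlsum dataset, normally used to train summarization models, serves here as a zero-shot benchmark. Sub-topics are aggregated and a few of them selected, and the summary part of each article is used to predict its topic. The hypothesis template is "C'est un article traitant de {}." and the candidate labels are "économie", "politique", "sport", and "science".

Model	Time (ms)	Accuracy (%)	MCC (x100)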
cmarkea/distilcamembert-base-nli	217.77	79.30	70.55
BaptisteDoyen/camembert-base-xnli	448.27	70.7	64.10
MoritzLaurer/mDeBERTa-v3-base-mnli-xnli	591.34	64.45	58.67
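
The timing columns of the three benchmark tables can be compared directly. The sketch below copies the reported mean inference times and computes how much longer each competing model takes relative to cmarkea/distilcamembert-base-nli; the dictionary layout and the helper name `slowdown_vs_baseline` are assumptions of this sketch.

```python
# Mean inference times in ms, copied from the three benchmark tables.
TIMES_MS = {
    "XNLI": {
        "cmarkea/distilcamembert-base-nli": 51.35,
        "BaptisteDoyen/camembert-base-xnli": 105.0,
        "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli": 299.18,
    },
    "Allocine": {
        "cmarkea/distilcamembert-base-nli": 195.54,
        "BaptisteDoyen/camembert-base-xnli": 378.39,
        "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli": 520.58,
    },
    "MLSum": {
        "cmarkea/distilcamembert-base-nli": 217.77,
        "BaptisteDoyen/camembert-base-xnli": 448.27,
        "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli": 591.34,
    },
}
BASELINE = "cmarkea/distilcamembert-base-nli"


def slowdown_vs_baseline(times):
    """Ratio of each model's time to the baseline model's time."""
    base = times[BASELINE]
    return {m: t / base for m, t in times.items() if m != BASELINE}


ratios = {dataset: slowdown_vs_baseline(t) for dataset, t in TIMES_MS.items()}
for dataset, per_model in ratios.items():
    for model, ratio in per_model.items():
        print(f"{dataset}: {model} takes {ratio:.2f}x as long")
```

On all three datasets the CamemBERT-based model takes roughly twice as long as the distilled one, which is consistent with the halved inference time stated in the feature list.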

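🔧 Technical Details

The model is fine-tuned from DistilCamemBERT, whose reduced parameter count and compute cost roughly halve inference time at the price of a few points of XNLI accuracy (77.45% versus 81.72% for BaptisteDoyen/camembert-base-xnli in the benchmark above). For zero-shot classification, it relies on its NLI ability: each candidate label is turned into a hypothesis, and the entailment probability between the text (the premise) and that hypothesis is used to rank the labels.

📄 License

This project is released under the MIT license.

📖 Citation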

@inproceedings{delestre:hal-03674695,
  TITLE = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
  AUTHOR = {Delestre, Cyrile and Amar, Abibatou},
  URL = {https://hal.archives-ouvertes.fr/hal-03674695},
  BOOKTITLE = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
  ADDRESS = {Vannes, France},
  YEAR = {2022},
  MONTH = Jul,
  KEYWORDS = {NLP ; Transformers ; CamemBERT ; Distillation},
  PDF = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
  HAL_ID = {hal-03674695},
  HAL_VERSION = {v1},
}