distilcamembert-base-nli open-source French inference model - lightweight design with inference time roughly halved

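Distilcamembert Base Nli

Developed by cmarkea

A lightweight model based on DistilCamemBERT and fine-tuned for French natural language inference; its inference time is about half that of the original CamemBERT

Text classification

Transformers

License: MIT #FrenchNLI #ZeroShotClassification #LightweightInference

Downloads: 6,327

Released: 3/2/2022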

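Model overview

This model performs French natural language inference (NLI): it decides whether the relation between two sentences is entailment, contradiction, or neutral. Distillation shrinks the model, which keeps accuracy reasonably high while substantially improving inference efficiency.

Model features

Efficient inference

Inference time is about half that of the original CamemBERT model, which suits production deployment

Zero-shot classification

Performs text classification without fine-tuning, with support for custom labels and templates

Multiple use cases

Performs well on tasks such as movie-review sentiment analysis and news classification

Model capabilities

Natural language inference

Text classification

Zero-shot learning

Use cases

Text analysis

Movie-review sentiment analysis

Uses zero-shot classification to determine the sentiment of a movie review (positive/negative)

Reaches 80.59% accuracy on the allocine dataset

News classification

Classifies news summaries by topic (economy/politics/sport/science)

Reaches 79.30% accuracy on the mlsum dataset

Semantic understanding

Textual entailment

Analyzes the logical relation between two sentences (entailment/contradiction/neutral)

Reaches a 77.45% F1 score on the XNLI test set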

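🚀 DistilCamemBERT-NLI

DistilCamemBERT-NLI is DistilCamemBERT fine-tuned for the French natural language inference (NLI) task, also known as recognizing textual entailment (RTE). The model is trained on the XNLI dataset and determines whether a premise entails, contradicts, or is neutral with respect to a hypothesis.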

🚀 Quick start

Load the pipeline

from transformers import pipeline

# Build a zero-shot classifier on top of the NLI model
classifier = pipeline(
    task="zero-shot-classification",
    model="cmarkea/distilcamembert-base-nli",
    tokenizer="cmarkea/distilcamembert-base-nli"
)

Example code

result = classifier(
    sequences="Le style très cinéphile de Quentin Tarantino "
    "se reconnaît entre autres par sa narration postmoderne "
    "et non linéaire, ses dialogues travaillés souvent "
    "émaillés de références à la culture populaire, et ses "
    "scènes hautement esthétiques mais d'une violence "
    "extrême, inspirées de films d'exploitation, d'arts "
    "martiaux ou de western spaghetti.",
    # Labels may be passed as one comma-separated string or as a list
    candidate_labels="cinéma, technologie, littérature, politique",
    hypothesis_template="Ce texte parle de {}."
)

# Output (the returned "sequence" key is omitted here)
result
{"labels": ["cinéma",
            "littérature",
            "technologie",
            "politique"],
 "scores": [0.7164115309715271,
            0.12878799438476562,
            0.1092301607131958,
            0.0455702543258667]}
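The result pairs each label with a score, ordered from most to least likely. As a small illustration, the sketch below copies the values printed above into a literal (so no model download is needed) and reads off the top label; the variable names are illustrative.

```python
# Values copied from the example output above.
result = {
    "labels": ["cinéma", "littérature", "technologie", "politique"],
    "scores": [0.7164115309715271, 0.12878799438476562,
               0.1092301607131958, 0.0455702543258667],
}

# The two lists are index-aligned, so zip pairs each label with its score.
ranked = list(zip(result["labels"], result["scores"]))
top_label, top_score = ranked[0]

# In the default single-label mode the scores sum to (approximately) 1.
total = sum(result["scores"])

print(f"{top_label}: {top_score:.1%}")  # prints: cinéma: 71.6%
```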

Optimum + ONNX

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

HUB_MODEL = "cmarkea/distilcamembert-base-nli"

tokenizer = AutoTokenizer.from_pretrained(HUB_MODEL)
model = ORTModelForSequenceClassification.from_pretrained(HUB_MODEL)
# Zero-shot classification pipeline backed by the ONNX model
onnx_qa = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)

# Quantized ONNX model
quantized_model = ORTModelForSequenceClassification.from_pretrained(
    HUB_MODEL, file_name="model_quantized.onnx"
)
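To see whether the ONNX or quantized variant is actually faster on a given machine, it can help to time the pipelines directly. The helper below is a generic sketch and not part of transformers or optimum: `mean_latency_ms` and its parameters are names invented here. It accepts any callable, so a stand-in workload is used to keep the example runnable without a model.

```python
import time
from statistics import mean


def mean_latency_ms(fn, *args, repeats=10, warmup=2, **kwargs):
    """Return the mean wall-clock time of fn(*args, **kwargs) in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not measured
        fn(*args, **kwargs)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)


# Stand-in workload so the sketch runs without downloading a model.
latency = mean_latency_ms(sum, range(10_000), repeats=5)
print(f"{latency:.3f} ms")
```

With the pipelines above loaded, the same helper could be called as `mean_latency_ms(onnx_qa, text, candidate_labels=labels, hypothesis_template=template)`, where `text`, `labels` and `template` are whatever inputs are being benchmarked.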

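✨ Key features

Efficient inference: thanks to DistilCamemBERT, inference time is halved for the same power consumption.
Zero-shot classification: classifies text without any training.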

📦 Installation

To use the model, install the transformers library:

pip install transformers

To use the ONNX-optimized version, also install the optimum library:

pip install "optimum[onnxruntime]"
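After installing, a quick check that the packages can be found may save a confusing import error later. This is a small sketch using only the standard library; `is_installed` is a helper name invented here, and the package list assumes that the `onnxruntime` extra installs a top-level `onnxruntime` module.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if Python can locate the named module."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised for dotted names whose parent package is missing or malformed.
        return False


status = {name: is_installed(name)
          for name in ("transformers", "optimum", "onnxruntime")}
print(status)
```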

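📚 Detailed documentation

Dataset

The XNLI dataset from FLUE contains 392,702 premise-hypothesis pairs for training and 5,010 pairs for testing. The goal is to predict textual entailment (does sentence A entail, contradict, or stand neutral to sentence B?), which is a classification task: given two sentences, predict one of three labels. Sentence A is called the premise and sentence B the hypothesis, so the modeling objective is: $$P(premise=c\in\{contradiction, entailment, neutral\}\vert hypothesis)$$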
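Beyond the zero-shot pipeline, the model can be queried directly as a three-way classifier on a premise/hypothesis pair. The following is a minimal sketch, assuming the standard `AutoTokenizer` / `AutoModelForSequenceClassification` loading path and reading the label names from `model.config.id2label` rather than hard-coding their order; the helper names are illustrative.

```python
def label_from_logits(logits, id2label):
    """Map one row of class logits to the name of the highest-scoring class."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best]


def predict_relation(premise, hypothesis,
                     model_name="cmarkea/distilcamembert-base-nli"):
    """Classify a premise/hypothesis pair (downloads the model on first use)."""
    # Imported lazily so the pure helper above works without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    # Passing two texts encodes them together as a single sentence pair.
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return label_from_logits(logits, model.config.id2label)
```

A call such as `predict_relation("Le chat dort sur le canapé.", "Un animal se repose.")` would return one of the three label names stored in the model configuration.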

Evaluation results

Class	Precision (%)	F1 score (%)	Support
Overall	77.70	77.45	5,010
Contradiction	78.00	79.54	1,670
Entailment	82.90	78.87	1,670
Neutral	72.18	74.04	1,670
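Per-class recall can be recovered from the two numeric columns, assuming the first one is per-class precision P: from F1 = 2PR / (P + R) it follows that R = F1·P / (2P − F1). The calculation below is a worked check, not part of the original evaluation. Because the three classes have equal support, the mean of the recalls equals overall accuracy, and it comes out at about 77.45%, the accuracy reported for this model on XNLI elsewhere on this page.

```python
# Per-class precision and F1 (in %) taken from the table above.
rows = {
    "contradiction": (78.00, 79.54),
    "entailment": (82.90, 78.87),
    "neutral": (72.18, 74.04),
}


def recall_from(precision, f1):
    """Solve F1 = 2PR / (P + R) for R."""
    return f1 * precision / (2 * precision - f1)


recalls = {name: recall_from(p, f1) for name, (p, f1) in rows.items()}
# Every class has 1,670 samples, so the plain mean of recalls is the accuracy.
accuracy = sum(recalls.values()) / len(recalls)

for name, r in recalls.items():
    print(f"{name}: recall = {r:.2f}%")
print(f"accuracy = {accuracy:.2f}%")
```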

Benchmark

The DistilCamemBERT model is compared with two other models that handle French. The first, BaptisteDoyen/camembert-base-xnli, is based on CamemBERT; the second, MoritzLaurer/mDeBERTa-v3-base-mnli-xnli, is based on mDeBERTav3. Performance is compared using accuracy and MCC (Matthews correlation coefficient). Mean inference time was measured on an AMD Ryzen 5 4500U @ 2.3 GHz (6 cores).

Model	Time (ms)	Accuracy (%)	MCC (x100)
cmarkea/distilcamembert-base-nli	51.35	77.45	66.24
BaptisteDoyen/camembert-base-xnli	105.0	81.72	72.67
MoritzLaurer/mDeBERTa-v3-base-mnli-xnli	299.18	83.43	75.15
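The timings make the speed claim concrete. The arithmetic below uses only the figures in the table: the distilled model takes a little under half the time of the CamemBERT-based model (a speed-up of about 2x) while giving up roughly 4.3 accuracy points.

```python
# (time in ms, accuracy in %) from the benchmark table above.
benchmark = {
    "cmarkea/distilcamembert-base-nli": (51.35, 77.45),
    "BaptisteDoyen/camembert-base-xnli": (105.0, 81.72),
    "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli": (299.18, 83.43),
}

base_time, base_acc = benchmark["cmarkea/distilcamembert-base-nli"]
comparison = {}
for name, (time_ms, acc) in benchmark.items():
    comparison[name] = {
        "speedup": time_ms / base_time,         # how many times faster the distilled model is
        "time_saved": 1 - base_time / time_ms,  # fraction of that model's latency removed
        "accuracy_gap": acc - base_acc,         # accuracy points given up by the distilled model
    }

for name, stats in comparison.items():
    print(f"{name}: {stats['speedup']:.2f}x, "
          f"{stats['time_saved']:.1%} less time, "
          f"{stats['accuracy_gap']:+.2f} points")
```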

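Zero-shot classification

The main advantage of this kind of model is that it can serve as a zero-shot classifier, performing text classification without any training. For a set of candidate labels $\mathcal{C}$, the task can be summarized as: $$P(hypothesis=i\in\mathcal{C}\vert premise)=\frac{e^{P(premise=entailment\vert hypothesis=i)}}{\sum_{j\in\mathcal{C}}e^{P(premise=entailment\vert hypothesis=j)}}$$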
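In code, this amounts to a softmax over one entailment score per candidate label. The sketch below is illustrative only: the scores are invented, `zero_shot_scores` is a name made up here, and feeding raw entailment logits into the softmax is an assumption about how the transformers pipeline behaves in its default single-label mode.

```python
import math


def zero_shot_scores(entailment_scores):
    """Softmax over per-label entailment scores; returns {label: probability}."""
    peak = max(entailment_scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - peak) for label, s in entailment_scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical entailment scores, one per hypothesis "Ce texte parle de {label}."
scores = zero_shot_scores({"cinéma": 2.1, "littérature": 0.4, "politique": -0.6})
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Each candidate label costs one forward pass of the NLI model, so latency grows roughly linearly with the number of labels.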

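Allocine dataset

The allocine dataset is normally used to train sentiment analysis models; here it serves as a zero-shot benchmark. It contains two classes, "positive" and "negative" movie reviews. The hypothesis template is "Ce commentaire est {}." and the candidate labels are the French words for "positive" and "negative".

Model	Time (ms)	Accuracy (%)	MCC (x100)
cmarkea/distilcamembert-base-nli	195.54	80.59	63.71
BaptisteDoyen/camembert-base-xnli	378.39	86.37	73.74
MoritzLaurer/mDeBERTa-v3-base-mnli-xnli	520.58	84.97	70.05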
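A zero-shot sentiment classifier along these lines might look like the following sketch. The label strings `positif` and `négatif` are an assumption (the text only says that the labels mean positive and negative), and `classify_review` and `stub_classifier` are illustrative names. The helper takes any callable with the zero-shot pipeline's interface, so a stub stands in for the real model here.

```python
TEMPLATE = "Ce commentaire est {}."  # hypothesis template from the text above
LABELS = ["positif", "négatif"]      # assumed French label strings


def classify_review(classifier, review):
    """Return the most likely sentiment label for one review."""
    result = classifier(review, candidate_labels=LABELS,
                        hypothesis_template=TEMPLATE)
    return result["labels"][0]  # labels come back sorted by descending score


def stub_classifier(text, candidate_labels, hypothesis_template):
    """Stand-in for the real pipeline: favors the first label if 'excellent' appears."""
    first = 0.9 if "excellent" in text.lower() else 0.1
    pairs = sorted(zip([first, 1 - first], candidate_labels), reverse=True)
    return {"sequence": text,
            "labels": [label for _, label in pairs],
            "scores": [score for score, _ in pairs]}


print(classify_review(stub_classifier, "Un film excellent, à voir absolument."))
```

Replacing `stub_classifier` with the `classifier` pipeline from the quick start would run the same logic against the actual model.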

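MLSum dataset

The mlsum dataset is normally used to train summarization models; here it serves as a zero-shot benchmark. Sub-topics are aggregated and a few of them selected, and the summary part of each article is used to predict its topic. The hypothesis template is "C'est un article traitant de {}." and the candidate labels are the French words for "economy", "politics", "sport" and "science".

Model	Time (ms)	Accuracy (%)	MCC (x100)
cmarkea/distilcamembert-base-nli	217.77	79.30	70.55
BaptisteDoyen/camembert-base-xnli	448.27	70.7	64.10
MoritzLaurer/mDeBERTa-v3-base-mnli-xnli	591.34	64.45	58.67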
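Both metrics in these tables can be computed from paired true and predicted labels. The sketch below implements accuracy and the multiclass form of the Matthews correlation coefficient from label counts; the toy labels are invented and unrelated to the reported figures. In practice, `sklearn.metrics.matthews_corrcoef` computes the same quantity.

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient computed from label counts."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    cross = sum(true_counts[k] * pred_counts[k] for k in labels)
    denom = math.sqrt((n * n - sum(c * c for c in pred_counts.values()))
                      * (n * n - sum(c * c for c in true_counts.values())))
    # By convention, return 0 when the coefficient is undefined.
    return (correct * n - cross) / denom if denom else 0.0


# Invented toy labels, for illustration only.
y_true = ["sport", "sport", "politique", "politique"]
y_pred = ["sport", "politique", "politique", "politique"]
print(f"accuracy = {accuracy(y_true, y_pred):.2f}, "
      f"MCC x100 = {100 * mcc(y_true, y_pred):.2f}")
```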

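🔧 Technical details

The model is fine-tuned from DistilCamemBERT, which has fewer parameters and a lower compute cost than CamemBERT, so inference is more efficient at some cost in accuracy (77.45% against 81.72% for the CamemBERT-based model on XNLI). For zero-shot classification it relies on its natural language inference ability: each candidate label is turned into a hypothesis, and the text is classified according to the entailment probability between premise and hypothesis.

📄 License

This project is released under the MIT license.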

📖 Citation

@inproceedings{delestre:hal-03674695,
  TITLE = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
  AUTHOR = {Delestre, Cyrile and Amar, Abibatou},
  URL = {https://hal.archives-ouvertes.fr/hal-03674695},
  BOOKTITLE = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
  ADDRESS = {Vannes, France},
  YEAR = {2022},
  MONTH = Jul,
  KEYWORDS = {NLP ; Transformers ; CamemBERT ; Distillation},
  PDF = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
  HAL_ID = {hal-03674695},
  HAL_VERSION = {v1},
}