setfit-model-paraphrase-MiniLM-L6-v2 open-source model - free to use for few-shot text classification


Setfit Model Paraphrase MiniLM L6 V2

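Developed by hleAtKeeper

A SetFit-based model for efficient few-shot text classification. It uses sentence-transformers/paraphrase-MiniLM-L6-v2 as the sentence embedding model and a LogisticRegression head for classification.

Text classification #Few-shot learning #Efficient text classification #Command-line security analysis

Downloads: 418

Released: 4/15/2025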

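Model Overview

The model combines the SetFit framework with a pretrained sentence embedding model. It targets text classification and is particularly suited to few-shot learning scenarios.

Model Features

Efficient few-shot learning

Uses contrastive learning to fine-tune the embedding model, so it learns effectively from only a small number of labeled examples.

Accurate classification

Achieves high accuracy on the text classification task (99.15% evaluation accuracy).

Two-stage training

The sentence embedding model is fine-tuned first, then the classification head is trained on its embeddings.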

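Model Capabilities

Text classification

Few-shot learning

Command statement classification

Use Cases

System command classification

Command risk-level classification

Classifies Linux system commands by risk level (Critical/High/Medium/Low).

Accuracy: 99.15%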

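🚀 SetFit with sentence-transformers/paraphrase-MiniLM-L6-v2

This is a SetFit model for text classification. It uses sentence-transformers/paraphrase-MiniLM-L6-v2 as the sentence embedding model and a LogisticRegression instance for classification.

The model was trained with an efficient few-shot learning technique that has two steps:

Fine-tune the Sentence Transformer with contrastive learning.
Train the classification head on features extracted by the fine-tuned Sentence Transformer.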
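
The first step can be made concrete with a small sketch of how labeled examples turn into contrastive training pairs: two texts that share a label form a positive pair (target similarity 1.0), and two texts with different labels form a negative pair (target 0.0). This is a simplified illustration in plain Python, not SetFit's actual sampling code. The helper name `make_pairs` is hypothetical, and the example commands are taken from the label table in this card.

```python
from itertools import combinations

def make_pairs(examples):
    """Build (text_a, text_b, target) triples from (text, label) examples.

    target is 1.0 when both texts share a label, else 0.0.
    Hypothetical helper for illustration; not part of the SetFit API.
    """
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(examples, 2):
        target = 1.0 if label_a == label_b else 0.0
        pairs.append((text_a, text_b, target))
    return pairs

examples = [
    ("reboot", "Low"),
    ("apt-get update", "Low"),
    ("history -c", "Critical"),
    ("bash /tmp/exploit.sh", "High"),
]
pairs = make_pairs(examples)
positives = [p for p in pairs if p[2] == 1.0]
negatives = [p for p in pairs if p[2] == 0.0]
print(len(pairs), len(positives), len(negatives))
```

Four labeled examples already produce six training pairs, which illustrates how a small labeled set can still supply many contrastive examples.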

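✨ Key Features

Efficient few-shot learning: contrastive fine-tuning lets the model learn effectively from a small number of labeled examples.
Accurate classification: achieves high accuracy on the text classification task.

📦 Installation

First, install the SetFit library: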

pip install setfit

💻 Usage Example

Basic usage

from setfit import SetFitModel

# Download the model from the Hugging Face Hub
model = SetFitModel.from_pretrained("setfit_model_id")
# Run inference
preds = model("systemctl stop apache2")
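
For more than one command, a small wrapper can pair each input with its predicted label. The sketch below assumes the loaded model exposes a `predict` method that accepts a list of strings, as `SetFitModel` does in SetFit 1.x. `classify_commands` and the `StubModel` stand-in are hypothetical names, used so the snippet runs without downloading anything.

```python
def classify_commands(model, commands):
    """Return {command: predicted_label} for a list of shell commands.

    `model` is assumed to provide predict(list_of_str), like SetFitModel.
    """
    labels = model.predict(list(commands))
    return {cmd: str(label) for cmd, label in zip(commands, labels)}

class StubModel:
    """Stand-in for a loaded SetFitModel; returns a fixed label for the demo."""
    def predict(self, texts):
        return ["Low" for _ in texts]

result = classify_commands(StubModel(), ["apt-get update", "cd /home/user"])
print(result)
```

With the real model, pass the object returned by `SetFitModel.from_pretrained(...)` in place of `StubModel()`.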

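📚 Documentation

Model Details

Model Description

Property	Details
Model type	SetFit
Sentence embedding model	sentence-transformers/paraphrase-MiniLM-L6-v2
Classification head	LogisticRegression instance
Maximum sequence length	128 tokens
Number of classes	4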

Model Sources

Repository: SetFit on GitHub
Paper: Efficient Few-Shot Learning Without Prompts
Blog post: SetFit: Efficient Few-Shot Learning Without Prompts

Model Labels

Label	Examples
Medium	'chmod 777 /tmp' 'nmap -p 22,80,443 192.168.1.1' "grep -r 'root' /etc"
Low	'reboot' 'apt-get update' 'cd /home/user'
Critical	'history -c' "echo 'export HISTFILE=/dev/null' >> ~/.bashrc" "ssh-keygen -t rsa -f ~/.ssh/id_rsa -q -N ''"
High	"echo 'export HISTFILE=/dev/null' >> ~/.bashrc" 'bash /tmp/malicious.sh' 'bash /tmp/exploit.sh'

Evaluation

Metrics

Label	Accuracy
all	0.9915
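
Accuracy here is the fraction of evaluation examples whose predicted label equals the reference label. A minimal sketch of that computation follows. The function name and the toy labels are illustrative only and are not the evaluation data behind the 0.9915 figure.

```python
def accuracy(predicted, expected):
    """Fraction of positions where the predicted label matches the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy of an empty evaluation set")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)

# Toy labels for illustration only.
predicted = ["Low", "High", "Critical", "Medium"]
expected = ["Low", "High", "High", "Medium"]
score = accuracy(predicted, expected)
print(score)
```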

Training Details

Training Set Metrics

Training set	Min	Median	Max
Word count	1	3.1356	11

Label	Training Sample Count
Low	42
Medium	17
High	40
Critical	19

Training Hyperparameters

batch_size: (16, 16)
num_epochs: (1, 1)
max_steps: -1
sampling_strategy: oversampling
body_learning_rate: (2e-05, 1e-05)
head_learning_rate: 0.01
loss: CosineSimilarityLoss
distance_metric: cosine_distance
margin: 0.25
end_to_end: False
use_amp: False
warmup_proportion: 0.1
l2_weight: 0.01
seed: 42
eval_max_steps: -1
load_best_model_at_end: True
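
The `loss: CosineSimilarityLoss` entry refers to the Sentence Transformers loss that computes the cosine similarity of the two sentence embeddings in a pair and penalizes its squared difference from the pair's target similarity. The sketch below reproduces that computation for a single pair in plain Python. The three-dimensional vectors are made up for illustration (real embeddings are much larger), and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_loss(u, v, target):
    """Squared error between cos(u, v) and the target similarity."""
    return (cosine_similarity(u, v) - target) ** 2

# Identical embeddings labeled as a positive pair: no penalty.
same = cosine_similarity_loss([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], 1.0)
# Identical embeddings labeled as a negative pair: maximal penalty here.
wrong = cosine_similarity_loss([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], 0.0)
# Orthogonal embeddings labeled as a negative pair: no penalty.
orth = cosine_similarity_loss([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0.0)
print(same, wrong, orth)
```

Minimizing this loss pulls same-label commands together in embedding space and pushes different-label commands apart, which is what makes a simple classification head sufficient afterwards.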

Training Results

Epoch	Step	Training Loss	Validation Loss
0.0016	1	0.4702	-
0.0806	50	0.2501	-
0.1613	100	0.1859	-
0.2419	150	0.1318	-
0.3226	200	0.1157	-
0.4032	250	0.095	-
0.4839	300	0.0902	-
0.5645	350	0.0796	-
0.6452	400	0.0663	-
0.7258	450	0.0539	-
0.8065	500	0.045	-
0.8871	550	0.0378	-
0.9677	600	0.0332	-
1.0	620	-	0.1862
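
The 620 steps in the table can be reproduced from the training sample counts and the batch size of 16, under one assumption about how `sampling_strategy: oversampling` behaves: every unordered pair of texts with different labels is used once as a negative pair, and positive pairs are repeated until there are as many of them as negatives. This is a back-of-the-envelope reconstruction, not a statement about SetFit's internals.

```python
import math
from itertools import combinations

# Per-label training sample counts, as listed in this card.
counts = {"Low": 42, "Medium": 17, "High": 40, "Critical": 19}
batch_size = 16  # embedding-phase batch size from the hyperparameters

# Negative pairs: one text from each of two different classes.
negative_pairs = sum(a * b for a, b in combinations(counts.values(), 2))

# Assumed oversampling: positives are repeated to match the negatives.
total_pairs = 2 * negative_pairs
steps_per_epoch = math.ceil(total_pairs / batch_size)
print(negative_pairs, total_pairs, steps_per_epoch)  # 4955 9910 620
```

Under that assumption, one epoch over 9,910 pairs at 16 pairs per step takes 620 steps, which matches the final row of the table.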

Framework Versions

Python: 3.13.2
SetFit: 1.1.2
Sentence Transformers: 4.0.2
Transformers: 4.51.0
PyTorch: 2.6.0
Datasets: 3.5.0
Tokenizers: 0.21.1
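
When trying to reproduce these results, it can help to compare the locally installed packages against the versions listed above. The following is a minimal sketch using the standard library's `importlib.metadata`. The distribution names in `EXPECTED` are assumptions based on the usual PyPI names for these libraries, and `installed_version` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this card; the PyPI distribution names are assumed.
EXPECTED = {
    "setfit": "1.1.2",
    "sentence-transformers": "4.0.2",
    "transformers": "4.51.0",
    "torch": "2.6.0",
    "datasets": "3.5.0",
    "tokenizers": "0.21.1",
}

def installed_version(name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None

for name, expected in EXPECTED.items():
    found = installed_version(name)
    status = "ok" if found == expected else "differs"
    print(f"{name}: expected {expected}, found {found} ({status})")
```

A "differs" result is not necessarily a problem; some builds append a local suffix to the version string, so treat the output as a hint rather than a strict check.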

🔧 Technical Details

Citation

@article{https://doi.org/10.48550/arxiv.2209.11055,
    doi = {10.48550/ARXIV.2209.11055},
    url = {https://arxiv.org/abs/2209.11055},
    author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
    keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
    title = {Efficient Few-Shot Learning Without Prompts},
    publisher = {arXiv},
    year = {2022},
    copyright = {Creative Commons Attribution 4.0 International}
}