gliclass-large-v1.0-init: an open-source zero-shot classifier for topic classification and other text analysis tasks

Gliclass Large V1.0 Init

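Developed by knowledgator

GLiClass is an efficient zero-shot classifier trained on synthetic data. It is suited to topic classification, sentiment analysis, and reranking in RAG pipelines.

Text classification

Transformers

Language: English
License: Apache-2.0
Tags: #zero-shot-classification #efficient-single-pass-inference #multi-label-classification

Downloads: 85

Released: 6/3/2024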

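Model overview

A lightweight sequence classification model inspired by GLiNER. It supports zero-shot classification and delivers performance comparable to cross-encoders at a lower computational cost.

Model features

Efficient zero-shot classification

Classification completes in a single forward pass, which makes it more compute-efficient than a traditional cross-encoder

Multi-task applicability

Supports topic classification, sentiment analysis, RAG reranking, and other text processing tasks

Commercially friendly

Trained on synthetic data, so it can be used in commercial applications

Model capabilities

Zero-shot text classification

Multi-label classification

Sentiment analysis

Reranking for retrieval-augmented generation (RAG)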

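Use cases

Content classification

News topic classification

Automatically tag news text with one or more topics

Zero-shot F1 of 0.7516 on the AG_NEWS dataset

Sentiment analysis

Review sentiment detection

Identify the sentiment expressed in user reviews

Zero-shot F1 of 0.9404 on the IMDB dataset

Information retrieval

RAG result reranking

Improve document ranking in retrieval-augmented generation pipelines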

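⭐ GLiClass: Generalist and Lightweight Model for Sequence Classification

GLiClass is an efficient zero-shot classifier inspired by the GLiNER work. It performs comparably to a cross-encoder while being more compute-efficient, because classification takes a single forward pass. It can be used for topic classification, sentiment analysis, and as a reranker in RAG pipelines. The model was trained on synthetic data and can be used in commercial applications. It was trained only on the initial dataset (MoritzLaurer/synthetic_zeroshot_mixtral_v0.1), with no additional fine-tuning on any other dataset.

🚀 Quick start

Installation

First, install the GLiClass library: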
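The efficiency claim above can be made concrete with a little arithmetic. The sketch below assumes a cross-encoder-style zero-shot classifier encodes one sequence per (text, label) pair, while a single-pass classifier encodes each text once together with all of its labels. Both function names are hypothetical and exist only for this illustration; batching and sequence-length effects are ignored.

```python
def cross_encoder_sequences(num_texts: int, num_labels: int) -> int:
    """Assumed cost model: one encoded sequence per (text, label) pair."""
    return num_texts * num_labels


def single_pass_sequences(num_texts: int, num_labels: int) -> int:
    """Assumed cost model: one encoded sequence per text, labels included."""
    return num_texts


num_texts, num_labels = 1000, 5
print(cross_encoder_sequences(num_texts, num_labels))  # 5000
print(single_pass_sequences(num_texts, num_labels))    # 1000
```

Under these assumptions the gap grows linearly with the number of candidate labels, which is why the single-pass design matters most for large label sets.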

pip install gliclass

Initialize the model and pipeline

import torch
from gliclass import GLiClassModel, ZeroShotClassificationPipeline
from transformers import AutoTokenizer

model = GLiClassModel.from_pretrained("knowledgator/gliclass-large-v1.0-init")
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-large-v1.0-init")

# Use the GPU when one is available; otherwise fall back to the CPU.
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device=device)

text = "One day I will see the world!"
labels = ["travel", "dreams", "sport", "science", "politics"]
# The pipeline returns one result list per input text; [0] selects the list for our single text.
results = pipeline(text, labels, threshold=0.5)[0]

for result in results:
    print(result["label"], "=>", result["score"])
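The quick-start loop prints every label that clears the threshold. A common follow-up is to pick the single best label or to apply a stricter cutoff without rerunning the model. The sketch below is a minimal post-processing helper that assumes only the result shape used above (a list of dicts with "label" and "score" keys). The helper names and the sample scores are hypothetical and hand-written, not real model output.

```python
def top_label(results, min_score=0.0):
    """Return the highest-scoring label, or None if no score reaches min_score."""
    kept = [r for r in results if r["score"] >= min_score]
    if not kept:
        return None
    return max(kept, key=lambda r: r["score"])["label"]


def labels_above(results, threshold):
    """Return labels whose score is at least `threshold`, best first."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [r["label"] for r in ranked if r["score"] >= threshold]


# Hand-written sample in the pipeline's output shape (illustrative values only).
sample = [
    {"label": "dreams", "score": 0.78},
    {"label": "travel", "score": 0.91},
    {"label": "sport", "score": 0.12},
]

print(top_label(sample))
print(labels_above(sample, 0.5))
```

Keeping this logic outside the pipeline call makes it easy to tune the cutoff on a validation set without repeating inference.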

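✨ Key features

Efficient zero-shot classification: inspired by GLiNER, it matches cross-encoder performance while being more compute-efficient.
Multi-task use: suitable for topic classification, sentiment analysis, and reranking in RAG pipelines.
Synthetic training data: trained on synthetic data, so it can be used in commercial applications.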
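The source does not show how reranking is wired up, so the following is only a sketch of the general pattern: score each retrieved document against the query, then sort by score. `rerank` and `toy_score` are hypothetical names. `toy_score` is a word-overlap stand-in so the example runs without a model. With GLiClass, one plausible (unconfirmed) approach is to pass the document as the text and the query as the only label, then use the returned score.

```python
def rerank(query, documents, score_fn, top_k=3):
    """Score every document against the query and return the top_k (doc, score) pairs."""
    scored = [(doc, score_fn(doc, query)) for doc in documents]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_score(document, query):
    """Stand-in scorer: fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


docs = [
    "cheap flights to paris",
    "python list sorting",
    "paris travel guide and flights",
]
for doc, score in rerank("paris travel flights", docs, toy_score, top_k=2):
    print(round(score, 2), doc)
```

Passing the scorer as a callable keeps the ranking logic independent of the model, so the stand-in can be swapped for a real classifier call later.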

📚 Detailed documentation

Benchmarks

The table below lists F1 scores on several text classification datasets. None of the tested models were fine-tuned on these datasets; all were evaluated in a zero-shot setting.

Model	IMDB	AG_NEWS	Emotions
gliclass-large-v1.0 (438 M)	0.9404	0.7516	0.4874
gliclass-base-v1.0 (186 M)	0.8650	0.6837	0.4749
gliclass-small-v1.0 (144 M)	0.8650	0.6805	0.4664
Bart-large-mnli (407 M)	0.89	0.6887	0.3765
Deberta-base-v3 (184 M)	0.85	0.6455	0.5095
Comprehendo (184 M)	0.90	0.7982	0.5660
SetFit BAAI/bge-small-en-v1.5 (33.4 M)	0.86	0.5636	0.5754
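For readers who want to compute a comparable metric on their own data, here is a minimal pure-Python F1 sketch. The source does not say which averaging the table uses, so macro-averaging here is an assumption, and the labels in the example are hand-written rather than taken from any benchmark.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, computed from true/false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.7333
```

In this toy case the "pos" class scores 2/3 and the "neg" class scores 0.8, so the macro average is about 0.7333.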