dunzhang-stella_en_400M_v5: Open-Source English Text-Processing Model


Dunzhang Stella En 400M V5

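Developed by Marqo

Stella 400M is a mid-sized English text-processing model focused on classification and information retrieval tasks.

Text Classification

Transformers

Open-source license: MIT #high-accuracy text classification #e-commerce review analysis #multi-task evaluation

Downloads: 17.20k

Released: 9/25/2024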

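Model Overview

The model is intended mainly for text classification and information retrieval, and it scores well on the MTEB benchmark datasets listed in the evaluation results below.

Model Features

High-performance classification

Reaches 97.19% accuracy on Amazon product review polarity classification (MTEB AmazonPolarityClassification)

Multi-task capability

Supports multiple text-processing tasks, including classification and information retrieval

Mid-sized

A balanced 400M-parameter design that trades off performance and efficiency

Model Capabilities

Text classification

Sentiment analysis

Information retrieval

Text similarity computation

Use Cases

E-commerce

Product review classification

Automatically classifies the sentiment polarity of Amazon product reviews

Reaches 97.19% accuracy on the Amazon polarity classification task

Multi-class review classification

Classifies Amazon reviews by star rating

Reaches 59.53% accuracy on the Amazon reviews multi-class classification task

Information retrieval

Argument retrieval

Retrieves matching arguments on the ArguAna dataset

Achieves a main score of 64.24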

🚀 Marqo Stella v2

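Marqo Stella v2 is similar to the original Dunzhang stella 400m model, but with a fused Matryoshka layer. This hierarchical structure reduces the computational overhead of generating embeddings without changing the relevance metrics.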

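🚀 Quick Start

Environment Setup

Make sure the required libraries are installed: transformers, torch, and scikit-learn (imported in code as sklearn). You can install them with:

pip install transformers torch scikit-learn

Code Example

The following example embeds queries and documents with the model and computes the similarity between them: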

import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.preprocessing import normalize

# Instruction prompt that is prepended to every query
query_prompt = "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: "
# Example queries
queries = [
    "What are some ways to reduce stress?",
    "What are the benefits of drinking green tea?",
]
# Prepend the prompt to each query
queries = [query_prompt + query for query in queries]
# Example documents; documents do not need a prompt
docs = [
    "There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.",
    "Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.",
]

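# Local path of the cloned model (a Hugging Face Hub model ID such as this one also works)
model_dir = "Marqo/dunzhang-stella_en_400M_v5"
# Load the model, move it to the GPU, and switch it to evaluation mode
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True).cuda().eval()
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

# Embed the queries: mean-pool the non-padding token states, then L2-normalize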
with torch.no_grad():
    input_data = tokenizer(queries, padding="longest", truncation=True, max_length=512, return_tensors="pt")
    input_data = {k: v.cuda() for k, v in input_data.items()}
    attention_mask = input_data["attention_mask"]
    last_hidden_state = model(**input_data)[0]
    last_hidden = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
    query_vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    query_vectors = normalize(query_vectors.cpu().numpy())

# Embed the documents the same way
with torch.no_grad():
    input_data = tokenizer(docs, padding="longest", truncation=True, max_length=512, return_tensors="pt")
    input_data = {k: v.cuda() for k, v in input_data.items()}
    attention_mask = input_data["attention_mask"]
    last_hidden_state = model(**input_data)[0]
    last_hidden = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
    docs_vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    docs_vectors = normalize(docs_vectors.cpu().numpy())

# Print the shapes of the query and document embedding matrices
print(query_vectors.shape, docs_vectors.shape)
# (2, 1024) (2, 1024)

# Compute query-document similarities (dot products of L2-normalized vectors, i.e. cosine similarity)
similarities = query_vectors @ docs_vectors.T
print(similarities)
# [[0.8397531  0.29900077]
#  [0.32818374 0.80954516]]
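
Each row of the similarity matrix corresponds to a query and each column to a document, so the best match for a query is the column with the highest score. The following is a minimal, dependency-free sketch of that lookup. It copies the matrix printed above as a literal; the helper name `rank_documents` is an illustrative assumption, not part of the model or of any library.

```python
# Similarity matrix copied from the printed output above:
# rows are queries, columns are documents.
similarities = [
    [0.8397531, 0.29900077],
    [0.32818374, 0.80954516],
]


def rank_documents(scores):
    """Return document indices ordered from most to least similar.

    `rank_documents` is a hypothetical helper written for this sketch.
    """
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# One ranking per query
rankings = [rank_documents(row) for row in similarities]

for query_idx, ranking in enumerate(rankings):
    best = ranking[0]
    score = similarities[query_idx][best]
    print(f"query {query_idx}: best document {best} (score {score:.4f})")
```

With the values above, the stress query ranks the stress passage first and the green tea query ranks the green tea passage first.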

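💻 Usage Examples

Basic Usage

The code above shows how to embed queries and documents with the model and compute the similarity between them. The steps are:

Define the queries and documents: prepare the lists of queries and documents to process.
Load the model and tokenizer: load both from the specified path.
Embed the queries and documents: tokenize them, run the model, and mean-pool the token outputs into one normalized embedding vector per text.
Compute similarity: multiply the query matrix by the transposed document matrix to get the query-document similarities.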
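
The embedding step in the quick-start code averages the hidden states of the real tokens, ignores padding positions, and then L2-normalizes the result. The sketch below reproduces that arithmetic on a tiny hand-made input using only the Python standard library, so the numbers can be checked by hand. The function names `masked_mean_pool` and `l2_normalize` and the toy values are assumptions made for illustration; the real code performs the same operations on torch tensors.

```python
import math


def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                total[j] += value
    return [value / count for value in total]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


# Toy input: three token positions with 2-dimensional hidden states.
# The last position is padding (mask 0), so it must not affect the result.
token_vectors = [[1.0, 2.0], [5.0, 6.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean_pool(token_vectors, attention_mask)  # mean of the first two rows
embedding = l2_normalize(pooled)                          # unit-length vector

print(pooled)
print(embedding)
```

Here the pooled vector is the mean of the two unmasked rows, (3, 4), and dividing by its length 5 gives the unit vector (0.6, 0.8).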

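Advanced Usage

You can extend the code to suit your needs, for example:

Batch processing: split larger sets of queries and documents into batches so they can be embedded efficiently.
Alternative similarity measures: because the vectors are L2-normalized, the matrix product above already yields cosine similarity; other measures, such as Euclidean distance, can be used instead.
Combining with other models: fuse this model's output with the output of other models to improve performance.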
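
Two of these extensions can be sketched without loading the model. The first helper splits a list of texts into fixed-size batches that could each be passed to the tokenizer and model in turn. The second part checks that, for unit-length vectors, the dot product is the cosine similarity and is tied to Euclidean distance by distance² = 2 − 2·cosine, so both measures rank documents identically. The names `chunked`, `dot`, and `euclidean_distance`, the batch size, and the sample vectors are illustrative assumptions.

```python
import math


def chunked(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def dot(a, b):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Batching: five placeholder texts in batches of two
texts = ["doc 1", "doc 2", "doc 3", "doc 4", "doc 5"]
batches = list(chunked(texts, batch_size=2))
print(batches)

# Similarity: two unit-length sample vectors
u = [0.6, 0.8]
v = [1.0, 0.0]
cosine = dot(u, v)
distance = euclidean_distance(u, v)

# For unit vectors, squared distance and cosine carry the same information
print(cosine, distance ** 2, 2 - 2 * cosine)
```

In a real pipeline each batch would go through the same tokenize, forward, pool, and normalize steps as in the quick-start code, and the resulting matrices would be concatenated.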

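📄 License

This project is released under the MIT License.

Model Evaluation Results

The table below lists the model's main score on each MTEB dataset: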

Dataset	Task type	Main score
MTEB AmazonCounterfactualClassification (en)	Classification	92.35820895522387
MTEB AmazonPolarityClassification	Classification	97.1945
MTEB AmazonReviewsClassification (en)	Classification	59.528000000000006
MTEB ArguAna	Retrieval	64.24
MTEB ArxivClusteringP2P	Clustering	55.1564333205451
MTEB ArxivClusteringS2S	Clustering	49.823698316694795
MTEB AskUbuntuDupQuestions	Reranking	66.15294503553424
MTEB BIOSSES	Semantic textual similarity	83.29587385660628
MTEB Banking77Classification	Classification	89.30194805194806
MTEB BiorxivClusteringP2P	Clustering	50.67972171889736
MTEB BiorxivClusteringS2S	Clustering	45.80539715556144
MTEB CQADupstackRetrieval	Retrieval	44.361250000000005
MTEB ClimateFEVER	Retrieval	43.525999999999996
MTEB DBPedia	Retrieval	49.884
MTEB EmotionClassification	Classification	78.77499999999999
MTEB FEVER	Retrieval	90.986
MTEB FiQA2018	Retrieval	56.056
MTEB HotpotQA	Retrieval	71.74199999999999
MTEB ImdbClassification	Classification	96.4904
MTEB MSMARCO	Retrieval	43.692
MTEB MTOPDomainClassification (en)	Classification	98.82580939352485
MTEB MTOPIntentClassification (en)	Classification	92.29822161422709
MTEB MassiveIntentClassification (en)	Classification	85.17484868863484
MTEB MassiveScenarioClassification (en)	Classification	89.61667787491594
MTEB MedrxivClusteringP2P	Clustering	46.318282423948574
MTEB MedrxivClusteringS2S	Clustering	44.29033625273981
MTEB MindSmallReranking	Reranking	33.0526129239962
MTEB NFCorpus	Retrieval	41.486000000000004
MTEB NQ	Retrieval	69.072
MTEB QuoraRetrieval	Retrieval	89.58
MTEB RedditClustering	Clustering	71.18966762070158
MTEB RedditClusteringP2P	Clustering	74.42014716862516
MTEB SCIDOCS	Retrieval	25.041999999999998
MTEB SICK-R	Semantic textual similarity	82.20531642680812
MTEB STS12	Semantic textual similarity	79.51504881448884
MTEB STS13	Semantic textual similarity	89.18936052329725
MTEB STS14	Semantic textual similarity	85.14654611519086
MTEB STS15	Semantic textual similarity	89.10215217191254
MTEB STS16	Semantic textual similarity	87.14066355879785
MTEB STS17 (en-en)	Semantic textual similarity	90.97082650129164
MTEB STS22 (en)	Semantic textual similarity	67.82870469746828
MTEB STSBenchmark	Semantic textual similarity	87.7360146030987
MTEB SciDocsRR	Reranking	88.43547871921146
MTEB SciFact	Retrieval	78.233
MTEB SprintDuplicateQuestions	Pair classification	95.7485189884476
MTEB StackExchangeClustering	Clustering	78.49205191950675
MTEB StackExchangeClusteringP2P	Clustering	48.90421736513028
MTEB StackOverflowDupQuestions	Reranking	52.9874730481696
MTEB SummEval	Summarization	31.66058223980157
MTEB TRECCOVID	Retrieval	85.206
MTEB Touche2020	Retrieval	31.455
MTEB ToxicConversationsClassification	Classification	86.9384765625
MTEB TweetSentimentExtractionClassification	Classification	73.57668364459535
MTEB TwentyNewsgroupsClustering	Clustering	58.574148097494685
MTEB TwitterSemEval2015	Pair classification	80.18603932881858
MTEB TwitterURLCorpus	Pair classification	87.46554314325058