Stella 400M v5: Open-Source English Text Embedding Model, Free to Deploy for Text Classification and Retrieval

Stella En 400M V5

Developed by billatsectorflow

Stella 400M v5 is an English text embedding model that performs well on a range of text classification and retrieval tasks.

Large Language Model

Transformers

Other  License: MIT  #High-accuracy text classification  #Multi-task evaluation  #English NLP

Downloads: 7,630

Released: 1/22/2025

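Model Overview

This is an English text embedding model intended mainly for text classification and retrieval. It scores well on several standard benchmark datasets.

Model Features

High-performance text classification

Reaches 97.19% accuracy on the Amazon product review classification task

Strong text retrieval

Reaches an NDCG@10 of 64.24 on the ArguAna retrieval task

Multi-task adaptability

Performs consistently across text processing tasks, including classification and retrieval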

Model Capabilities

Text classification

Text retrieval

Semantic similarity computation

Text embedding generation

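Use Cases

E-commerce

Product review classification

Classify Amazon product reviews as positive or negative

Accuracy: 97.19%

Multi-class product review classification

Classify Amazon product reviews by star rating

Accuracy: 59.53%

Information retrieval

Argument retrieval

Retrieve arguments on the ArguAna dataset

NDCG@10: 64.24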

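🚀 stella_en_400M_v5 Model

This project trains a series of models with different output dimensions on top of existing base models and simplifies prompt usage. The models perform well on multiple tasks. The core training code will later be integrated into a related library, and this page also provides usage examples for several libraries and answers to common questions.

🚀 Quick Start

Model Updates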

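Hi everyone, and thanks for using the stella models. After six months of work, I trained the jasper model on top of stella. It is a multimodal model that ranks 2nd on MTEB (results were submitted on 2024-12-11 and may still need official review).

Model link: jasper_en_vision_language_v1

I will focus on the technical report, the training data, and the related code, and I hope the tricks I used will be useful to you!

The core training code will soon be integrated into the rag-retrieval library. (Stars are welcome.)

I did this work in my spare time, purely as a hobby. One person's time and energy are limited, so contributions of any kind are welcome!

You can also find these models on my home page.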

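Model Introduction

These models are trained on top of Alibaba-NLP/gte-large-en-v1.5 and Alibaba-NLP/gte-Qwen2-1.5B-instruct. Thanks for their contributions!

We simplified prompt usage and provide two prompts that cover most general tasks: one for s2p (sentence-to-passage) tasks and one for s2s (sentence-to-sentence) tasks.

Prompt for s2p tasks (e.g. retrieval):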

Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: {query}

Prompt for s2s tasks (e.g. semantic textual similarity):

Instruct: Retrieve semantically similar text.\nQuery: {query}
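As a small illustration of how the two prompts above are applied, the sketch below prefixes a query with the s2p or s2s prompt and leaves documents untouched. The prompt strings are copied from the text; the helper name `build_query` and the `task` argument are hypothetical and only used here for illustration.

```python
# Prompt templates copied from the text above.
S2P_PROMPT = (
    "Instruct: Given a web search query, retrieve relevant passages "
    "that answer the query.\nQuery: "
)
S2S_PROMPT = "Instruct: Retrieve semantically similar text.\nQuery: "


def build_query(query: str, task: str = "s2p") -> str:
    """Prefix a query with the s2p or s2s prompt (hypothetical helper)."""
    prompts = {"s2p": S2P_PROMPT, "s2s": S2S_PROMPT}
    if task not in prompts:
        raise ValueError(f"unknown task {task!r}; expected 's2p' or 's2s'")
    return prompts[task] + query


# Queries get a prompt; documents are encoded as-is.
print(build_query("What are some ways to reduce stress?"))
print(build_query("A man is playing a guitar.", task="s2s"))
```

Only the query side is prefixed, which matches the note in the usage examples that documents do not need a prompt.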

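The models are finally trained with MRL, so they provide multiple output dimensions: 512, 768, 1024, 2048, 4096, 6144, and 8192.

Higher dimensions give better performance, but 1024d is enough in most cases: its MTEB score is only 0.001 lower than that of 8192d.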
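To make the trade-off between dimensions concrete, the following back-of-the-envelope sketch computes the raw storage needed per million vectors at each listed dimension. It assumes uncompressed float32 storage (4 bytes per value), which is an assumption of this example and not something the model card specifies; `index_size_gib` is a hypothetical helper.

```python
# Dimensions listed in the text above.
DIMS = [512, 768, 1024, 2048, 4096, 6144, 8192]
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32


def index_size_gib(num_vectors: int, dim: int) -> float:
    """Raw size in GiB of an uncompressed float32 embedding matrix."""
    return num_vectors * dim * BYTES_PER_FLOAT32 / 2**30


for dim in DIMS:
    size = index_size_gib(1_000_000, dim)
    print(f"{dim:>5}d: {size:6.2f} GiB per million vectors")
```

Under these assumptions, one million vectors take roughly 3.8 GiB at 1024d and eight times that at 8192d, which is why the small score difference rarely justifies the largest dimension.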

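Model Directory Structure

The model directory structure is very simple: it is a standard SentenceTransformer directory plus a series of 2_Dense_{dims} folders, where dims is the final vector dimension.

For example, the 2_Dense_256 folder stores the linear weights that project vectors to 256 dimensions. See the following sections for usage details.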
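The folder naming convention described above can be inspected programmatically. The sketch below lists which output dimensions are present in a locally cloned model directory by matching `2_Dense_{dims}` folder names. The function name `available_dims` is hypothetical, and the code assumes the model has already been cloned to disk.

```python
import re
from pathlib import Path
from typing import List


def available_dims(model_dir: str) -> List[int]:
    """Return the sorted dimensions found as 2_Dense_{dims} folders."""
    pattern = re.compile(r"^2_Dense_(\d+)$")
    dims = []
    for entry in Path(model_dir).iterdir():
        match = pattern.match(entry.name)
        # Only directories count; stray files with a similar name are skipped.
        if match and entry.is_dir():
            dims.append(int(match.group(1)))
    return sorted(dims)
```

Calling `available_dims("{Your MODEL_PATH}")` on a cloned copy would show which projection heads can be selected.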

Usage

You can encode text with either the SentenceTransformers library or the transformers library.

Sentence Transformers

from sentence_transformers import SentenceTransformer

# This model supports two prompts, "s2p_query" and "s2s_query", for sentence-to-passage and sentence-to-sentence tasks respectively.
# They are defined in `config_sentence_transformers.json`
query_prompt_name = "s2p_query"
queries = [
    "What are some ways to reduce stress?",
    "What are the benefits of drinking green tea?",
]
# Documents do not need any prompt
docs = [
    "There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.",
    "Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.",
]

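# NOTE: the default dimension is 1024. If you need another dimension, clone the model and edit `modules.json`, replacing `2_Dense_1024` with another dimension such as `2_Dense_256` or `2_Dense_8192`.
# Run on GPU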
model = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True).cuda()
# You can also use this model without the `use_memory_efficient_attention` and `unpad_inputs` features, which lets it run on CPU.
# model = SentenceTransformer(
#     "dunzhang/stella_en_400M_v5",
#     trust_remote_code=True,
#     device="cpu",
#     config_kwargs={"use_memory_efficient_attention": False, "unpad_inputs": False}
# )
query_embeddings = model.encode(queries, prompt_name=query_prompt_name)
doc_embeddings = model.encode(docs)
print(query_embeddings.shape, doc_embeddings.shape)
# (2, 1024) (2, 1024)

similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.8398, 0.2990],
#         [0.3282, 0.8095]])
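The note in the example above says that switching the output dimension means editing `modules.json` in a cloned copy of the model. A minimal sketch of that edit follows. It assumes `modules.json` is a JSON list of module entries, each with a `path` field naming its subfolder (the usual SentenceTransformers layout); the helper name `set_output_dim` is hypothetical.

```python
import json
from pathlib import Path


def set_output_dim(model_dir: str, new_dim: int) -> None:
    """Point the Dense module in modules.json at 2_Dense_{new_dim}."""
    model_path = Path(model_dir)
    target = f"2_Dense_{new_dim}"
    if not (model_path / target).is_dir():
        raise FileNotFoundError(f"{target} not found in {model_dir}")

    modules_file = model_path / "modules.json"
    modules = json.loads(modules_file.read_text(encoding="utf-8"))
    for module in modules:
        # Assumption: each entry has a "path" key naming its subfolder.
        if module.get("path", "").startswith("2_Dense_"):
            module["path"] = target
    modules_file.write_text(json.dumps(modules, indent=2), encoding="utf-8")
```

After running this on a local clone, loading that directory with `SentenceTransformer(...)` should pick up the selected projection, provided the assumed file layout holds.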

Transformers

import os
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.preprocessing import normalize

query_prompt = "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: "
queries = [
    "What are some ways to reduce stress?",
    "What are the benefits of drinking green tea?",
]
queries = [query_prompt + query for query in queries]
# Documents do not need any prompt
docs = [
    "There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.",
    "Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.",
]

# Path to the model directory after cloning
model_dir = "{Your MODEL_PATH}"

vector_dim = 1024
vector_linear_directory = f"2_Dense_{vector_dim}"
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True).cuda().eval()
# You can also use this model without the `use_memory_efficient_attention` and `unpad_inputs` features, which lets it run on CPU (drop the `.cuda()` calls in that case).
# model = AutoModel.from_pretrained(model_dir, trust_remote_code=True,use_memory_efficient_attention=False,unpad_inputs=False).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
vector_linear = torch.nn.Linear(in_features=model.config.hidden_size, out_features=vector_dim)
vector_linear_dict = {
    k.replace("linear.", ""): v for k, v in
    torch.load(os.path.join(model_dir, f"{vector_linear_directory}/pytorch_model.bin")).items()
}
vector_linear.load_state_dict(vector_linear_dict)
vector_linear.cuda()

# Embed the queries
with torch.no_grad():
    input_data = tokenizer(queries, padding="longest", truncation=True, max_length=512, return_tensors="pt")
    input_data = {k: v.cuda() for k, v in input_data.items()}
    attention_mask = input_data["attention_mask"]
    last_hidden_state = model(**input_data)[0]
    last_hidden = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
    query_vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    query_vectors = normalize(vector_linear(query_vectors).cpu().numpy())

# Embed the documents
with torch.no_grad():
    input_data = tokenizer(docs, padding="longest", truncation=True, max_length=512, return_tensors="pt")
    input_data = {k: v.cuda() for k, v in input_data.items()}
    attention_mask = input_data["attention_mask"]
    last_hidden_state = model(**input_data)[0]
    last_hidden = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
    docs_vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    docs_vectors = normalize(vector_linear(docs_vectors).cpu().numpy())

print(query_vectors.shape, docs_vectors.shape)
# (2, 1024) (2, 1024)

similarities = query_vectors @ docs_vectors.T
print(similarities)
# [[0.8397531  0.29900077]
#  [0.32818374 0.80954516]]
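The pooling step in the example above (zero out padded positions, sum, divide by the number of real tokens, then L2-normalize) can be hard to read in tensor form. The sketch below reproduces the same arithmetic on plain Python lists for a single sequence. It is a toy illustration only: it omits the linear projection that the real code applies between pooling and normalization, and the function names are hypothetical.

```python
import math
from typing import List


def masked_mean_pool(hidden: List[List[float]], mask: List[int]) -> List[float]:
    """Average token vectors where mask == 1, ignoring padding positions."""
    kept = [vec for vec, m in zip(hidden, mask) if m == 1]
    if not kept:
        raise ValueError("attention mask has no real tokens")
    return [sum(column) / len(kept) for column in zip(*kept)]


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Toy input: three token vectors of size 2; the last one is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(hidden, mask)  # [2.0, 3.0]
unit = l2_normalize(pooled)
print(pooled, dot(unit, unit))
```

Because both query and document vectors are normalized, the matrix product `query_vectors @ docs_vectors.T` in the example is a table of cosine similarities.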

infinity_emb

Usage via infinity, which is MIT licensed.

docker run \
--gpus all -p "7997":"7997" \
michaelf34/infinity:0.0.69 \
v2 --model-id dunzhang/stella_en_400M_v5 --revision "refs/pr/24" --dtype bfloat16 --batch-size 16 --device cuda --engine torch --port 7997 --no-bettertransformer
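Once the container is running, embeddings are requested over HTTP. The client below is a sketch using only the Python standard library. It assumes the server is reachable at `localhost:7997` (the port mapped in the command above) and that it serves an OpenAI-style `POST /embeddings` route returning `{"data": [{"embedding": [...]}, ...]}`; both the route and the response shape are assumptions to verify against the infinity version you deploy. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumptions: host, port, and route as described in the lead-in.
BASE_URL = "http://localhost:7997"
MODEL_ID = "dunzhang/stella_en_400M_v5"


def build_embedding_request(texts, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a POST request for a batch of texts."""
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts):
    """Send the request and return one embedding (list of floats) per text."""
    request = build_embedding_request(texts)
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    # Assumed response shape: {"data": [{"embedding": [...]}, ...]}
    return [item["embedding"] for item in body["data"]]
```

With the server up, `embed(["What are some ways to reduce stress?"])` would return a list containing one vector. The server may not apply the s2p or s2s prompt for you, so it is likely safest to prepend the prompt to queries on the client side.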

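💻 Usage Examples

Basic usage

The SentenceTransformers and transformers examples above show the basic workflow: embed the queries and the documents, then compute their similarity.

Advanced usage

The infinity_emb example runs a specific Docker image with the model parameters passed on the command line. It suits deployments with higher performance and efficiency requirements.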

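📚 Documentation

FAQ

Q: What are the training details?

A: The training method and datasets will be released in the future (no date yet; they may be provided in a paper).

Q: How do I choose a suitable prompt for my own task?

A: In most cases, use the s2p and s2s prompts. These two prompts account for the vast majority of the training data.

Q: How do I reproduce the MTEB results?

A: Use the evaluation scripts from Alibaba-NLP/gte-Qwen2-1.5B-instruct or intfloat/e5-mistral-7b-instruct.

Q: Why does each dimension have its own linear weights?

A: MRL can be trained in several ways; we chose the one that performed best.

Q: What is the model's sequence length?

A: 512 is recommended. In our experiments, almost all models performed poorly on dedicated long-text retrieval datasets. In addition, the model was trained on data with a length of 512. This may be an area for future optimization.

If you have any questions, please start a discussion in the community.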

📄 License

This project is released under the MIT license.

🔧 Technical Details

Model Evaluation Results

Dataset	Task Type	Main Score	Other Metrics
MTEB AmazonCounterfactualClassification (en)	Classification	92.35820895522387	accuracy: 92.35820895522387 ap: 70.81322736988783 ap_weighted: 70.81322736988783 f1: 88.9505466159595 f1_weighted: 92.68630932872613
MTEB AmazonPolarityClassification	Classification	97.1945	accuracy: 97.1945 ap: 96.08192192244094 ap_weighted: 96.08192192244094 f1: 97.1936887167346 f1_weighted: 97.1936887167346
...	...	...	...