Stella 400M v5: Open-Source English Text Embedding Model for Text Classification and Retrieval

Stella En 400M V5

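Developed by billatsectorflow

Stella 400M v5 is an English text embedding model that performs well across multiple text classification and retrieval tasks.

Large Language Model

Transformers

Other  License: MIT  #High-accuracy text classification  #Multi-task evaluation  #English NLP

Downloads: 7,630

Released: 1/22/2025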

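Model Overview

This is an English text embedding model, used mainly for text classification and retrieval, with strong results on several standard datasets.

Model Features

High-performance text classification

Reaches 97.19% accuracy on Amazon product review classification (MTEB AmazonPolarityClassification)

Strong text retrieval

Reaches an NDCG@10 of 64.24 on the ArguAna retrieval task

Multi-task adaptability

Balanced performance across a range of text processing tasks, including classification and retrieval

Model Capabilities

Text classification

Text retrieval

Semantic similarity computation

Text embedding generation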

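Use Cases

E-commerce

Product review classification

Classify Amazon product reviews as positive or negative

Accuracy: 97.19%

Multi-class product review classification

Classify Amazon product reviews by star rating

Accuracy: 59.53%

Information retrieval

Argument retrieval

Retrieve arguments on the ArguAna dataset

NDCG@10: 64.24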

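🚀 stella_en_400M_v5 Model

This project trains a series of models with different output dimensions on top of existing base models and simplifies prompt usage. The models perform well on multiple tasks, and the core training code is planned for integration into a related library. Usage examples for several libraries and an FAQ are also provided.

🚀 Quick Start

Model Updates

Hello everyone, and thanks for using the stella models. After six months of work, I trained the jasper model on top of the stella model. It is a multimodal model that can rank 2nd on MTEB (results submitted on December 11, 2024; they may still need official review).

Model link: jasper_en_vision_language_v1

I will now focus on the technical report, the training data, and the related code. I hope the tricks I used are helpful to you!

The core training code will be integrated into the rag-retrieval library in the near future. (Stars are welcome.)

This work was done in my spare time, purely as a personal hobby. One person's time and energy are limited, so contributions of any kind are welcome!

You can also find these models on my homepage.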

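Model Introduction

These models are trained on top of Alibaba-NLP/gte-large-en-v1.5 and Alibaba-NLP/gte-Qwen2-1.5B-instruct. Thanks for their contributions!

We simplified prompt usage by providing two prompts for most general tasks: one for s2p (sentence-to-passage) tasks and one for s2s (sentence-to-sentence) tasks.

Prompt for s2p tasks (e.g. retrieval):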

Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: {query}

Prompt for s2s tasks (e.g. semantic textual similarity):

Instruct: Retrieve semantically similar text.\nQuery: {query}
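As a minimal sketch of how the two templates above might be applied by hand, the helper below prefixes a query with the chosen prompt. The prompt strings are quoted from this page; the names `S2P_PROMPT`, `S2S_PROMPT`, and `build_query` are hypothetical and not part of any library. Documents are passed to the model without a prompt.

```python
# Prompt templates quoted from the model card; all names here are hypothetical.
S2P_PROMPT = "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: "
S2S_PROMPT = "Instruct: Retrieve semantically similar text.\nQuery: "


def build_query(query: str, task: str = "s2p") -> str:
    """Prefix a query with the s2p or s2s prompt; documents stay unprefixed."""
    prompts = {"s2p": S2P_PROMPT, "s2s": S2S_PROMPT}
    if task not in prompts:
        raise ValueError(f"unknown task {task!r}; expected 's2p' or 's2s'")
    return prompts[task] + query


if __name__ == "__main__":
    print(build_query("What are some ways to reduce stress?"))
    print(build_query("A sentence to compare.", task="s2s"))
```

With SentenceTransformers this step is unnecessary, because the prompts are selected by name through `prompt_name`.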

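The models are finally trained with MRL (Matryoshka Representation Learning), so they offer multiple output dimensions: 512, 768, 1024, 2048, 4096, 6144, and 8192.

Higher dimensions give better performance, but 1024d is generally sufficient: its MTEB score is only 0.001 lower than that of 8192d.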

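Model Directory Structure

The model directory structure is simple: it is a standard SentenceTransformer directory plus a series of 2_Dense_{dims} folders, where dims is the final vector dimension.

For example, the 2_Dense_256 folder stores the linear weights that project vectors to 256 dimensions. See the following sections for usage details.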

Usage

You can encode text with either the SentenceTransformers library or the transformers library.

Sentence Transformers

from sentence_transformers import SentenceTransformer

# This model supports two prompts, "s2p_query" and "s2s_query", for sentence-to-passage and sentence-to-sentence tasks respectively.
# They are defined in `config_sentence_transformers.json`
query_prompt_name = "s2p_query"
queries = [
    "What are some ways to reduce stress?",
    "What are the benefits of drinking green tea?",
]
# Documents do not need any prompt
docs = [
    "There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.",
    "Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.",
]

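# NOTE: The default dimension is 1024. If you need another dimension, clone the model and edit `modules.json`, replacing `2_Dense_1024` with another dimension, e.g. `2_Dense_256` or `2_Dense_8192`.
# Run on GPU
model = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True).cuda()
# You can also use this model without the `use_memory_efficient_attention` and `unpad_inputs` features, which lets it run on CPU.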
# model = SentenceTransformer(
#     "dunzhang/stella_en_400M_v5",
#     trust_remote_code=True,
#     device="cpu",
#     config_kwargs={"use_memory_efficient_attention": False, "unpad_inputs": False}
# )
query_embeddings = model.encode(queries, prompt_name=query_prompt_name)
doc_embeddings = model.encode(docs)
print(query_embeddings.shape, doc_embeddings.shape)
# (2, 1024) (2, 1024)

similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.8398, 0.2990],
#         [0.3282, 0.8095]])
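The comment in the example above says that changing the output dimension means editing `modules.json` in a cloned copy of the model. A minimal sketch of that edit follows, assuming `modules.json` is a JSON list of module entries that each carry a `path` field. The `SAMPLE` content and the function name `select_dense_dim` are hypothetical illustrations, not the real file or a library API.

```python
import json


def select_dense_dim(modules_json_text: str, new_dim: int, default_dim: int = 1024) -> str:
    """Return modules.json text with the Dense module path pointed at another dimension."""
    modules = json.loads(modules_json_text)
    old_path, new_path = f"2_Dense_{default_dim}", f"2_Dense_{new_dim}"
    replaced = False
    for module in modules:
        if module.get("path") == old_path:
            module["path"] = new_path
            replaced = True
    if not replaced:
        raise ValueError(f"no module with path {old_path!r} found")
    return json.dumps(modules, indent=2)


# Hypothetical modules.json content, for illustration only.
SAMPLE = """[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
  {"idx": 2, "name": "2", "path": "2_Dense_1024", "type": "sentence_transformers.models.Dense"}
]"""

if __name__ == "__main__":
    print(select_dense_dim(SAMPLE, 256))
```

In practice you would read the file from the cloned directory, pass its text through a function like this, write the result back, and then load the model from that local path.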

Transformers

import os
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.preprocessing import normalize

query_prompt = "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: "
queries = [
    "What are some ways to reduce stress?",
    "What are the benefits of drinking green tea?",
]
queries = [query_prompt + query for query in queries]
# Documents do not need any prompt
docs = [
    "There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.",
    "Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.",
]

# Path to the model directory after cloning it
model_dir = "{Your MODEL_PATH}"

vector_dim = 1024
vector_linear_directory = f"2_Dense_{vector_dim}"
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True).cuda().eval()
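# You can also use this model without the `use_memory_efficient_attention` and `unpad_inputs` features; that configuration can also run on CPU (drop the `.cuda()` calls throughout in that case).
# model = AutoModel.from_pretrained(model_dir, trust_remote_code=True, use_memory_efficient_attention=False, unpad_inputs=False).cuda().eval()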
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
vector_linear = torch.nn.Linear(in_features=model.config.hidden_size, out_features=vector_dim)
vector_linear_dict = {
    k.replace("linear.", ""): v for k, v in
    torch.load(os.path.join(model_dir, f"{vector_linear_directory}/pytorch_model.bin")).items()
}
vector_linear.load_state_dict(vector_linear_dict)
vector_linear.cuda()

# Embed the queries
with torch.no_grad():
    input_data = tokenizer(queries, padding="longest", truncation=True, max_length=512, return_tensors="pt")
    input_data = {k: v.cuda() for k, v in input_data.items()}
    attention_mask = input_data["attention_mask"]
    last_hidden_state = model(**input_data)[0]
    last_hidden = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
    query_vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    query_vectors = normalize(vector_linear(query_vectors).cpu().numpy())

# Embed the documents
with torch.no_grad():
    input_data = tokenizer(docs, padding="longest", truncation=True, max_length=512, return_tensors="pt")
    input_data = {k: v.cuda() for k, v in input_data.items()}
    attention_mask = input_data["attention_mask"]
    last_hidden_state = model(**input_data)[0]
    last_hidden = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
    docs_vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    docs_vectors = normalize(vector_linear(docs_vectors).cpu().numpy())

print(query_vectors.shape, docs_vectors.shape)
# (2, 1024) (2, 1024)

similarities = query_vectors @ docs_vectors.T
print(similarities)
# [[0.8397531  0.29900077]
#  [0.32818374 0.80954516]]
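The pooling step in the Transformers example above zeroes out padded positions, averages the remaining token vectors, and L2-normalizes the result so that a dot product gives cosine similarity. The sketch below reproduces only that arithmetic on toy data in plain Python. It omits the model and the linear projection, and the function names are hypothetical.

```python
import math


def masked_mean_pool(hidden, mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(hidden[0])
    total = [0.0] * dim
    for vec, keep in zip(hidden, mask):
        if keep:
            for i, value in enumerate(vec):
                total[i] += value
    count = sum(mask)
    return [value / count for value in total]


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    # Three token vectors; the last one is padding (mask 0) and must not count.
    hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
    mask = [1, 1, 0]
    pooled = masked_mean_pool(hidden, mask)
    unit = l2_normalize(pooled)
    print(pooled, dot(unit, unit))
```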

infinity_emb

Usage via infinity (MIT licensed).

docker run \
--gpus all -p "7997":"7997" \
michaelf34/infinity:0.0.69 \
v2 --model-id dunzhang/stella_en_400M_v5 --revision "refs/pr/24" --dtype bfloat16 --batch-size 16 --device cuda --engine torch --port 7997 --no-bettertransformer
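Once the container is running, a client could request embeddings over HTTP. The sketch below is an assumption-heavy illustration rather than documented usage: it assumes the server is reachable at `localhost:7997` and exposes an OpenAI-style `POST /embeddings` route whose response holds a `data` list of objects with an `embedding` field. Check the infinity documentation for the actual routes before relying on it. The function names are hypothetical.

```python
import json
import urllib.request

# Assumptions: host/port match the docker command above, and the server
# exposes an OpenAI-style POST /embeddings route.
BASE_URL = "http://localhost:7997"
MODEL_ID = "dunzhang/stella_en_400M_v5"


def build_embedding_request(texts, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a JSON POST request for a batch of texts."""
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts):
    """Send the request and return one embedding per input text (assumed response shape)."""
    with urllib.request.urlopen(build_embedding_request(texts)) as response:
        body = json.load(response)
    return [item["embedding"] for item in body["data"]]


if __name__ == "__main__":
    request = build_embedding_request(["What are some ways to reduce stress?"])
    print(request.get_method(), request.full_url)
```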

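💻 Usage Examples

Basic Usage

The SentenceTransformers and transformers examples above show how to embed queries and documents with the model and compute their similarity. This is the basic usage pattern for common tasks.

Advanced Usage

The infinity_emb example runs a specific Docker image with model-related arguments. It suits scenarios with higher performance and efficiency requirements.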

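📚 Detailed Documentation

FAQ

Q: What are the training details?

A: The training method and datasets will be released in the future (exact timing unknown; they may be provided in a paper).

Q: How do I choose a suitable prompt for my own task?

A: In most cases, use the s2p and s2s prompts. These two prompts account for the vast majority of the training data.

Q: How do I reproduce the MTEB results?

A: Use the evaluation scripts from Alibaba-NLP/gte-Qwen2-1.5B-instruct or intfloat/e5-mistral-7b-instruct.

Q: Why does each dimension have its own linear weight?

A: MRL can be trained in several ways; we chose the one that performed best.

Q: What is the model's sequence length?

A: 512 is recommended. In our experiments, almost all models performed poorly on specialized long-text retrieval datasets. In addition, the model was trained on data with a sequence length of 512. This may be an area for future optimization.

If you have any questions, please start a discussion in the community.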

📄 License

This project is released under the MIT License.

🔧 Technical Details

Model Evaluation Results

Dataset	Task type	Main score	Other metrics
MTEB AmazonCounterfactualClassification (en)	Classification	92.35820895522387	accuracy: 92.35820895522387 ap: 70.81322736988783 ap_weighted: 70.81322736988783 f1: 88.9505466159595 f1_weighted: 92.68630932872613
MTEB AmazonPolarityClassification	Classification	97.1945	accuracy: 97.1945 ap: 96.08192192244094 ap_weighted: 96.08192192244094 f1: 97.1936887167346 f1_weighted: 97.1936887167346
...	...	...	...