dunzhang-stella_en_400M_v5: an open-source English text-processing model


Dunzhang Stella En 400M V5

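Developed by Marqo

Stella 400M is a medium-sized English text-processing model focused on classification and information retrieval tasks.

Text classification

Transformers

License: MIT #high-accuracy text classification #e-commerce review analysis #multi-task evaluation

Downloads: 17.20k

Release date: 9/25/2024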

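Model overview

The model is mainly used for text classification and information retrieval, and it scores well on several standard MTEB datasets.

Model features

High-performance classification

Reaches 97.19% accuracy on the Amazon product review polarity classification task

Multi-task capability

Supports multiple text-processing tasks, including classification and information retrieval

Medium size

A 400M-parameter design that balances performance and efficiency

Model capabilities

Text classification

Sentiment analysis

Information retrieval

Text similarity computation

Use cases

E-commerce

Product review classification

Automatically classifies the sentiment of Amazon product reviews

Reaches 97.19% accuracy on the Amazon polarity classification task

Multi-class review classification

Classifies Amazon reviews by star rating

Reaches 59.53% accuracy on the Amazon reviews multi-class classification task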
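
The page does not show how a classifier is built on top of the embeddings. One simple possibility, sketched here purely as an assumption, is a nearest-centroid classifier: average the embeddings of labeled examples per class, then assign new text to the most similar centroid. The tiny hand-written vectors stand in for real embeddings, and `fit_centroids` and `predict` are hypothetical helper names. This is not the procedure behind the benchmark accuracies quoted above.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (guarding against all-zero rows)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def fit_centroids(embeddings: np.ndarray, labels: list):
    """Return the sorted class names and one normalized mean vector per class."""
    classes = sorted(set(labels))
    label_array = np.asarray(labels)
    centroids = np.stack(
        [embeddings[label_array == c].mean(axis=0) for c in classes]
    )
    return classes, l2_normalize(centroids)


def predict(embeddings: np.ndarray, classes: list, centroids: np.ndarray) -> list:
    """Assign each embedding to the class whose centroid is most similar."""
    scores = l2_normalize(embeddings) @ centroids.T
    return [classes[i] for i in scores.argmax(axis=1)]


# Toy 2-D "embeddings" (made up for illustration, not model output).
train = l2_normalize(np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]))
train_labels = ["positive", "positive", "negative", "negative"]

classes, centroids = fit_centroids(train, train_labels)
predictions = predict(np.array([[0.95, 0.05], [0.0, 1.0]]), classes, centroids)
print(predictions)  # ['positive', 'negative']
```

With real data, the rows of `train` would be the normalized vectors produced by the model for labeled reviews.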

Information retrieval

Argument retrieval

Retrieves matching arguments on the ArguAna dataset

Achieves a main score of 64.24

🚀 Marqo Stella v2

Marqo Stella v2 is a model similar to the original Dunzhang stella 400m model, with an added Matryoshka layer. This layer lowers the computational cost of generating embeddings without changing the relevance metrics.
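
As a rough illustration of how Matryoshka-style embeddings are commonly consumed, the sketch below keeps only the leading dimensions of each vector and re-normalizes the result. The random vectors stand in for real model output. The function name `truncate_embeddings` and the target width of 256 are assumptions made for illustration, since this page does not state which reduced dimensions the model supports.

```python
import numpy as np


def truncate_embeddings(vectors: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and L2-normalize the result."""
    if dim <= 0 or dim > vectors.shape[1]:
        raise ValueError("dim must be between 1 and the embedding width")
    truncated = vectors[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return truncated / np.clip(norms, 1e-12, None)


# Stand-in for model output: 2 embeddings of width 1024 (random, not real data).
rng = np.random.default_rng(0)
full = rng.normal(size=(2, 1024))

small = truncate_embeddings(full, 256)  # 256 is an assumed target width
print(small.shape)  # (2, 256)
```

Shorter vectors reduce storage and similarity-search cost. Whether retrieval quality holds at a given width should be checked on your own data.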

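🚀 Quick start

Environment setup

Make sure the required libraries are installed: transformers, torch, and scikit-learn (imported as sklearn). You can install them with:

pip install transformers torch scikit-learn

Code example

The following example uses the model to embed queries and documents, then computes the similarity between them: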

import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.preprocessing import normalize

# Prompt that is prepended to every query
query_prompt = "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: "
# Queries to embed
queries = [
    "What are some ways to reduce stress?",
    "What are the benefits of drinking green tea?",
]
# Prepend the prompt to each query
queries = [query_prompt + query for query in queries]
# Documents to embed; documents do not need a prompt
docs = [
    "There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.",
    "Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.",
]

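# Model ID on the Hugging Face Hub, or the local path if you have cloned the model
model_dir = "Marqo/dunzhang-stella_en_400M_v5"
# Load the model, move it to the GPU, and switch to evaluation mode (requires CUDA)
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True).cuda().eval()
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)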

# Embed the queries: masked mean pooling over tokens, then L2 normalization
with torch.no_grad():
    input_data = tokenizer(queries, padding="longest", truncation=True, max_length=512, return_tensors="pt")
    input_data = {k: v.cuda() for k, v in input_data.items()}
    attention_mask = input_data["attention_mask"]
    last_hidden_state = model(**input_data)[0]
    last_hidden = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
    query_vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    query_vectors = normalize(query_vectors.cpu().numpy())

# Embed the documents the same way
with torch.no_grad():
    input_data = tokenizer(docs, padding="longest", truncation=True, max_length=512, return_tensors="pt")
    input_data = {k: v.cuda() for k, v in input_data.items()}
    attention_mask = input_data["attention_mask"]
    last_hidden_state = model(**input_data)[0]
    last_hidden = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
    docs_vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    docs_vectors = normalize(docs_vectors.cpu().numpy())

# Print the shapes of the query and document vectors
print(query_vectors.shape, docs_vectors.shape)
# (2, 1024) (2, 1024)

# Compute query-document similarity (cosine similarity, since the vectors are L2-normalized)
similarities = query_vectors @ docs_vectors.T
print(similarities)
# [[0.8397531  0.29900077]
#  [0.32818374 0.80954516]]

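💻 Usage examples

Basic usage

The code above shows how to embed queries and documents with the model and compute the similarity between them. The steps are:

1. Define the queries and documents: prepare the lists of queries and documents to process.
2. Load the model and tokenizer: load both from the specified path.
3. Embed the queries and documents: tokenize the text, run it through the model, and pool the token outputs into one embedding vector per text.
4. Compute similarity: multiply the query matrix by the transposed document matrix to get the similarity scores.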
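
To make the pooling and similarity steps concrete, here is a small NumPy sketch of the same arithmetic on toy data. Padded positions are zeroed out, the remaining token vectors are averaged, and the result is L2-normalized so that a matrix product yields cosine similarity. The arrays are made-up stand-ins for `last_hidden_state` and `attention_mask`, and `masked_mean_pool` is a hypothetical helper name.

```python
import numpy as np


def masked_mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padded positions.

    hidden: (batch, seq_len, dim) token outputs
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    mask = mask[..., None].astype(hidden.dtype)      # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)             # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1.0, None)    # (batch, 1), avoids divide-by-zero
    return summed / counts


# Toy inputs: 2 sequences, 3 token positions, 2 hidden dimensions.
hidden = np.array([
    [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]],  # third position is padding
    [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)
# [[2. 0.]
#  [0. 4.]]

# L2-normalize, after which a dot product equals cosine similarity.
unit = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
print(unit @ unit.T)
# [[1. 0.]
#  [0. 1.]]
```

Note that the padded token `[9.0, 9.0]` has no effect on the first pooled vector, which is the purpose of the attention mask.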

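Advanced usage

You can extend the code to suit your needs, for example:

Batch processing: embed larger sets of queries and documents in fixed-size batches to improve throughput.
Other similarity measures: the vectors are L2-normalized, so the matrix product above is already cosine similarity; alternatives such as Euclidean distance can be substituted.
Combining with other models: fuse this model's output with the output of other models to improve performance.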
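
The batching idea can be sketched as follows, assuming an `embed` callable that maps a list of strings to a 2-D array of L2-normalized vectors. `embed_in_batches`, `top_k`, and the toy `fake_embed` function are hypothetical names introduced for illustration. In practice `embed` would wrap the tokenizer and model calls from the Quick start code.

```python
from typing import Callable, List, Tuple

import numpy as np


def embed_in_batches(
    texts: List[str],
    embed: Callable[[List[str]], np.ndarray],
    batch_size: int = 32,
) -> np.ndarray:
    """Embed texts in fixed-size batches and stack the results row-wise."""
    if not texts:
        raise ValueError("texts must not be empty")
    chunks = [
        embed(texts[start:start + batch_size])
        for start in range(0, len(texts), batch_size)
    ]
    return np.concatenate(chunks, axis=0)


def top_k(
    query_vectors: np.ndarray, doc_vectors: np.ndarray, k: int
) -> Tuple[np.ndarray, np.ndarray]:
    """Return the indices and scores of the k most similar documents per query."""
    scores = query_vectors @ doc_vectors.T
    order = np.argsort(-scores, axis=1)[:, :k]
    return order, np.take_along_axis(scores, order, axis=1)


def fake_embed(texts: List[str]) -> np.ndarray:
    """Toy stand-in for the model: encodes only text length, then normalizes."""
    vectors = np.array([[1.0, float(len(t))] for t in texts])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)


docs = ["bb", "cccc", "dddddddd"]
doc_vectors = embed_in_batches(docs, fake_embed, batch_size=2)  # 2 batches
query_vectors = embed_in_batches(["aaaa"], fake_embed)

order, scores = top_k(query_vectors, doc_vectors, k=2)
print(order)  # [[1 2]]
```

Batching keeps peak memory bounded by `batch_size` rather than by the size of the whole corpus, which matters most when the model runs on a GPU.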

📄 License

This project is released under the MIT license.

Model evaluation results

The table below lists the model's evaluation results on multiple datasets:

Dataset	Task type	Main score
MTEB AmazonCounterfactualClassification (en)	Classification	92.35820895522387
MTEB AmazonPolarityClassification	Classification	97.1945
MTEB AmazonReviewsClassification (en)	Classification	59.528
MTEB ArguAna	Retrieval	64.24
MTEB ArxivClusteringP2P	Clustering	55.1564333205451
MTEB ArxivClusteringS2S	Clustering	49.823698316694795
MTEB AskUbuntuDupQuestions	Reranking	66.15294503553424
MTEB BIOSSES	Semantic textual similarity	83.29587385660628
MTEB Banking77Classification	Classification	89.30194805194806
MTEB BiorxivClusteringP2P	Clustering	50.67972171889736
MTEB BiorxivClusteringS2S	Clustering	45.80539715556144
MTEB CQADupstackRetrieval	Retrieval	44.36125
MTEB ClimateFEVER	Retrieval	43.526
MTEB DBPedia	Retrieval	49.884
MTEB EmotionClassification	Classification	78.775
MTEB FEVER	Retrieval	90.986
MTEB FiQA2018	Retrieval	56.056
MTEB HotpotQA	Retrieval	71.742
MTEB ImdbClassification	Classification	96.4904
MTEB MSMARCO	Retrieval	43.692
MTEB MTOPDomainClassification (en)	Classification	98.82580939352485
MTEB MTOPIntentClassification (en)	Classification	92.29822161422709
MTEB MassiveIntentClassification (en)	Classification	85.17484868863484
MTEB MassiveScenarioClassification (en)	Classification	89.61667787491594
MTEB MedrxivClusteringP2P	Clustering	46.318282423948574
MTEB MedrxivClusteringS2S	Clustering	44.29033625273981
MTEB MindSmallReranking	Reranking	33.0526129239962
MTEB NFCorpus	Retrieval	41.486
MTEB NQ	Retrieval	69.072
MTEB QuoraRetrieval	Retrieval	89.58
MTEB RedditClustering	Clustering	71.18966762070158
MTEB RedditClusteringP2P	Clustering	74.42014716862516
MTEB SCIDOCS	Retrieval	25.042
MTEB SICK-R	Semantic textual similarity	82.20531642680812
MTEB STS12	Semantic textual similarity	79.51504881448884
MTEB STS13	Semantic textual similarity	89.18936052329725
MTEB STS14	Semantic textual similarity	85.14654611519086
MTEB STS15	Semantic textual similarity	89.10215217191254
MTEB STS16	Semantic textual similarity	87.14066355879785
MTEB STS17 (en-en)	Semantic textual similarity	90.97082650129164
MTEB STS22 (en)	Semantic textual similarity	67.82870469746828
MTEB STSBenchmark	Semantic textual similarity	87.7360146030987
MTEB SciDocsRR	Reranking	88.43547871921146
MTEB SciFact	Retrieval	78.233
MTEB SprintDuplicateQuestions	Pair classification	95.7485189884476
MTEB StackExchangeClustering	Clustering	78.49205191950675
MTEB StackExchangeClusteringP2P	Clustering	48.90421736513028
MTEB StackOverflowDupQuestions	Reranking	52.9874730481696
MTEB SummEval	Summarization	31.66058223980157
MTEB TRECCOVID	Retrieval	85.206
MTEB Touche2020	Retrieval	31.455
MTEB ToxicConversationsClassification	Classification	86.9384765625
MTEB TweetSentimentExtractionClassification	Classification	73.57668364459535
MTEB TwentyNewsgroupsClustering	Clustering	58.574148097494685
MTEB TwitterSemEval2015	Pair classification	80.18603932881858
MTEB TwitterURLCorpus	Pair classification	87.46554314325058