SFR-Embedding-Mistral
Developed by Salesforce Research
A text embedding model trained on top of E5-mistral-7b-instruct and Mistral-7B-v0.1, aimed primarily at text retrieval tasks.
Downloads: 34.75k
Released: 1/24/2024
Model Overview
This is a Mistral-based text embedding model focused on improving performance on text retrieval tasks. Enhanced through transfer learning, it suits a wide range of information retrieval scenarios.
Model Features
Efficient retrieval
Optimized for text retrieval tasks; efficiently matches queries against relevant documents.
Long-text handling
Accepts inputs of up to 4096 tokens, making it suitable for longer documents.
Transfer-learning enhancement
Trained on top of E5-mistral-7b-instruct and Mistral-7B-v0.1 for stronger performance.
Model Capabilities
Text embedding generation
Semantic similarity computation
Information retrieval
Document matching
Use Cases
Information retrieval
Web search
Match user queries against web content and return the most relevant results; in the usage examples below, queries score much higher against their matching passages than against unrelated ones.
Knowledge-base question answering
Retrieve the answer passages most relevant to a question from a knowledge base.
Content recommendation
Related-article recommendation
Recommend semantically similar articles based on what a user is currently reading.
🚀 SFR-Embedding-Mistral
SFR-Embedding-Mistral was developed by Salesforce Research. The model is trained on top of E5-mistral-7b-instruct and Mistral-7B-v0.1 and is intended primarily for research on text retrieval.
🚀 Quick Start
Model basics
This project is intended for research purposes only. Third-party datasets may be subject to additional terms and conditions under their associated licenses. For more details, please refer to the paper listed in the Citation section below.
Ethical Considerations
This release is intended for research purposes only, in support of an academic paper. Our models, datasets, and code were not specifically designed or evaluated for every downstream use. We strongly recommend that users evaluate and address potential concerns around accuracy, safety, and fairness before deploying this model. Users are encouraged to consider the common limitations of AI, comply with applicable laws, and follow best practices when selecting use cases, especially in high-risk scenarios where errors or misuse could significantly affect people's lives, rights, or safety. For further guidance on use cases, please refer to our AUP and AI AUP.
✨ Key Features
- Model architecture: trained on top of E5-mistral-7b-instruct and Mistral-7B-v0.1, giving it strong text understanding and processing capabilities.
- Application scenarios: suitable for a variety of natural language processing tasks such as text retrieval, classification, and clustering (a clustering sketch follows this list).
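As a sketch of one non-retrieval scenario, the embeddings can be fed to any off-the-shelf clustering algorithm. The snippet below is illustrative only and is not part of the official examples: it assumes scikit-learn in addition to the libraries from the Installation section, and the documents and cluster count are made up for demonstration.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("Salesforce/SFR-Embedding-Mistral")
docs = [
    "How to bake a chocolate cake",
    "Banana bread in five easy steps",
    "Symptoms of the flu",
    "When to see a doctor about a fever",
]
# Normalized embeddings so Euclidean distance tracks cosine similarity
embeddings = model.encode(docs, normalize_embeddings=True)
labels = KMeans(n_clusters=2, n_init="auto").fit_predict(embeddings)
print(labels)  # documents about the same topic should share a cluster id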
📦 Installation
This project does not ship explicit installation steps; install whichever dependencies the usage examples below require. For example, to use the transformers and sentence-transformers libraries, install them with:
pip install transformers sentence-transformers
💻 Usage Examples
Basic usage
Example with the Transformers library
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # With left padding, the final position of every sequence is a real token
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'How to bake a chocolate cake'),
    get_detailed_instruct(task, 'Symptoms of the flu')
]
# No need to add instruction for retrieval documents
passages = [
    "To bake a delicious chocolate cake, you'll need the following ingredients: all-purpose flour, sugar, cocoa powder, baking powder, baking soda, salt, eggs, milk, vegetable oil, and vanilla extract. Start by preheating your oven to 350°F (175°C). In a mixing bowl, combine the dry ingredients (flour, sugar, cocoa powder, baking powder, baking soda, and salt). In a separate bowl, whisk together the wet ingredients (eggs, milk, vegetable oil, and vanilla extract). Gradually add the wet mixture to the dry ingredients, stirring until well combined. Pour the batter into a greased cake pan and bake for 30-35 minutes. Let it cool before frosting with your favorite chocolate frosting. Enjoy your homemade chocolate cake!",
    "The flu, or influenza, is an illness caused by influenza viruses. Common symptoms of the flu include a high fever, chills, cough, sore throat, runny or stuffy nose, body aches, headache, fatigue, and sometimes nausea and vomiting. These symptoms can come on suddenly and are usually more severe than the common cold. It's important to get plenty of rest, stay hydrated, and consult a healthcare professional if you suspect you have the flu. In some cases, antiviral medications can help alleviate symptoms and reduce the duration of the illness."
]

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('Salesforce/SFR-Embedding-Mistral')
model = AutoModel.from_pretrained('Salesforce/SFR-Embedding-Mistral')

# get the embeddings
max_length = 4096
input_texts = queries + passages
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
# [[86.7153549194336, 36.64569091796875], [35.00493621826172, 82.0738525390625]]
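With a 7B-parameter backbone, encoding a large corpus in a single forward pass can exhaust GPU memory. The helper below is a minimal batching sketch, not part of the official example: it reuses the tokenizer, model, max_length, and last_token_pool defined above, and the batch size of 8 is an illustrative assumption.
# Hypothetical helper (not from the model card): encode texts in small
# batches to bound peak memory, reusing the objects defined above.
def encode_in_batches(texts, batch_size=8):
    all_embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = tokenizer(texts[start:start + batch_size], max_length=max_length,
                          padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():  # inference only, no gradients needed
            out = model(**batch)
        pooled = last_token_pool(out.last_hidden_state, batch['attention_mask'])
        all_embeddings.append(F.normalize(pooled, p=2, dim=1))
    return torch.cat(all_embeddings, dim=0)

corpus_embeddings = encode_in_batches(passages)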
Example with the Sentence Transformers library
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Salesforce/SFR-Embedding-Mistral")

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'How to bake a chocolate cake'),
    get_detailed_instruct(task, 'Symptoms of the flu')
]
# No need to add instruction for retrieval documents
passages = [
    "To bake a delicious chocolate cake, you'll need the following ingredients: all-purpose flour, sugar, cocoa powder, baking powder, baking soda, salt, eggs, milk, vegetable oil, and vanilla extract. Start by preheating your oven to 350°F (175°C). In a mixing bowl, combine the dry ingredients (flour, sugar, cocoa powder, baking powder, baking soda, and salt). In a separate bowl, whisk together the wet ingredients (eggs, milk, vegetable oil, and vanilla extract). Gradually add the wet mixture to the dry ingredients, stirring until well combined. Pour the batter into a greased cake pan and bake for 30-35 minutes. Let it cool before frosting with your favorite chocolate frosting. Enjoy your homemade chocolate cake!",
    "The flu, or influenza, is an illness caused by influenza viruses. Common symptoms of the flu include a high fever, chills, cough, sore throat, runny or stuffy nose, body aches, headache, fatigue, and sometimes nausea and vomiting. These symptoms can come on suddenly and are usually more severe than the common cold. It's important to get plenty of rest, stay hydrated, and consult a healthcare professional if you suspect you have the flu. In some cases, antiviral medications can help alleviate symptoms and reduce the duration of the illness."
]

embeddings = model.encode(queries + passages)
scores = util.cos_sim(embeddings[:2], embeddings[2:]) * 100
print(scores.tolist())
# [[86.71537780761719, 36.645721435546875], [35.00497055053711, 82.07388305664062]]
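For ranking more than a handful of passages, the library's built-in top-k search helper avoids materializing the full score matrix. Below is a minimal sketch reusing the model, queries, and passages above; util.semantic_search is part of sentence-transformers, while the top_k value here is an arbitrary illustrative choice.
# Retrieve the 2 best-matching passages for each query.
query_embeddings = model.encode(queries, convert_to_tensor=True)
passage_embeddings = model.encode(passages, convert_to_tensor=True)
hits = util.semantic_search(query_embeddings, passage_embeddings, top_k=2)
for query, query_hits in zip(queries, hits):
    print(query)
    for hit in query_hits:
        print(f"  passage {hit['corpus_id']}: score={hit['score']:.4f}")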
Advanced usage
MTEB benchmark evaluation
To reproduce the evaluation results on the BEIR and MTEB benchmarks, refer to the unilm/e5 project.
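Alternatively, the standalone mteb package offers a shorter path to a single-task smoke test. This is a minimal sketch, not the project's official evaluation script (which is the unilm/e5 setup above); it assumes the mteb package is installed, and the task choice and output folder are illustrative.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Salesforce/SFR-Embedding-Mistral")
evaluation = MTEB(tasks=["Banking77Classification"])  # one small task as a smoke test
evaluation.run(model, output_folder="results/SFR-Embedding-Mistral")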
📚 Documentation
Model evaluation metrics

| Task type | Dataset | Metric | Value |
|---|---|---|---|
| Classification | MTEB AmazonCounterfactualClassification (en) | Accuracy | 77.92537313432834 |
| Classification | MTEB AmazonCounterfactualClassification (en) | AP | 40.86767661556651 |
| Classification | MTEB AmazonCounterfactualClassification (en) | F1 | 71.65758897929837 |
| ... | ... | ... | ... |
Team
The SFR-Embedding team (∗ indicates lead contributors):
- Rui Meng∗
- Ye Liu∗
- Shafiq Rayhan Joty
- Caiming Xiong
- Yingbo Zhou
- Semih Yavuz
引用信息
@misc{SFRAIResearch2024,
title={SFR-Embedding-Mistral:Enhance Text Retrieval with Transfer Learning},
author={Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, Semih Yavuz},
howpublished={Salesforce AI Research Blog},
year={2024},
url={https://www.salesforce.com/blog/sfr-embedding/}
}
📄 License
This project is released under the CC BY-NC 4.0 license.