SFR-Embedding-Mistral
Developed by Salesforce Research
A text embedding model trained on top of E5-mistral-7b-instruct and Mistral-7B-v0.1, aimed primarily at text retrieval tasks.
Downloads: 34.75k
Released: 1/24/2024
Model Overview
This is a Mistral-based text embedding model focused on improving performance on text retrieval tasks. Enhanced through transfer learning, it suits a wide range of information retrieval scenarios.
Model Features
Efficient retrieval
Optimized for text retrieval tasks; efficiently matches queries against relevant documents.
Long-text handling
Accepts inputs of up to 4096 tokens, making it suitable for longer documents.
Transfer-learning enhancement
Trained on top of E5-mistral-7b-instruct and Mistral-7B-v0.1 for stronger performance.
Model Capabilities
Text embedding generation
Semantic similarity computation
Information retrieval
Document matching
Use Cases
Information retrieval
Web search
Match user queries against web content and return the most relevant results; in the usage examples below, queries score much higher against their matching passages than against unrelated ones.
Knowledge-base question answering
Retrieve the answer passages most relevant to a question from a knowledge base.
Content recommendation
Related-article recommendation
Recommend semantically similar articles based on what a user is currently reading.
🚀 SFR-Embedding-Mistral
SFR-Embedding-Mistral was developed by Salesforce Research. The model is trained on top of E5-mistral-7b-instruct and Mistral-7B-v0.1 and is intended primarily for research on text retrieval.
🚀 Quick Start
Model basics
This project is intended for research purposes only. Third-party datasets may be subject to additional terms and conditions under their associated licenses. For more details, please refer to the paper listed in the Citation section below.
Ethical Considerations
This release is intended for research purposes only, in support of an academic paper. Our models, datasets, and code were not specifically designed or evaluated for every downstream use. We strongly recommend that users evaluate and address potential concerns around accuracy, safety, and fairness before deploying this model. Users are encouraged to consider the common limitations of AI, comply with applicable laws, and follow best practices when selecting use cases, especially in high-risk scenarios where errors or misuse could significantly affect people's lives, rights, or safety. For further guidance on use cases, please refer to our AUP and AI AUP.
✨ Key Features
- Model architecture: trained on top of E5-mistral-7b-instruct and Mistral-7B-v0.1, giving it strong text understanding and processing capabilities.
- Application scenarios: suitable for a variety of natural language processing tasks such as text retrieval, classification, and clustering (a clustering sketch follows this list).
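As a sketch of one non-retrieval scenario, the embeddings can be fed to any off-the-shelf clustering algorithm. The snippet below is illustrative only and is not part of the official examples: it assumes scikit-learn in addition to the libraries from the Installation section, and the documents and cluster count are made up for demonstration.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("Salesforce/SFR-Embedding-Mistral")
docs = [
    "How to bake a chocolate cake",
    "Banana bread in five easy steps",
    "Symptoms of the flu",
    "When to see a doctor about a fever",
]
# Normalized embeddings so Euclidean distance tracks cosine similarity
embeddings = model.encode(docs, normalize_embeddings=True)
labels = KMeans(n_clusters=2, n_init="auto").fit_predict(embeddings)
print(labels)  # documents about the same topic should share a cluster id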
📦 Installation
This project does not ship explicit installation steps; install whichever dependencies the usage examples below require. For example, to use the transformers and sentence-transformers libraries, install them with:
pip install transformers sentence-transformers
💻 Usage Examples
Basic usage
Example with the Transformers library
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # With left padding, the final position of every sequence is a real token
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'How to bake a chocolate cake'),
    get_detailed_instruct(task, 'Symptoms of the flu')
]
# No need to add instruction for retrieval documents
passages = [
    "To bake a delicious chocolate cake, you'll need the following ingredients: all-purpose flour, sugar, cocoa powder, baking powder, baking soda, salt, eggs, milk, vegetable oil, and vanilla extract. Start by preheating your oven to 350°F (175°C). In a mixing bowl, combine the dry ingredients (flour, sugar, cocoa powder, baking powder, baking soda, and salt). In a separate bowl, whisk together the wet ingredients (eggs, milk, vegetable oil, and vanilla extract). Gradually add the wet mixture to the dry ingredients, stirring until well combined. Pour the batter into a greased cake pan and bake for 30-35 minutes. Let it cool before frosting with your favorite chocolate frosting. Enjoy your homemade chocolate cake!",
    "The flu, or influenza, is an illness caused by influenza viruses. Common symptoms of the flu include a high fever, chills, cough, sore throat, runny or stuffy nose, body aches, headache, fatigue, and sometimes nausea and vomiting. These symptoms can come on suddenly and are usually more severe than the common cold. It's important to get plenty of rest, stay hydrated, and consult a healthcare professional if you suspect you have the flu. In some cases, antiviral medications can help alleviate symptoms and reduce the duration of the illness."
]

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('Salesforce/SFR-Embedding-Mistral')
model = AutoModel.from_pretrained('Salesforce/SFR-Embedding-Mistral')

# get the embeddings
max_length = 4096
input_texts = queries + passages
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
# [[86.7153549194336, 36.64569091796875], [35.00493621826172, 82.0738525390625]]
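With a 7B-parameter backbone, encoding a large corpus in a single forward pass can exhaust GPU memory. The helper below is a minimal batching sketch, not part of the official example: it reuses the tokenizer, model, max_length, and last_token_pool defined above, and the batch size of 8 is an illustrative assumption.
# Hypothetical helper (not from the model card): encode texts in small
# batches to bound peak memory, reusing the objects defined above.
def encode_in_batches(texts, batch_size=8):
    all_embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = tokenizer(texts[start:start + batch_size], max_length=max_length,
                          padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():  # inference only, no gradients needed
            out = model(**batch)
        pooled = last_token_pool(out.last_hidden_state, batch['attention_mask'])
        all_embeddings.append(F.normalize(pooled, p=2, dim=1))
    return torch.cat(all_embeddings, dim=0)

corpus_embeddings = encode_in_batches(passages)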
Example with the Sentence Transformers library
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Salesforce/SFR-Embedding-Mistral")

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'How to bake a chocolate cake'),
    get_detailed_instruct(task, 'Symptoms of the flu')
]
# No need to add instruction for retrieval documents
passages = [
    "To bake a delicious chocolate cake, you'll need the following ingredients: all-purpose flour, sugar, cocoa powder, baking powder, baking soda, salt, eggs, milk, vegetable oil, and vanilla extract. Start by preheating your oven to 350°F (175°C). In a mixing bowl, combine the dry ingredients (flour, sugar, cocoa powder, baking powder, baking soda, and salt). In a separate bowl, whisk together the wet ingredients (eggs, milk, vegetable oil, and vanilla extract). Gradually add the wet mixture to the dry ingredients, stirring until well combined. Pour the batter into a greased cake pan and bake for 30-35 minutes. Let it cool before frosting with your favorite chocolate frosting. Enjoy your homemade chocolate cake!",
    "The flu, or influenza, is an illness caused by influenza viruses. Common symptoms of the flu include a high fever, chills, cough, sore throat, runny or stuffy nose, body aches, headache, fatigue, and sometimes nausea and vomiting. These symptoms can come on suddenly and are usually more severe than the common cold. It's important to get plenty of rest, stay hydrated, and consult a healthcare professional if you suspect you have the flu. In some cases, antiviral medications can help alleviate symptoms and reduce the duration of the illness."
]

embeddings = model.encode(queries + passages)
scores = util.cos_sim(embeddings[:2], embeddings[2:]) * 100
print(scores.tolist())
# [[86.71537780761719, 36.645721435546875], [35.00497055053711, 82.07388305664062]]
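For ranking more than a handful of passages, the library's built-in top-k search helper avoids materializing the full score matrix. Below is a minimal sketch reusing the model, queries, and passages above; util.semantic_search is part of sentence-transformers, while the top_k value here is an arbitrary illustrative choice.
# Retrieve the 2 best-matching passages for each query.
query_embeddings = model.encode(queries, convert_to_tensor=True)
passage_embeddings = model.encode(passages, convert_to_tensor=True)
hits = util.semantic_search(query_embeddings, passage_embeddings, top_k=2)
for query, query_hits in zip(queries, hits):
    print(query)
    for hit in query_hits:
        print(f"  passage {hit['corpus_id']}: score={hit['score']:.4f}")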
Advanced usage
MTEB benchmark evaluation
To reproduce the evaluation results on the BEIR and MTEB benchmarks, refer to the unilm/e5 project.
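Alternatively, the standalone mteb package offers a shorter path to a single-task smoke test. This is a minimal sketch, not the project's official evaluation script (which is the unilm/e5 setup above); it assumes the mteb package is installed, and the task choice and output folder are illustrative.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Salesforce/SFR-Embedding-Mistral")
evaluation = MTEB(tasks=["Banking77Classification"])  # one small task as a smoke test
evaluation.run(model, output_folder="results/SFR-Embedding-Mistral")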
📚 Documentation
Model evaluation metrics

| Task type | Dataset | Metric | Value |
|---|---|---|---|
| Classification | MTEB AmazonCounterfactualClassification (en) | Accuracy | 77.92537313432834 |
| Classification | MTEB AmazonCounterfactualClassification (en) | AP | 40.86767661556651 |
| Classification | MTEB AmazonCounterfactualClassification (en) | F1 | 71.65758897929837 |
| ... | ... | ... | ... |
Team
The SFR-Embedding team (∗ indicates lead contributors):
- Rui Meng∗
- Ye Liu∗
- Shafiq Rayhan Joty
- Caiming Xiong
- Yingbo Zhou
- Semih Yavuz
引用信息
@misc{SFRAIResearch2024,
title={SFR-Embedding-Mistral:Enhance Text Retrieval with Transfer Learning},
author={Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, Semih Yavuz},
howpublished={Salesforce AI Research Blog},
year={2024},
url={https://www.salesforce.com/blog/sfr-embedding/}
}
📄 License
This project is released under the CC BY-NC 4.0 license.