# SFR-Embedding-2_R
A general-purpose text embedding model developed by Salesforce that performs strongly across a wide range of NLP tasks.
Downloads: 26.90k
Released: 6/14/2024
## Model Overview
A high-performance text embedding model that converts text into high-quality vector representations suitable for a wide range of natural language processing tasks.
## Model Features
- **Strong multi-task performance**: performs well across classification, clustering, retrieval, and other NLP tasks.
- **General-purpose embeddings**: produces high-quality text vector representations usable in many downstream tasks.
- **Semantic understanding**: accurately captures the semantics of text and excels at semantic similarity tasks.
## Model Capabilities
- Text classification
- Text clustering
- Information retrieval
- Semantic similarity computation
- Text reranking

A short clustering sketch follows this list to show how these tasks sit on top of the embeddings.
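As a concrete illustration of the clustering capability, here is a minimal sketch that groups embedded texts with scikit-learn's KMeans. The use of sentence-transformers and scikit-learn here, and the example sentences, are illustrative assumptions rather than part of the official card.

```python
# Minimal sketch: text clustering on top of SFR-Embedding-2_R vectors.
# Assumes sentence-transformers and scikit-learn are installed; the
# example sentences are hypothetical.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("Salesforce/SFR-Embedding-2_R")

texts = [
    "The stock market rallied after the earnings report.",
    "Investors reacted to the quarterly results.",
    "The recipe calls for two cups of flour.",
    "Knead the dough until it is smooth.",
]

embeddings = model.encode(texts, normalize_embeddings=True)

# Two clusters expected: finance vs. cooking.
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(embeddings)
print(labels)  # e.g. [0 0 1 1]
```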
## Use Cases
### E-commerce
- **Product review classification**: classifying Amazon product reviews; 97.31% accuracy on the AmazonPolarity dataset.
- **Counterfactual review detection**: identifying counterfactual reviews on Amazon; 92.72% accuracy on the AmazonCounterfactual dataset.

### Finance
- **Bank customer query classification**: classifying customer queries to a bank; 90.02% accuracy on the Banking77 dataset.

### Academic Research
- **Paper clustering**: clustering arXiv and bioRxiv papers; v_measure of 54.02 on ArxivClusteringP2P.

A sketch of the classification pattern behind these numbers follows this list.
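The classification use cases above typically fit a lightweight classifier on top of frozen embeddings. Below is a minimal sketch of that pattern using scikit-learn's LogisticRegression with hypothetical toy data; it is not the exact MTEB evaluation protocol that produced the accuracies above.

```python
# Minimal sketch: sentiment classification on frozen embeddings with a
# linear classifier. Toy data is hypothetical.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("Salesforce/SFR-Embedding-2_R")

train_texts = [
    "Great product, works perfectly!",
    "Arrived quickly and exceeded expectations.",
    "Terrible quality, broke after one day.",
    "Waste of money, would not recommend.",
]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

clf = LogisticRegression(max_iter=1000)
clf.fit(model.encode(train_texts), train_labels)
print(clf.predict(model.encode(["Absolutely love it."])))  # e.g. [1]
```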
## 🚀 Salesforce/SFR-Embedding-2_R
SFR-Embedding by Salesforce Research, for research purposes only.
This model is designed for research, and more technical details will be published later. In the meantime, please refer to our previous work, SFR-Embedding, for details.
## 🚀 Quick Start
### Using the Transformers library
```python
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'How to bake a chocolate cake'),
    get_detailed_instruct(task, 'Symptoms of the flu')
]
# No need to add instruction for retrieval documents
passages = [
    "To bake a delicious chocolate cake, you'll need the following ingredients: all-purpose flour, sugar, cocoa powder, baking powder, baking soda, salt, eggs, milk, vegetable oil, and vanilla extract. Start by preheating your oven to 350°F (175°C). In a mixing bowl, combine the dry ingredients (flour, sugar, cocoa powder, baking powder, baking soda, and salt). In a separate bowl, whisk together the wet ingredients (eggs, milk, vegetable oil, and vanilla extract). Gradually add the wet mixture to the dry ingredients, stirring until well combined. Pour the batter into a greased cake pan and bake for 30-35 minutes. Let it cool before frosting with your favorite chocolate frosting. Enjoy your homemade chocolate cake!",
    "The flu, or influenza, is an illness caused by influenza viruses. Common symptoms of the flu include a high fever, chills, cough, sore throat, runny or stuffy nose, body aches, headache, fatigue, and sometimes nausea and vomiting. These symptoms can come on suddenly and are usually more severe than the common cold. It's important to get plenty of rest, stay hydrated, and consult a healthcare professional if you suspect you have the flu. In some cases, antiviral medications can help alleviate symptoms and reduce the duration of the illness."
]

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('Salesforce/SFR-Embedding-2_R')
model = AutoModel.from_pretrained('Salesforce/SFR-Embedding-2_R')

# get the embeddings
max_length = 4096
input_texts = queries + passages
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
# [[40.132083892822266, 25.032529830932617], [15.006855010986328, 39.93733215332031]]
```
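Note that the snippet uses last-token pooling rather than mean pooling, a convention common among decoder-based embedding models; `last_token_pool` detects left padding so the last real token is selected either way. For more than a handful of texts, you will likely want batched, no-grad inference on a GPU. A minimal sketch under those assumptions (CUDA available, enough GPU memory), reusing names from the snippet above:

```python
# Minimal sketch: batched, no-grad encoding on a GPU. Assumes a CUDA
# device with enough memory, and reuses tokenizer, model,
# last_token_pool, and passages from the snippet above.
import torch
import torch.nn.functional as F

device = "cuda"
model = model.to(device).eval()

def encode(texts, batch_size=8, max_length=4096):
    chunks = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], max_length=max_length,
                          padding=True, truncation=True,
                          return_tensors="pt").to(device)
        with torch.no_grad():
            out = model(**batch)
        embs = last_token_pool(out.last_hidden_state, batch["attention_mask"])
        chunks.append(F.normalize(embs, p=2, dim=1).cpu())
    return torch.cat(chunks)

doc_embeddings = encode(passages)
```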
### Using the Sentence Transformers library
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Salesforce/SFR-Embedding-2_R")


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'How to bake a chocolate cake'),
    get_detailed_instruct(task, 'Symptoms of the flu')
]
# No need to add instruction for retrieval documents
passages = [
    "To bake a delicious chocolate cake, you'll need the following ingredients: all-purpose flour, sugar, cocoa powder, baking powder, baking soda, salt, eggs, milk, vegetable oil, and vanilla extract. Start by preheating your oven to 350°F (175°C). In a mixing bowl, combine the dry ingredients (flour, sugar, cocoa powder, baking powder, baking soda, and salt). In a separate bowl, whisk together the wet ingredients (eggs, milk, vegetable oil, and vanilla extract). Gradually add the wet mixture to the dry ingredients, stirring until well combined. Pour the batter into a greased cake pan and bake for 30-35 minutes. Let it cool before frosting with your favorite chocolate frosting. Enjoy your homemade chocolate cake!",
    "The flu, or influenza, is an illness caused by influenza viruses. Common symptoms of the flu include a high fever, chills, cough, sore throat, runny or stuffy nose, body aches, headache, fatigue, and sometimes nausea and vomiting. These symptoms can come on suddenly and are usually more severe than the common cold. It's important to get plenty of rest, stay hydrated, and consult a healthcare professional if you suspect you have the flu. In some cases, antiviral medications can help alleviate symptoms and reduce the duration of the illness."
]

embeddings = model.encode(queries + passages)
scores = model.similarity(embeddings[:2], embeddings[2:]) * 100
print(scores.tolist())
# [[40.13203811645508, 25.032546997070312], [15.00684642791748, 39.937339782714844]]
```
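Once a corpus is embedded, retrieval reduces to encoding the instructed query and taking the highest-scoring passages. A minimal sketch on top of the snippet above; the query text is hypothetical, and `model.similarity` is the same API used there:

```python
# Minimal sketch: top-k retrieval over a small corpus, reusing model,
# get_detailed_instruct, task, and passages from the snippet above.
import torch

passage_embeddings = model.encode(passages)  # encode the corpus once

query = get_detailed_instruct(task, 'What temperature should the oven be preheated to?')
query_embedding = model.encode([query])

scores = model.similarity(query_embedding, passage_embeddings)[0]
top = torch.topk(scores, k=min(2, len(passages)))
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{score:.3f}  {passages[idx][:60]}...")
```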
## 📚 Documentation
### Ethical Considerations
This release is for research purposes only, in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for every downstream application. We strongly recommend that users evaluate and address potential concerns around accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and follow best practices when selecting use cases, particularly in high-risk scenarios where errors or misuse could significantly affect people's lives, rights, or safety. For further guidance on use cases, please refer to our AUP and AI AUP.
### Team Members
The SFR-Embedding team (* denotes equal contribution, † denotes co-leads):
- Rui Meng*
- Ye Liu*
- Tong Niu
- Shafiq Rayhan Joty
- Caiming Xiong †
- Yingbo Zhou †
- Semih Yavuz †
### Citation
```bibtex
@misc{SFR-embedding-2,
      title={SFR-Embedding-2: Advanced Text Embedding with Multi-stage Training},
      author={Rui Meng*, Ye Liu*, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, Semih Yavuz},
      year={2024},
      url={https://huggingface.co/Salesforce/SFR-Embedding-2_R}
}
```
## 📄 License
This project is released under the CC BY-NC 4.0 license.
## 📊 Model Evaluation Results
| Task Type | Dataset | Metric | Value |
| --- | --- | --- | --- |
| Classification | MTEB AmazonCounterfactualClassification (en) | accuracy | 92.71641791044776 |
| Classification | MTEB AmazonCounterfactualClassification (en) | ap | 69.47931007147756 |
| Classification | MTEB AmazonCounterfactualClassification (en) | f1 | 88.0252625393374 |
| ... | ... | ... | ... |
| Retrieval | MTEB Touche2020 | map_at_1 | 2.806 |
| Retrieval | MTEB Touche2020 | map_at_10 | 11.369 |
| Retrieval | MTEB Touche2020 | map_at_100 | 17.791 |
| ... | ... | ... | ... |

(The full table is very long; only a sample is shown here. See the original document for complete results.)
## Similar Models
| Model | Author | License | Description | Downloads | Likes |
| --- | --- | --- | --- | --- | --- |
| Jina Embeddings V3 | jinaai | | A multilingual sentence embedding model supporting over 100 languages, focused on sentence similarity and feature extraction. | 3.7M | 911 |
| Ms Marco MiniLM L6 V2 | cross-encoder | Apache-2.0 | A cross-encoder trained on the MS MARCO passage ranking task, used to score query-passage relevance in information retrieval. | 2.5M | 86 |
| Opensearch Neural Sparse Encoding Doc V2 Distill | opensearch-project | Apache-2.0 | A distillation-based sparse retrieval model optimized for OpenSearch; supports inference-free document encoding and outperforms V1 on search relevance and efficiency. | 1.8M | 7 |
| Sapbert From PubMedBERT Fulltext | cambridgeltl | Apache-2.0 | A biomedical entity representation model based on PubMedBERT, using self-alignment pretraining to better capture semantic relations. | 1.7M | 49 |
| Gte Large | thenlper | MIT | A strong sentence-transformer model focused on sentence similarity and text embedding, with excellent results across multiple benchmarks. | 1.5M | 278 |
| Gte Base En V1.5 | Alibaba-NLP | Apache-2.0 | An English sentence-transformer model focused on sentence similarity, with strong results on multiple text embedding benchmarks. | 1.5M | 63 |
| Gte Multilingual Base | Alibaba-NLP | Apache-2.0 | A multilingual sentence embedding model supporting over 50 languages, suited to sentence similarity computation and related tasks. | 1.2M | 246 |
| Polybert | kuelumbus | | A chemical language model for fully machine-driven, ultrafast polymer informatics; maps PSMILES strings to 600-dimensional dense fingerprints that numerically represent polymer chemical structures. | 1.0M | 5 |
| Bert Base Turkish Cased Mean Nli Stsb Tr | emrecan | Apache-2.0 | A Turkish-BERT-based sentence embedding model optimized for semantic similarity tasks. | 1.0M | 40 |
| GIST Small Embedding V0 | avsolatorio | MIT | A text embedding model fine-tuned from BAAI/bge-small-en-v1.5 on the MEDI dataset and MTEB classification datasets, with improved query encoding for retrieval. | 945.68k | 29 |