# SFR-Embedding-2_R
A general-purpose text embedding model developed by Salesforce that performs strongly across a wide range of NLP tasks.
Downloads: 26.90k
Released: 6/14/2024
## Model Overview
A high-performance text embedding model that converts text into high-quality vector representations suitable for a wide range of natural language processing tasks.
## Model Features
- **Strong multi-task performance**: performs well across classification, clustering, retrieval, and other NLP tasks.
- **General-purpose embeddings**: produces high-quality text vector representations usable in many downstream tasks.
- **Semantic understanding**: accurately captures the semantics of text and excels at semantic similarity tasks.
## Model Capabilities
- Text classification
- Text clustering
- Information retrieval
- Semantic similarity computation
- Text reranking

A short clustering sketch follows this list to show how these tasks sit on top of the embeddings.
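As a concrete illustration of the clustering capability, here is a minimal sketch that groups embedded texts with scikit-learn's KMeans. The use of sentence-transformers and scikit-learn here, and the example sentences, are illustrative assumptions rather than part of the official card.

```python
# Minimal sketch: text clustering on top of SFR-Embedding-2_R vectors.
# Assumes sentence-transformers and scikit-learn are installed; the
# example sentences are hypothetical.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("Salesforce/SFR-Embedding-2_R")

texts = [
    "The stock market rallied after the earnings report.",
    "Investors reacted to the quarterly results.",
    "The recipe calls for two cups of flour.",
    "Knead the dough until it is smooth.",
]

embeddings = model.encode(texts, normalize_embeddings=True)

# Two clusters expected: finance vs. cooking.
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(embeddings)
print(labels)  # e.g. [0 0 1 1]
```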
## Use Cases
### E-commerce
- **Product review classification**: classifying Amazon product reviews; 97.31% accuracy on the AmazonPolarity dataset.
- **Counterfactual review detection**: identifying counterfactual reviews on Amazon; 92.72% accuracy on the AmazonCounterfactual dataset.

### Finance
- **Bank customer query classification**: classifying customer queries to a bank; 90.02% accuracy on the Banking77 dataset.

### Academic Research
- **Paper clustering**: clustering arXiv and bioRxiv papers; v_measure of 54.02 on ArxivClusteringP2P.

A sketch of the classification pattern behind these numbers follows this list.
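The classification use cases above typically fit a lightweight classifier on top of frozen embeddings. Below is a minimal sketch of that pattern using scikit-learn's LogisticRegression with hypothetical toy data; it is not the exact MTEB evaluation protocol that produced the accuracies above.

```python
# Minimal sketch: sentiment classification on frozen embeddings with a
# linear classifier. Toy data is hypothetical.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("Salesforce/SFR-Embedding-2_R")

train_texts = [
    "Great product, works perfectly!",
    "Arrived quickly and exceeded expectations.",
    "Terrible quality, broke after one day.",
    "Waste of money, would not recommend.",
]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

clf = LogisticRegression(max_iter=1000)
clf.fit(model.encode(train_texts), train_labels)
print(clf.predict(model.encode(["Absolutely love it."])))  # e.g. [1]
```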
## 🚀 Salesforce/SFR-Embedding-2_R
SFR-Embedding by Salesforce Research, for research purposes only.
This model is designed for research, and more technical details will be published later. In the meantime, please refer to our previous work, SFR-Embedding, for details.
## 🚀 Quick Start
### Using the Transformers library
```python
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'How to bake a chocolate cake'),
    get_detailed_instruct(task, 'Symptoms of the flu')
]
# No need to add instruction for retrieval documents
passages = [
    "To bake a delicious chocolate cake, you'll need the following ingredients: all-purpose flour, sugar, cocoa powder, baking powder, baking soda, salt, eggs, milk, vegetable oil, and vanilla extract. Start by preheating your oven to 350°F (175°C). In a mixing bowl, combine the dry ingredients (flour, sugar, cocoa powder, baking powder, baking soda, and salt). In a separate bowl, whisk together the wet ingredients (eggs, milk, vegetable oil, and vanilla extract). Gradually add the wet mixture to the dry ingredients, stirring until well combined. Pour the batter into a greased cake pan and bake for 30-35 minutes. Let it cool before frosting with your favorite chocolate frosting. Enjoy your homemade chocolate cake!",
    "The flu, or influenza, is an illness caused by influenza viruses. Common symptoms of the flu include a high fever, chills, cough, sore throat, runny or stuffy nose, body aches, headache, fatigue, and sometimes nausea and vomiting. These symptoms can come on suddenly and are usually more severe than the common cold. It's important to get plenty of rest, stay hydrated, and consult a healthcare professional if you suspect you have the flu. In some cases, antiviral medications can help alleviate symptoms and reduce the duration of the illness."
]

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('Salesforce/SFR-Embedding-2_R')
model = AutoModel.from_pretrained('Salesforce/SFR-Embedding-2_R')

# get the embeddings
max_length = 4096
input_texts = queries + passages
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
# [[40.132083892822266, 25.032529830932617], [15.006855010986328, 39.93733215332031]]
```
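Note that the snippet uses last-token pooling rather than mean pooling, a convention common among decoder-based embedding models; `last_token_pool` detects left padding so the last real token is selected either way. For more than a handful of texts, you will likely want batched, no-grad inference on a GPU. A minimal sketch under those assumptions (CUDA available, enough GPU memory), reusing names from the snippet above:

```python
# Minimal sketch: batched, no-grad encoding on a GPU. Assumes a CUDA
# device with enough memory, and reuses tokenizer, model,
# last_token_pool, and passages from the snippet above.
import torch
import torch.nn.functional as F

device = "cuda"
model = model.to(device).eval()

def encode(texts, batch_size=8, max_length=4096):
    chunks = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], max_length=max_length,
                          padding=True, truncation=True,
                          return_tensors="pt").to(device)
        with torch.no_grad():
            out = model(**batch)
        embs = last_token_pool(out.last_hidden_state, batch["attention_mask"])
        chunks.append(F.normalize(embs, p=2, dim=1).cpu())
    return torch.cat(chunks)

doc_embeddings = encode(passages)
```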
### Using the Sentence Transformers library
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Salesforce/SFR-Embedding-2_R")


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'How to bake a chocolate cake'),
    get_detailed_instruct(task, 'Symptoms of the flu')
]
# No need to add instruction for retrieval documents
passages = [
    "To bake a delicious chocolate cake, you'll need the following ingredients: all-purpose flour, sugar, cocoa powder, baking powder, baking soda, salt, eggs, milk, vegetable oil, and vanilla extract. Start by preheating your oven to 350°F (175°C). In a mixing bowl, combine the dry ingredients (flour, sugar, cocoa powder, baking powder, baking soda, and salt). In a separate bowl, whisk together the wet ingredients (eggs, milk, vegetable oil, and vanilla extract). Gradually add the wet mixture to the dry ingredients, stirring until well combined. Pour the batter into a greased cake pan and bake for 30-35 minutes. Let it cool before frosting with your favorite chocolate frosting. Enjoy your homemade chocolate cake!",
    "The flu, or influenza, is an illness caused by influenza viruses. Common symptoms of the flu include a high fever, chills, cough, sore throat, runny or stuffy nose, body aches, headache, fatigue, and sometimes nausea and vomiting. These symptoms can come on suddenly and are usually more severe than the common cold. It's important to get plenty of rest, stay hydrated, and consult a healthcare professional if you suspect you have the flu. In some cases, antiviral medications can help alleviate symptoms and reduce the duration of the illness."
]

embeddings = model.encode(queries + passages)
scores = model.similarity(embeddings[:2], embeddings[2:]) * 100
print(scores.tolist())
# [[40.13203811645508, 25.032546997070312], [15.00684642791748, 39.937339782714844]]
```
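Once a corpus is embedded, retrieval reduces to encoding the instructed query and taking the highest-scoring passages. A minimal sketch on top of the snippet above; the query text is hypothetical, and `model.similarity` is the same API used there:

```python
# Minimal sketch: top-k retrieval over a small corpus, reusing model,
# get_detailed_instruct, task, and passages from the snippet above.
import torch

passage_embeddings = model.encode(passages)  # encode the corpus once

query = get_detailed_instruct(task, 'What temperature should the oven be preheated to?')
query_embedding = model.encode([query])

scores = model.similarity(query_embedding, passage_embeddings)[0]
top = torch.topk(scores, k=min(2, len(passages)))
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{score:.3f}  {passages[idx][:60]}...")
```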
## 📚 Documentation
### Ethical Considerations
This release is for research purposes only, in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for every downstream application. We strongly recommend that users evaluate and address potential concerns around accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and follow best practices when selecting use cases, particularly in high-risk scenarios where errors or misuse could significantly affect people's lives, rights, or safety. For further guidance on use cases, please refer to our AUP and AI AUP.
### Team Members
The SFR-Embedding team (* denotes equal contribution, † denotes co-leads):
- Rui Meng*
- Ye Liu*
- Tong Niu
- Shafiq Rayhan Joty
- Caiming Xiong †
- Yingbo Zhou †
- Semih Yavuz †
### Citation
```bibtex
@misc{SFR-embedding-2,
      title={SFR-Embedding-2: Advanced Text Embedding with Multi-stage Training},
      author={Rui Meng*, Ye Liu*, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, Semih Yavuz},
      year={2024},
      url={https://huggingface.co/Salesforce/SFR-Embedding-2_R}
}
```
## 📄 License
This project is released under the CC BY-NC 4.0 license.
## 📊 Model Evaluation Results
| Task Type | Dataset | Metric | Value |
| --- | --- | --- | --- |
| Classification | MTEB AmazonCounterfactualClassification (en) | accuracy | 92.71641791044776 |
| Classification | MTEB AmazonCounterfactualClassification (en) | ap | 69.47931007147756 |
| Classification | MTEB AmazonCounterfactualClassification (en) | f1 | 88.0252625393374 |
| ... | ... | ... | ... |
| Retrieval | MTEB Touche2020 | map_at_1 | 2.806 |
| Retrieval | MTEB Touche2020 | map_at_10 | 11.369 |
| Retrieval | MTEB Touche2020 | map_at_100 | 17.791 |
| ... | ... | ... | ... |

(The full table is very long; only a sample is shown here. See the original document for complete results.)
## Similar Models
| Model | Author | License | Description | Downloads | Likes |
| --- | --- | --- | --- | --- | --- |
| Jina Embeddings V3 | jinaai | | A multilingual sentence embedding model supporting over 100 languages, focused on sentence similarity and feature extraction. | 3.7M | 911 |
| Ms Marco MiniLM L6 V2 | cross-encoder | Apache-2.0 | A cross-encoder trained on the MS MARCO passage ranking task, used to score query-passage relevance in information retrieval. | 2.5M | 86 |
| Opensearch Neural Sparse Encoding Doc V2 Distill | opensearch-project | Apache-2.0 | A distillation-based sparse retrieval model optimized for OpenSearch; supports inference-free document encoding and outperforms V1 on search relevance and efficiency. | 1.8M | 7 |
| Sapbert From PubMedBERT Fulltext | cambridgeltl | Apache-2.0 | A biomedical entity representation model based on PubMedBERT, using self-alignment pretraining to better capture semantic relations. | 1.7M | 49 |
| Gte Large | thenlper | MIT | A strong sentence-transformer model focused on sentence similarity and text embedding, with excellent results across multiple benchmarks. | 1.5M | 278 |
| Gte Base En V1.5 | Alibaba-NLP | Apache-2.0 | An English sentence-transformer model focused on sentence similarity, with strong results on multiple text embedding benchmarks. | 1.5M | 63 |
| Gte Multilingual Base | Alibaba-NLP | Apache-2.0 | A multilingual sentence embedding model supporting over 50 languages, suited to sentence similarity computation and related tasks. | 1.2M | 246 |
| Polybert | kuelumbus | | A chemical language model for fully machine-driven, ultrafast polymer informatics; maps PSMILES strings to 600-dimensional dense fingerprints that numerically represent polymer chemical structures. | 1.0M | 5 |
| Bert Base Turkish Cased Mean Nli Stsb Tr | emrecan | Apache-2.0 | A Turkish-BERT-based sentence embedding model optimized for semantic similarity tasks. | 1.0M | 40 |
| GIST Small Embedding V0 | avsolatorio | MIT | A text embedding model fine-tuned from BAAI/bge-small-en-v1.5 on the MEDI dataset and MTEB classification datasets, with improved query encoding for retrieval. | 945.68k | 29 |