SFR Embedding 2 R
Developed by Salesforce
A general-purpose text embedding model from Salesforce that performs strongly across a wide range of NLP tasks.
Downloads: 26.90k
Released: 6/14/2024
Model Overview
A high-performance text embedding model that converts text into high-quality vector representations, suitable for a wide range of natural language processing tasks.
Model Highlights
Strong multi-task performance
Excels at classification, clustering, retrieval, and other NLP tasks.
General-purpose embeddings
Produces high-quality text vector representations suitable for many downstream tasks.
Semantic understanding
Accurately captures the semantic content of text and performs strongly on semantic similarity tasks.
Model Capabilities
- Text classification
- Text clustering (a minimal clustering sketch follows this list)
- Information retrieval
- Semantic similarity computation
- Text reranking
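As a quick illustration of the clustering capability, the sketch below embeds a few sentences and groups them with k-means. This is a minimal sketch, not part of the official card: the sentences, the cluster count, and the use of scikit-learn's KMeans are all illustrative assumptions.

```python
# Minimal clustering sketch (assumes sentence-transformers and scikit-learn are installed).
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("Salesforce/SFR-Embedding-2_R")

# Illustrative sentences covering two loose topics (baking vs. illness).
sentences = [
    "How to bake a chocolate cake",
    "Best frosting for a homemade cake",
    "Symptoms of the flu",
    "How long does influenza usually last",
]

# Encode to dense vectors; L2-normalizing makes Euclidean k-means
# behave like cosine-based clustering.
embeddings = model.encode(sentences, normalize_embeddings=True)

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = kmeans.fit_predict(embeddings)
for sentence, label in zip(sentences, labels):
    print(label, sentence)
```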
Use Cases
E-commerce
Product review classification (a hedged classification sketch follows this section)
Classifying Amazon product reviews.
Reaches 97.31% accuracy on the AmazonPolarity dataset.
Counterfactual review detection
Identifying counterfactual reviews on Amazon.
Reaches 92.72% accuracy on the AmazonCounterfactual dataset.
Finance
Banking customer query classification
Classifying customer queries to a bank.
Reaches 90.02% accuracy on the Banking77 dataset.
Academic research
Academic paper clustering
Clustering arXiv and bioRxiv papers.
Reaches a v_measure of 54.02 on arXivClusteringP2P.
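To make the classification use case concrete, here is a sketch that trains a lightweight classifier on top of frozen SFR embeddings. It is a minimal sketch under stated assumptions: the toy data, the labels, and the choice of scikit-learn's LogisticRegression are illustrative; the MTEB accuracies quoted above come from the standard benchmark setup, not from this snippet.

```python
# Minimal embedding-based classification sketch (assumed setup:
# sentence-transformers + scikit-learn; toy data for illustration only).
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("Salesforce/SFR-Embedding-2_R")

# Toy review sentiment data (hypothetical examples).
train_texts = [
    "This product is fantastic, works exactly as described.",
    "Great value for the money, highly recommend.",
    "Broke after two days, complete waste of money.",
    "Terrible quality, would not buy again.",
]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Freeze the embedder; fit only a small linear head on the vectors.
X_train = model.encode(train_texts, normalize_embeddings=True)
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

X_test = model.encode(["Works great and arrived on time."],
                      normalize_embeddings=True)
print(clf.predict(X_test))  # expected: [1]
```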
🚀 Salesforce/SFR-Embedding-2_R
The SFR-Embedding model by Salesforce Research, released for research purposes only.
This model is intended for research use. More technical details will be published later; in the meantime, please refer to our previous work, SFR-Embedding, for details.
🚀 Quick Start
Using the Transformers library
```python
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # With left padding, the last position of every sequence is a real token.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # With right padding, pick each sequence's final non-padding token.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'How to bake a chocolate cake'),
    get_detailed_instruct(task, 'Symptoms of the flu')
]
# No need to add instruction for retrieval documents
passages = [
    "To bake a delicious chocolate cake, you'll need the following ingredients: all-purpose flour, sugar, cocoa powder, baking powder, baking soda, salt, eggs, milk, vegetable oil, and vanilla extract. Start by preheating your oven to 350°F (175°C). In a mixing bowl, combine the dry ingredients (flour, sugar, cocoa powder, baking powder, baking soda, and salt). In a separate bowl, whisk together the wet ingredients (eggs, milk, vegetable oil, and vanilla extract). Gradually add the wet mixture to the dry ingredients, stirring until well combined. Pour the batter into a greased cake pan and bake for 30-35 minutes. Let it cool before frosting with your favorite chocolate frosting. Enjoy your homemade chocolate cake!",
    "The flu, or influenza, is an illness caused by influenza viruses. Common symptoms of the flu include a high fever, chills, cough, sore throat, runny or stuffy nose, body aches, headache, fatigue, and sometimes nausea and vomiting. These symptoms can come on suddenly and are usually more severe than the common cold. It's important to get plenty of rest, stay hydrated, and consult a healthcare professional if you suspect you have the flu. In some cases, antiviral medications can help alleviate symptoms and reduce the duration of the illness."
]

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('Salesforce/SFR-Embedding-2_R')
model = AutoModel.from_pretrained('Salesforce/SFR-Embedding-2_R')

# get the embeddings
max_length = 4096
input_texts = queries + passages
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
# [[40.132083892822266, 25.032529830932617], [15.006855010986328, 39.93733215332031]]
```
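The snippet above loads the model in full precision on CPU, which can be slow and memory-hungry for a model of this size. A hedged variant, assuming a CUDA GPU with sufficient memory is available and reusing `input_texts` from the snippet above, loads the weights in float16 and moves the batch to the same device:

```python
# Hedged variant: half-precision GPU inference (assumes a CUDA device
# with enough memory; adjust dtype/device for your hardware).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Salesforce/SFR-Embedding-2_R')
model = AutoModel.from_pretrained('Salesforce/SFR-Embedding-2_R',
                                  torch_dtype=torch.float16).to('cuda')
model.eval()

batch_dict = tokenizer(input_texts, max_length=4096, padding=True,
                       truncation=True, return_tensors='pt').to('cuda')
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**batch_dict)
```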
Using the Sentence Transformers library
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Salesforce/SFR-Embedding-2_R")


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'How to bake a chocolate cake'),
    get_detailed_instruct(task, 'Symptoms of the flu')
]
# No need to add instruction for retrieval documents
passages = [
    "To bake a delicious chocolate cake, you'll need the following ingredients: all-purpose flour, sugar, cocoa powder, baking powder, baking soda, salt, eggs, milk, vegetable oil, and vanilla extract. Start by preheating your oven to 350°F (175°C). In a mixing bowl, combine the dry ingredients (flour, sugar, cocoa powder, baking powder, baking soda, and salt). In a separate bowl, whisk together the wet ingredients (eggs, milk, vegetable oil, and vanilla extract). Gradually add the wet mixture to the dry ingredients, stirring until well combined. Pour the batter into a greased cake pan and bake for 30-35 minutes. Let it cool before frosting with your favorite chocolate frosting. Enjoy your homemade chocolate cake!",
    "The flu, or influenza, is an illness caused by influenza viruses. Common symptoms of the flu include a high fever, chills, cough, sore throat, runny or stuffy nose, body aches, headache, fatigue, and sometimes nausea and vomiting. These symptoms can come on suddenly and are usually more severe than the common cold. It's important to get plenty of rest, stay hydrated, and consult a healthcare professional if you suspect you have the flu. In some cases, antiviral medications can help alleviate symptoms and reduce the duration of the illness."
]

embeddings = model.encode(queries + passages)
scores = model.similarity(embeddings[:2], embeddings[2:]) * 100
print(scores.tolist())
# [[40.13203811645508, 25.032546997070312], [15.00684642791748, 39.937339782714844]]
```
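These scores can be used directly for reranking, one of the capabilities listed above: for each query, sort candidate passages by similarity. A minimal sketch, reusing `queries`, `passages`, and `scores` from the previous snippet:

```python
# Minimal reranking sketch: order passages by similarity to each query.
# Reuses `queries`, `passages`, and `scores` from the snippet above.
for i, query in enumerate(queries):
    ranked = sorted(zip(scores[i].tolist(), passages), reverse=True)
    print(f"Query: {query}")
    for score, passage in ranked:
        print(f"  {score:.2f}  {passage[:60]}...")
```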
📚 Documentation
Ethical Considerations
This release is for research purposes only, in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream applications. We strongly recommend that users evaluate and address potential issues around accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and follow best practices when selecting use cases, particularly in high-risk scenarios where errors or misuse could significantly affect people's lives, rights, or safety. For further guidance on use cases, please refer to our AUP and AI AUP.
Team Members
The SFR-Embedding team (∗ indicates equal contribution, † indicates co-leads):
- Rui Meng*
- Ye Liu*
- Tong Niu
- Shafiq Rayhan Joty
- Caiming Xiong †
- Yingbo Zhou †
- Semih Yavuz †
Citation
```bibtex
@misc{SFR-embedding-2,
  title={SFR-Embedding-2: Advanced Text Embedding with Multi-stage Training},
  author={Rui Meng*, Ye Liu*, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, Semih Yavuz},
  year={2024},
  url={https://huggingface.co/Salesforce/SFR-Embedding-2_R}
}
```
📄 License
This project is released under the CC BY-NC 4.0 license.
📊 Model Evaluation Results

| Task Type | Dataset | Metric | Value |
|---|---|---|---|
| Classification | MTEB AmazonCounterfactualClassification (en) | accuracy | 92.71641791044776 |
| Classification | MTEB AmazonCounterfactualClassification (en) | ap | 69.47931007147756 |
| Classification | MTEB AmazonCounterfactualClassification (en) | f1 | 88.0252625393374 |
| ... | ... | ... | ... |
| Retrieval | MTEB Touche2020 | map_at_1 | 2.806 |
| Retrieval | MTEB Touche2020 | map_at_10 | 11.369 |
| Retrieval | MTEB Touche2020 | map_at_100 | 17.791 |
| ... | ... | ... | ... |

(The table is abridged; only a sample of rows is shown here. Refer to the original document for the complete results.)
Jina Embeddings V3
Jina Embeddings V3 is a multilingual sentence embedding model supporting more than 100 languages, focused on sentence similarity and feature extraction tasks.
Text embedding · Transformers · Multilingual · jinaai · 3.7M downloads · 911 likes

Ms Marco MiniLM L6 V2 (Apache-2.0)
A cross-encoder model trained on the MS MARCO passage ranking task, used for query-passage relevance scoring in information retrieval.
Text embedding · English · cross-encoder · 2.5M downloads · 86 likes

Opensearch Neural Sparse Encoding Doc V2 Distill (Apache-2.0)
A distillation-based sparse retrieval model optimized for OpenSearch; supports inference-free document encoding and outperforms the V1 version in both search relevance and efficiency.
Text embedding · Transformers · English · opensearch-project · 1.8M downloads · 7 likes

Sapbert From PubMedBERT Fulltext (Apache-2.0)
A biomedical entity representation model based on PubMedBERT, optimized to capture semantic relations through self-alignment pretraining.
Text embedding · English · cambridgeltl · 1.7M downloads · 49 likes

Gte Large (MIT)
GTE-Large is a powerful sentence-transformer model focused on sentence similarity and text embedding tasks, with strong results across multiple benchmarks.
Text embedding · English · thenlper · 1.5M downloads · 278 likes

Gte Base En V1.5 (Apache-2.0)
GTE-base-en-v1.5 is an English sentence-transformer model focused on sentence similarity, with strong performance on several text embedding benchmarks.
Text embedding · Transformers · Multilingual · Alibaba-NLP · 1.5M downloads · 63 likes

Gte Multilingual Base (Apache-2.0)
GTE Multilingual Base is a multilingual sentence embedding model supporting more than 50 languages, suitable for tasks such as sentence similarity computation.
Text embedding · Transformers · Multilingual · Alibaba-NLP · 1.2M downloads · 246 likes

Polybert
polyBERT is a chemical language model designed to enable fully machine-driven, ultrafast polymer informatics. It maps PSMILES strings to 600-dimensional dense fingerprints that numerically represent polymer chemical structures.
Text embedding · Transformers · kuelumbus · 1.0M downloads · 5 likes

Bert Base Turkish Cased Mean Nli Stsb Tr (Apache-2.0)
A Turkish BERT-based sentence embedding model, optimized for semantic similarity tasks.
Text embedding · Transformers · Other · emrecan · 1.0M downloads · 40 likes

GIST Small Embedding V0 (MIT)
A text embedding model fine-tuned from BAAI/bge-small-en-v1.5, trained on the MEDI dataset together with MTEB classification task data, with improved query encoding for retrieval tasks.
Text embedding · Safetensors · English · avsolatorio · 945.68k downloads · 29 likes
Featured AI Models

Llama 3 Typhoon V1.5x 8b Instruct
An 8B-parameter instruction model designed for Thai, with performance comparable to GPT-3.5-turbo; optimized for application scenarios, retrieval-augmented generation, constrained generation, and reasoning tasks.
Large language model · Transformers · Multilingual · scb10x · 3,269 downloads · 16 likes

Cadet Tiny (OpenRAIL)
Cadet-Tiny is a very small conversational model trained on the SODA dataset, designed for edge-device inference; it is roughly 2% the size of the Cosmo-3B model.
Dialogue systems · Transformers · English · ToddGoldfarb · 2,691 downloads · 6 likes

Roberta Base Chinese Extractive Qa
A Chinese extractive question-answering model based on the RoBERTa architecture, suited to extracting answers from a given text.
Question answering · Chinese · uer · 2,694 downloads · 98 likes
98