e5-small: an open-source sentence-transformer model for efficient sentence similarity and text embedding

E5 Small

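Developed by intfloat

E5-small is a small sentence-transformer model focused on sentence similarity and text embedding tasks. It performs well on a range of classification and retrieval tasks.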

Text Embeddings

Safetensors

Language: English. License: MIT. Tags: #sentence-similarity #multi-task-text-embedding #english-text-processing

Downloads: 16.96k

Released: 12/7/2022

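Model Overview

The model is mainly used to generate sentence-level embeddings and is suited to text similarity computation, information retrieval, and classification tasks.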

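Model Features

Multi-task performance

Balanced results across classification, clustering, retrieval, and other NLP tasks

Efficient embeddings

Generates high-quality sentence-level embeddings

Broad applicability

Supports a variety of text processing tasks, including similarity computation and information retrieval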

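Model Capabilities

Sentence embedding generation

Text similarity computation

Text classification

Text clustering

Information retrieval

Reranking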

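Use Cases

E-commerce

Product review sentiment analysis

Classifies the sentiment polarity of Amazon product reviews

87.53% accuracy on the AmazonPolarity dataset

Counterfactual review detection

Identifies counterfactual reviews on Amazon

76.22% accuracy on the AmazonCounterfactual dataset

Customer service

Banking inquiry classification

Automatically classifies bank customer inquiries

81.87% accuracy on the Banking77 dataset

Academic research

Academic paper clustering

Clusters arXiv and bioRxiv papers by topic

V-measure of 44.14 on the arXiv clustering task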

🚀 E5-small

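This model produces text embeddings through weakly-supervised contrastive pre-training. It has 12 layers and an embedding size of 384. It handles English text and performs well on tasks such as text retrieval and semantic similarity. The model supports English only, and long texts are truncated to at most 512 tokens.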

🚀 Quick Start

Model replacement notice

News, May 2023: please switch to e5-small-v2, which performs better and is used in the same way.

Paper

Text Embeddings by Weakly-Supervised Contrastive Pre-training. Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei. arXiv, 2022.

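✨ Key Features

Prefix requirement: each input text should start with "query: " or "passage: ". For tasks other than retrieval, the "query: " prefix alone is sufficient.
Multi-task applicability: suited to a range of NLP tasks, such as passage retrieval in open QA, ad-hoc information retrieval, semantic similarity, paraphrase retrieval, linear probing classification, and clustering.
Consistent performance: different versions of transformers and pytorch can cause small performance differences, but overall results remain stable.
Cosine similarity distribution: because the model is trained with an InfoNCE contrastive loss at a low temperature of 0.01, cosine similarity scores typically fall between 0.7 and 1.0. This does not affect the relative ranking that text embedding tasks rely on.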

📦 Installation

When using the sentence_transformers library, install the specified version:

pip install sentence_transformers~=2.2.2

💻 Usage Examples

Basic usage

The following example encodes queries and passages from the MS-MARCO passage ranking dataset:

import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
input_texts = ['query: how much protein should a female eat',
               'query: summit define',
               "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
               "passage: Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."]

tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-small')
model = AutoModel.from_pretrained('intfloat/e5-small')

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
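
To make the pooling step above concrete, here is a minimal pure-Python sketch of what `average_pool` computes for a single sequence: positions where the attention mask is 0 (padding) are ignored, and the sum of the remaining token vectors is divided by the number of real tokens. The function name `masked_mean` and the toy values are illustrative assumptions, not part of the original example.

```python
from typing import List


def masked_mean(hidden: List[List[float]], mask: List[int]) -> List[float]:
    """Average token vectors, ignoring positions where mask == 0."""
    dim = len(hidden[0])
    total = [0.0] * dim
    for vec, keep in zip(hidden, mask):
        if keep:
            for i, value in enumerate(vec):
                total[i] += value
    count = sum(mask)  # number of real (non-padding) tokens
    return [t / count for t in total]


# Three token vectors; the last one is padding and must not affect the result.
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
pooled = masked_mean(toy_hidden, toy_mask)
print(pooled)  # [2.0, 3.0]
```

The torch version does the same thing for a whole batch at once: `masked_fill` zeroes the padded positions and `attention_mask.sum(dim=1)` supplies the per-sequence token count.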

Advanced usage

Example using the sentence_transformers library:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('intfloat/e5-small')
input_texts = [
    'query: how much protein should a female eat',
    'query: summit define',
    "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "passage: Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]
embeddings = model.encode(input_texts, normalize_embeddings=True)
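
The `encode` call above requests normalized embeddings, in which case the cosine similarity between two vectors reduces to their dot product. The following is a small pure-Python sketch of that relationship. It uses toy 2-dimensional vectors in place of real 384-dimensional embeddings, and the helper names `l2_normalize` and `dot` are assumptions made for illustration.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def dot(a: List[float], b: List[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for one query embedding and two passage embeddings.
query_vec = l2_normalize([3.0, 4.0])
passage_vecs = [l2_normalize([4.0, 3.0]), l2_normalize([-4.0, 3.0])]

scores = [dot(query_vec, p) for p in passage_vecs]
best = max(range(len(scores)), key=lambda i: scores[i])
print(scores, best)
```

With real embeddings, the same comparison between the first two rows (queries) and the last two rows (passages) would yield the query-to-passage scores.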

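📚 Documentation

Training details

For training details, see the paper: https://arxiv.org/pdf/2212.03533.pdf.

Benchmark evaluation

See unilm/e5 to reproduce this model's evaluation results on the BEIR and MTEB benchmarks.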

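FAQ

1. Do I need to add the "query: " and "passage: " prefixes to input texts? Yes. The model was trained this way, and performance degrades without them. Some rules of thumb:

For asymmetric tasks (such as passage retrieval in open QA and ad-hoc information retrieval), use "query: " and "passage: " respectively.
For symmetric tasks (such as semantic similarity and paraphrase retrieval), use the "query: " prefix.
When using embeddings as features (such as linear probing classification and clustering), use the "query: " prefix.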
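
The rules of thumb above can be sketched as a small helper that picks the prefix from the task type. The function `add_prefix`, its parameters, and the task labels are hypothetical names chosen for this sketch; they are not part of the model or of any library.

```python
from typing import List

# Hypothetical task labels, chosen for this sketch only.
QUERY_ONLY_TASKS = {"similarity", "paraphrase", "classification", "clustering"}


def add_prefix(texts: List[str], task: str, is_query: bool = True) -> List[str]:
    """Prepend the E5 prefix suggested by the rules of thumb above."""
    if task == "retrieval":
        # Asymmetric task: queries and passages get different prefixes.
        prefix = "query: " if is_query else "passage: "
    elif task in QUERY_ONLY_TASKS:
        # Symmetric tasks and feature extraction always use "query: ".
        prefix = "query: "
    else:
        raise ValueError(f"unknown task: {task!r}")
    return [prefix + text for text in texts]


queries = add_prefix(["summit define"], task="retrieval", is_query=True)
passages = add_prefix(["Definition of summit ..."], task="retrieval", is_query=False)
features = add_prefix(["great product"], task="classification")
print(queries, passages, features)
```

Applying the prefix in one place, just before tokenization, keeps it from being forgotten or added twice.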

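2. Why are my reproduced results slightly different from those reported in the model card? Different versions of transformers and pytorch can cause small but non-zero performance differences.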

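3. Why do the cosine similarity scores fall between 0.7 and 1.0? This is known and expected behavior, because the model is trained with an InfoNCE contrastive loss at a low temperature of 0.01. For text embedding tasks such as text retrieval or semantic similarity, what matters is the relative order of the scores rather than their absolute values, so this should not be a problem.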
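
As a small illustration of why the narrow score range is harmless for ranking, the sketch below sorts a list of made-up cosine similarities and shows that a monotonic rescaling (min-max here, chosen only as an example) leaves the order unchanged. The helper names and the score values are assumptions for illustration, not model outputs.

```python
from typing import List


def rank(scores: List[float]) -> List[int]:
    """Return candidate indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def min_max(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; a monotonic map, so the order is unchanged."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Made-up cosine similarities in the typical 0.7 to 1.0 band.
toy_scores = [0.78, 0.91, 0.74, 0.85]
print(rank(toy_scores))           # [1, 3, 0, 2]
print(rank(min_max(toy_scores)))  # [1, 3, 0, 2]
```

If a fixed absolute threshold is needed, it has to be calibrated on scores from this model rather than carried over from a model with a different score distribution.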

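🔧 Technical Details

The model has 12 layers and an embedding size of 384. It is trained with weakly-supervised contrastive pre-training and learns text embeddings from inputs that carry the "query: " and "passage: " prefixes. Training uses an InfoNCE contrastive loss with a low temperature of 0.01, which causes cosine similarity scores to typically fall between 0.7 and 1.0.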
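
To give a feel for what a temperature of 0.01 does, here is a minimal sketch of the standard InfoNCE loss for a single query: the negative log of the softmax probability assigned to the positive passage, with all similarities divided by the temperature. This is the textbook form and is an assumption here; the paper's exact formulation (for example, how negatives are chosen) may differ. The similarity values are made up.

```python
import math
from typing import List


def info_nce(pos_sim: float, neg_sims: List[float], temperature: float = 0.01) -> float:
    """Negative log softmax probability of the positive among all candidates."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum - logits[0]


# Made-up similarities: the positive is only 0.05 above the best negative.
loss_low_t = info_nce(0.90, [0.85, 0.80], temperature=0.01)
loss_high_t = info_nce(0.90, [0.85, 0.80], temperature=1.0)
print(loss_low_t, loss_high_t)
```

At temperature 0.01 a gap of 0.05 in cosine similarity already becomes a gap of 5 in the logits, so the loss is close to zero even though all scores sit close together. This is consistent with the compressed score range described above.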

📄 License

This model is released under the MIT license.

Citation

If you find our paper or models helpful, please cite as follows:

@article{wang2022text,
  title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2212.03533},
  year={2022}
}