e5-base: an open-source text embedding model, free to use for classification, retrieval, and other natural language processing tasks

E5 Base

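Developed by intfloat

E5-base is a general-purpose text embedding model suited to a range of natural language processing tasks, including classification, retrieval, clustering, and semantic similarity.

Text Embedding

Safetensors

Language: English
License: MIT
#multi-task text embedding #high-accuracy classification #semantic retrieval optimization

Downloads: 30.85k

Released: 12/26/2022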

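Model Overview

E5-base is a Transformer-based text embedding model that converts text into high-dimensional vector representations for use in a range of downstream tasks.

Model Features

Multi-task support

Supports several natural language processing tasks, including classification, retrieval, clustering, and semantic similarity.

High performance

Performs well on benchmark suites such as MTEB.

Versatility

Applies to many text processing scenarios without extensive task-specific tuning.

Model Capabilities

Text classification

Text retrieval

Text clustering

Semantic similarity

Text reranking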

Use Cases

E-commerce

Product review classification

Classify Amazon product reviews as positive or negative.

Accuracy on the MTEB AmazonPolarityClassification dataset is 87.96%.

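Product retrieval

Retrieve relevant products for a user query.

The reported F1 score on the MTEB AmazonReviewsClassification dataset is 42.23. Note that this benchmark measures review classification rather than retrieval.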

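Academic research

Paper clustering

Cluster academic papers from arXiv and BioRxiv.

V-measure on the MTEB ArxivClusteringP2P dataset is 44.57.

Question answering systems

Duplicate question detection

Detect duplicate questions in Q&A communities.

MAP on the MTEB AskUbuntuDupQuestions dataset is 59.66.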

🚀 E5-base

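E5-base produces text embeddings through weakly-supervised contrastive pre-training and can be used for tasks such as text retrieval and semantic similarity. The model has 12 layers and an embedding size of 768.

🚀 Quick Start

News (May 2023): switching to e5-base-v2 is recommended. It performs better and is used in the same way.

This model is based on the paper Text Embeddings by Weakly-Supervised Contrastive Pre-training by Liang Wang, Nan Yang, and others, published on arXiv in 2022.

✨ Key Features

Efficient embedding: converts text into 768-dimensional embedding vectors.
Weakly-supervised training: uses weakly-supervised contrastive pre-training to improve model quality.
Multi-task support: supports retrieval, semantic similarity, and other natural language processing tasks.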

📦 Installation

The model requires the transformers library, which can be installed with:

pip install transformers

To use the sentence_transformers library instead, run:

pip install sentence_transformers~=2.2.2

💻 Usage Examples

Basic usage

The following example encodes queries and passages from the MS-MARCO passage ranking dataset:

import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
input_texts = ['query: how much protein should a female eat',
               'query: summit define',
               "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
               "passage: Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."]

tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-base')
model = AutoModel.from_pretrained('intfloat/e5-base')

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
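
The average_pool function above zeroes out padding positions and divides by the number of real tokens, so padding does not dilute the sentence vector. The sketch below reproduces the same arithmetic on plain nested lists so it can be checked by hand. The function name average_pool_lists and the toy values are illustrative assumptions, not part of the model card.

```python
from typing import List


def average_pool_lists(hidden: List[List[List[float]]],
                       mask: List[List[int]]) -> List[List[float]]:
    """Masked mean over the token axis, mirroring average_pool on nested lists.

    hidden has shape [batch][tokens][dim]; mask has shape [batch][tokens],
    with 1 for real tokens and 0 for padding.
    """
    pooled = []
    for token_vectors, token_mask in zip(hidden, mask):
        dim = len(token_vectors[0])
        real_tokens = sum(token_mask)
        summed = [0.0] * dim
        for vector, keep in zip(token_vectors, token_mask):
            if keep:  # padding positions contribute nothing to the sum
                for i, value in enumerate(vector):
                    summed[i] += value
        pooled.append([s / real_tokens for s in summed])
    return pooled


# One sequence with two real tokens and one padding token (toy values).
hidden = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
mask = [[1, 1, 0]]
print(average_pool_lists(hidden, mask))  # [[2.0, 3.0]]
```

The padding vector [100.0, 100.0] has no effect on the result, which is the property the masked_fill call provides in the tensor version.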

Advanced usage

Example using the sentence_transformers library:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('intfloat/e5-base')
input_texts = [
    'query: how much protein should a female eat',
    'query: summit define',
    "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "passage: Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]
embeddings = model.encode(input_texts, normalize_embeddings=True)

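📚 Documentation

Input text prefix rules

For asymmetric tasks (such as passage retrieval in open-domain QA, or ad-hoc information retrieval), use the "query: " and "passage: " prefixes respectively.
For symmetric tasks (such as semantic similarity or paraphrase retrieval), use the "query: " prefix.
When embeddings are used as features (such as linear probing classification or clustering), use the "query: " prefix.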
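
The prefix rules above can be applied with a small helper before tokenization. This is a minimal sketch: the function name add_e5_prefix and the role labels "query" and "passage" are hypothetical names chosen for illustration, and only the two prefix strings come from the model card.

```python
from typing import List

# The two prefixes named in the model card, keyed by a hypothetical role label.
E5_PREFIXES = {"query": "query: ", "passage": "passage: "}


def add_e5_prefix(texts: List[str], role: str = "query") -> List[str]:
    """Prepend the E5 prefix for the given role to every text."""
    if role not in E5_PREFIXES:
        raise ValueError(f"role must be one of {sorted(E5_PREFIXES)}, got {role!r}")
    prefix = E5_PREFIXES[role]
    return [prefix + text for text in texts]


# Asymmetric task: queries and passages get different prefixes.
queries = add_e5_prefix(["summit define"], role="query")
passages = add_e5_prefix(["Definition of summit for English Language Learners."],
                         role="passage")

# Symmetric task or feature extraction: every text gets the "query: " prefix.
sentences = add_e5_prefix(["A cat sits on the mat.", "A cat is on the mat."])

print(queries[0])   # query: summit define
print(passages[0])  # passage: Definition of summit for English Language Learners.
```

The prefixed lists can then be passed to the tokenizer or to SentenceTransformer.encode in place of hand-written strings.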

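🔧 Technical Details

The model has 12 layers and an embedding size of 768. It learns text embeddings through weakly-supervised contrastive pre-training with the InfoNCE contrastive loss, using a temperature of 0.01.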
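
To make the role of the temperature concrete, here is a minimal sketch of an InfoNCE loss written with the Python standard library only. It is not the authors' training code. It assumes cosine similarity as the scoring function and in-batch negatives (the i-th passage is the positive for the i-th query, all other passages in the batch are negatives). Only the loss name and the temperature of 0.01 come from the text above, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(queries: List[Sequence[float]],
             passages: List[Sequence[float]],
             temperature: float = 0.01) -> float:
    """Mean InfoNCE loss, assuming passages[i] is the positive for queries[i]."""
    losses = []
    for i, query in enumerate(queries):
        logits = [cosine(query, passage) / temperature for passage in passages]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # -log(exp(positive) / sum(exp(all))) = log_denominator - positive
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy 2-dimensional embeddings, chosen for illustration only.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(queries, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce(queries, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss, shuffled_loss)
```

With a temperature of 0.01, similarity differences are scaled by a factor of 100 before the softmax, so the loss is close to zero when each query is most similar to its own passage and grows large when the pairs are mismatched.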

📄 License

This model is released under the MIT license.

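📋 Information Table

Property	Details
Model type	Text embedding model based on weakly-supervised contrastive pre-training
Training data	Not specified in detail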

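Common Notes

⚠️ Important

Input texts must carry a "query: " or "passage: " prefix. Without it, model performance degrades.

💡 Usage Tip

If your reproduced results differ slightly from those reported in the model card, differing transformers and pytorch versions are a likely cause. For text embedding tasks, the relative order of cosine similarity scores matters more than their absolute values.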
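
Because relative order is what matters, downstream code typically turns a score matrix into a ranking rather than thresholding raw values. The sketch below shows one way to do that. The function name rank_passages and the score values are hypothetical and are not actual model output.

```python
from typing import List


def rank_passages(scores: List[List[float]]) -> List[List[int]]:
    """For each query row, return passage indices sorted from best to worst."""
    return [
        sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        for row in scores
    ]


# Hypothetical query-by-passage scores (rows are queries, columns are passages).
scores = [[89.0, 70.0],
          [68.5, 90.2]]
ranking = rank_passages(scores)
print(ranking)  # [[0, 1], [1, 0]]

# Shifting every score by the same amount leaves the ranking unchanged.
shifted = [[s - 5.0 for s in row] for row in scores]
print(rank_passages(shifted) == ranking)  # True
```

Small numerical differences between library versions usually move all scores slightly without changing this ordering.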

Citation

If you find our paper or models helpful, please cite as follows:

@article{wang2022text,
  title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2212.03533},
  year={2022}
}