e5-small-unsupervised: an open-source text embedding model, free to use for text similarity tasks

E5 Small Unsupervised

Developed by intfloat

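The unsupervised version of E5-small. It produces text embeddings through weakly-supervised contrastive pre-training and is suited to tasks such as text similarity.

Text Embeddings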

Safetensors

Language: English | Open Source License: MIT | Tags: #UnsupervisedTextEmbeddings #WeaklySupervisedContrastiveLearning #EnglishSemanticRetrieval

Downloads 2,093

Release Time : 1/31/2023

Model Overview

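This is a text embedding model trained with contrastive learning. It converts text into vector representations and is mainly used for sentence similarity and information retrieval.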

Model Features

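Unsupervised pre-training

Pre-trained with weakly-supervised contrastive learning; no manually labeled data is required.

Efficient embeddings

Produces compact 384-dimensional text embeddings.

Prefix-sensitive

Expects each input to start with 'query: ' or 'passage: ' so that the model can distinguish the two text types.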

Model Capabilities

Text vectorization

Sentence similarity

Information retrieval

Semantic search

Use Cases

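Information retrieval

Document retrieval

Finds the document passages that are relevant to a query.

Reported to perform well on the BEIR benchmark.

Semantic analysis

Sentence similarity

Computes the semantic similarity between two sentences.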

🚀 E5-small-unsupervised

This model is similar to e5-small but has not undergone supervised fine-tuning.

Text Embeddings by Weakly-Supervised Contrastive Pre-training. Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei. arXiv 2022.

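This model has 12 layers and an embedding size of 384.

🚀 Quick Start

This model encodes English text for tasks such as text similarity and information retrieval.

💻 Usage Examples

Basic usage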

import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
input_texts = ['query: how much protein should a female eat',
               'query: summit define',
               "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
               "passage: Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."]

tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-small-unsupervised')
model = AutoModel.from_pretrained('intfloat/e5-small-unsupervised')

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
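
The arithmetic behind the example above (masked average pooling, L2 normalization, then a dot product scaled by 100) can be traced on toy numbers. The sketch below is a plain-Python illustration only: the helper names `masked_mean`, `l2_normalize`, and `dot` and the 2-dimensional "token states" are assumptions made up for this walkthrough, not part of the model or of any library.

```python
import math
from typing import List


def masked_mean(hidden: List[List[float]], mask: List[int]) -> List[float]:
    """Average the token vectors whose mask value is 1; padding is skipped."""
    dim = len(hidden[0])
    total = [0.0] * dim
    for vec, keep in zip(hidden, mask):
        if keep:
            for i, value in enumerate(vec):
                total[i] += value
    n_real_tokens = sum(mask)
    return [t / n_real_tokens for t in total]


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Toy (hypothetical) token states; the last token of each text is padding.
query_tokens = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]
passage_tokens = [[0.0, 2.0], [4.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]

q = l2_normalize(masked_mean(query_tokens, mask))    # mean [2, 0] -> [1.0, 0.0]
p = l2_normalize(masked_mean(passage_tokens, mask))  # mean [2, 2] -> [0.7071, 0.7071]

# For unit-length vectors the dot product is the cosine similarity.
score = dot(q, p) * 100
print(round(score, 2))  # 70.71
```

Because the embeddings are normalized first, each entry of `scores` in the example above is a cosine similarity multiplied by 100, with one row per query and one column per passage.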

Using with Sentence Transformers

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('intfloat/e5-small-unsupervised')
input_texts = [
    'query: how much protein should a female eat',
    'query: summit define',
    "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "passage: Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]
embeddings = model.encode(input_texts, normalize_embeddings=True)
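
The snippet above stops at the embeddings. One possible next step is to rank the passages for each query by dot product, which equals cosine similarity when the embeddings are normalized. The sketch below assumes a hypothetical helper `rank_passages` and uses toy unit-length vectors in place of real model output, so it runs without downloading the model.

```python
from typing import List, Sequence


def rank_passages(query_embs: Sequence[Sequence[float]],
                  passage_embs: Sequence[Sequence[float]]) -> List[List[int]]:
    """For each query, return passage indices sorted by descending dot product."""
    rankings = []
    for q in query_embs:
        scores = [sum(a * b for a, b in zip(q, p)) for p in passage_embs]
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        rankings.append(order)
    return rankings


# Toy unit-length vectors standing in for the output of model.encode(...).
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
toy_passages = [[0.8, 0.6], [0.6, 0.8]]

print(rank_passages(toy_queries, toy_passages))  # [[0, 1], [1, 0]]
```

With the real embeddings, the same function could presumably be called as `rank_passages(embeddings[:2], embeddings[2:])`, since the first two inputs are queries and the last two are passages.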

Install dependencies

pip install sentence_transformers~=2.2.2

📚 Documentation

Training details

Please refer to our paper at https://arxiv.org/pdf/2212.03533.pdf.

Benchmark evaluation

See unilm/e5 to reproduce this model's evaluation results on the BEIR and MTEB benchmarks.

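FAQ

1. Do I need to add the "query: " and "passage: " prefixes to input texts?

Yes. The model was trained this way, and omitting the prefixes degrades performance.

Here are some rules of thumb:

For asymmetric tasks, such as passage retrieval in open-domain QA or ad-hoc information retrieval, use "query: " and "passage: " respectively.
For symmetric tasks, such as semantic similarity or paraphrase retrieval, use the "query: " prefix.
If you want to use the embeddings as features, for example in linear-probing classification or clustering, use the "query: " prefix.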
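
The rules of thumb above can be wrapped in a small helper so that prefixes are applied consistently. This is a minimal sketch: `add_prefix` and `VALID_PREFIXES` are hypothetical names introduced here for illustration and are not part of the model or of any library.

```python
from typing import List

# Hypothetical helper names; only the two prefix strings come from the model card.
VALID_PREFIXES = ("query: ", "passage: ")


def add_prefix(texts: List[str], role: str = "query") -> List[str]:
    """Prepend 'query: ' or 'passage: ' unless a text already carries a prefix."""
    if role not in ("query", "passage"):
        raise ValueError(f"role must be 'query' or 'passage', got {role!r}")
    prefix = f"{role}: "
    return [t if t.startswith(VALID_PREFIXES) else prefix + t for t in texts]


# Asymmetric task (retrieval): queries and passages get different prefixes.
queries = add_prefix(["summit define"], role="query")
passages = add_prefix(["the highest point of a mountain"], role="passage")

# Symmetric task or feature extraction: "query: " on every text.
pair = add_prefix(["A man is eating.", "Someone is having a meal."])

print(queries, passages, pair)
```

The resulting lists can be passed to the tokenizer or to `model.encode` in place of hand-written prefixed strings.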

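2. Why are my reproduced results slightly different from those reported in the model card?

Different versions of transformers and pytorch can cause small but non-zero performance differences.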
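
When comparing results across machines, it can help to record which library versions were installed. The sketch below uses only the Python standard library (`importlib.metadata`, Python 3.8+); the function name `installed_versions` and the default package list are assumptions chosen for this example.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Sequence


def installed_versions(
    packages: Sequence[str] = ("transformers", "torch", "sentence-transformers"),
) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is missing."""
    report: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


# Print the versions alongside any evaluation numbers you report.
print(installed_versions())
```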

Citation

If you find our paper or models helpful, please consider citing as follows:

@article{wang2022text,
  title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2212.03533},
  year={2022}
}