🚀 E5-small-unsupervised
This model is similar to e5-small but has not undergone supervised fine-tuning.
Text Embeddings by Weakly-Supervised Contrastive Pre-training
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei, arXiv 2022
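This model has 12 layers and an embedding size of 384.
🚀 Quick Start
This model encodes English text into embeddings for tasks such as text similarity and information retrieval.
💻 Usage Examples
Basic Usage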
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
input_texts = ['query: how much protein should a female eat',
               'query: summit define',
               "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
               "passage: Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."]
tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-small-unsupervised')
model = AutoModel.from_pretrained('intfloat/e5-small-unsupervised')
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
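To make the pooling step above concrete, here is a minimal sketch that runs the same `average_pool` function on a tiny hand-made batch instead of real model outputs. The tensor values and shapes are invented purely for illustration; the point is that padding positions (mask value 0) do not contribute to the average.

```python
import torch
from torch import Tensor


def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Zero out hidden states at padding positions, then average over real tokens only.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Toy batch (made-up values): 2 sequences, 3 token positions, hidden size 2.
hidden = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],      # no padding
])
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = average_pool(hidden, mask)
print(pooled.tolist())  # [[2.0, 3.0], [2.0, 2.0]]
```

The large values at the padded position in the first sequence are ignored, so the result is the mean of the two real tokens only.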
Usage with Sentence Transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('intfloat/e5-small-unsupervised')
input_texts = [
    'query: how much protein should a female eat',
    'query: summit define',
    "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "passage: Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."
]
embeddings = model.encode(input_texts, normalize_embeddings=True)
Package requirements
pip install sentence_transformers~=2.2.2
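The Sentence Transformers example stops at the embeddings. As a rough sketch of the next step, the snippet below scores queries against passages in the same way as the basic usage example. It uses a small hand-made NumPy array as a stand-in for the output of `model.encode(..., normalize_embeddings=True)`; the helper name `score_queries_against_passages` and the example vectors are illustrative assumptions, not part of the library.

```python
import numpy as np


def score_queries_against_passages(embeddings: np.ndarray, num_queries: int) -> np.ndarray:
    """Return a (num_queries, num_passages) matrix of cosine similarities scaled by 100.

    Assumes the rows are L2-normalized and that queries come first,
    followed by passages (hypothetical helper for illustration).
    """
    queries = embeddings[:num_queries]
    passages = embeddings[num_queries:]
    # With L2-normalized rows, the dot product equals cosine similarity.
    return (queries @ passages.T) * 100


# Stand-in for real model output: four made-up unit vectors
# (2 queries followed by 2 passages).
fake_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [0.0, 1.0],
])
scores = score_queries_against_passages(fake_embeddings, num_queries=2)
print(scores.round(2).tolist())
```

With real embeddings, pass the array returned by `encode` and set `num_queries` to the number of query texts at the start of `input_texts`.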
📚 Documentation
Training Details
Please refer to our paper at https://arxiv.org/pdf/2212.03533.pdf.
Benchmark Evaluation
See unilm/e5 to reproduce this model's evaluation results on the BEIR and MTEB benchmarks.
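FAQ
1. Do I need to add the prefixes "query: " and "passage: " to input texts?
Yes. The model was trained this way; omitting the prefixes degrades performance.
Here are some rules of thumb:
- Use "query: " and "passage: " respectively for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.
- Use the "query: " prefix for symmetric tasks such as semantic similarity and paraphrase retrieval.
- Use the "query: " prefix if you want to use embeddings as features, for example in linear probing classification or clustering.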
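The prefix rules above can be sketched as a small helper. The function name `add_prefix` and the constants are hypothetical conveniences introduced here for illustration; only the prefix strings themselves come from the model card.

```python
from typing import List

QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def add_prefix(texts: List[str], prefix: str) -> List[str]:
    """Prepend the given E5 prefix to every text (hypothetical helper)."""
    return [prefix + text for text in texts]


# Asymmetric task (retrieval): queries and passages get different prefixes.
queries = add_prefix(["summit define"], QUERY_PREFIX)
passages = add_prefix(["Definition of summit for English Language Learners."], PASSAGE_PREFIX)

# Symmetric task (semantic similarity) or feature extraction: "query: " on everything.
sentences = add_prefix(["A cat sits on the mat.", "A cat is sitting on a mat."], QUERY_PREFIX)

print(queries, passages, sentences, sep="\n")
```

The prefixed lists can then be passed to the tokenizer or to `model.encode` exactly as in the usage examples.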
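2. Why are my reproduced results slightly different from those reported in the model card?
Different versions of transformers and pytorch can cause small but non-zero performance differences.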
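When comparing results, it can help to record which library versions are installed. The following is a minimal sketch using only the Python standard library; the function name `installed_versions` and the choice of packages to check are assumptions made for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("transformers", "torch", "sentence-transformers")):
    """Map each package name to its installed version, or None if it is not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions()
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

Logging this output alongside evaluation scores makes it easier to tell whether a small discrepancy might come from the environment.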
Citation
If you find our paper or models helpful, please consider citing as follows:
@article{wang2022text,
title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
journal={arXiv preprint arXiv:2212.03533},
year={2022}
}
Limitations
This model only works for English text. Long texts are truncated to at most 512 tokens.
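To see how much of an input the 512-token limit would cut off, a simple check like the sketch below may be useful. The helper `truncation_report` is hypothetical, and it assumes you already have the token count of the input including any special tokens, for example from `len(tokenizer(text)["input_ids"])` without truncation.

```python
def truncation_report(num_tokens: int, max_length: int = 512) -> dict:
    """Report how many tokens survive truncation to max_length (hypothetical helper)."""
    kept = min(num_tokens, max_length)
    return {
        "kept": kept,
        "dropped": num_tokens - kept,
        "truncated": num_tokens > max_length,
    }


print(truncation_report(600))  # {'kept': 512, 'dropped': 88, 'truncated': True}
print(truncation_report(120))  # {'kept': 120, 'dropped': 0, 'truncated': False}
```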
📄 License
This project is released under the MIT License.