🚀 E5-base
E5-base produces text embeddings via weakly-supervised contrastive pre-training and can be used for text retrieval, semantic similarity, and related tasks. The model has 12 layers and an embedding size of 768.
🚀 Quick Start
News (May 2023): we recommend switching to e5-base-v2, which performs better and is used the same way.
This model is based on the paper Text Embeddings by Weakly-Supervised Contrastive Pre-training by Liang Wang, Nan Yang, et al., published on arXiv in 2022.
✨ Key Features
- Efficient embeddings: converts text into 768-dimensional embedding vectors.
- Weakly-supervised training: uses weakly-supervised contrastive pre-training to improve model performance.
- Multi-task support: supports retrieval, semantic similarity, and other natural language processing tasks.
📦 Installation
This model requires the transformers library, which can be installed with:

```
pip install transformers
```

To use the sentence_transformers library, run:

```
pip install sentence_transformers~=2.2.2
```
💻 Usage Examples
Basic usage
Below is an example that encodes queries and passages from the MS-MARCO passage ranking dataset:
```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then mean-pool over the sequence dimension.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Each input text should start with "query: " or "passage: ".
input_texts = ['query: how much protein should a female eat',
               'query: summit define',
               "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
               "passage: Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."]

tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-base')
model = AutoModel.from_pretrained('intfloat/e5-base')

# Tokenize, encode, pool, and L2-normalize the embeddings.
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)

# Similarity of each query against each passage.
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
```
Advanced usage
Example using the sentence_transformers library:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/e5-base')

input_texts = [
    'query: how much protein should a female eat',
    'query: summit define',
    "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "passage: Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."
]
embeddings = model.encode(input_texts, normalize_embeddings=True)
```
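Because `encode(..., normalize_embeddings=True)` returns L2-normalized vectors, the cosine similarity between a query and a passage reduces to a plain dot product. A minimal sketch of ranking passages this way, using small toy unit-norm vectors as stand-ins for real model output:

```python
import numpy as np

# Toy unit-norm vectors standing in for model.encode(..., normalize_embeddings=True)
# output; real E5-base embeddings are 768-dimensional.
query_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
passage_emb = np.array([[0.8, 0.6], [0.6, 0.8]])

scores = query_emb @ passage_emb.T   # cosine similarities, since rows are unit-norm
best = scores.argmax(axis=1)         # index of the best-matching passage per query
print(best.tolist())  # [0, 1]
```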
📚 Documentation
Input text prefix rules
- For asymmetric tasks (e.g. passage retrieval in open QA, ad-hoc information retrieval), use the "query: " and "passage: " prefixes respectively.
- For symmetric tasks (e.g. semantic similarity, paraphrase retrieval), use the "query: " prefix.
- When using embeddings as features (e.g. linear-probe classification, clustering), use the "query: " prefix.
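The prefix rules above can be applied with simple helpers. A sketch (the function names `prep_retrieval` and `prep_similarity` are illustrative, not part of any library):

```python
# Illustrative helpers for applying the prefix rules; the names are hypothetical.
def prep_retrieval(queries, passages):
    # Asymmetric task: queries and passages get different prefixes.
    return (["query: " + q for q in queries],
            ["passage: " + p for p in passages])

def prep_similarity(texts):
    # Symmetric task (and feature extraction): every text gets "query: ".
    return ["query: " + t for t in texts]

qs, ps = prep_retrieval(["summit define"],
                        ["Definition of summit for English Language Learners."])
print(qs[0])  # query: summit define
print(ps[0])  # passage: Definition of summit for English Language Learners.
```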
🔧 Technical Details
The model has 12 layers and an embedding size of 768. It learns text embeddings via weakly-supervised contrastive pre-training, using an InfoNCE contrastive loss with the temperature set to 0.01.
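A minimal sketch of an InfoNCE loss with in-batch negatives at temperature 0.01 (illustrative only; the exact negative sampling and batching used during training are described in the paper):

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb, passage_emb, temperature=0.01):
    # Normalize so that dot products are cosine similarities.
    q = F.normalize(query_emb, dim=1)
    p = F.normalize(passage_emb, dim=1)
    logits = q @ p.T / temperature           # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))         # the i-th query matches the i-th passage
    return F.cross_entropy(logits, labels)   # other passages act as in-batch negatives

loss = info_nce(torch.randn(8, 768), torch.randn(8, 768))
```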
📄 License
This model is released under the MIT license.
📋 Model Information

| Attribute | Details |
| --- | --- |
| Model type | Text embedding model based on weakly-supervised contrastive pre-training |
| Training data | Not specified in detail |
Usage Notes
⚠️ Important
Input texts must be prefixed with "query: " or "passage: "; otherwise model performance will degrade.
💡 Tips
If your reproduced results differ slightly from those reported in the model card, this may be caused by different transformers and pytorch versions. For text embedding tasks, the relative order of the cosine similarity scores matters more than their absolute values.
Citation
If you find our paper or model helpful, please cite it as follows:

```
@article{wang2022text,
  title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2212.03533},
  year={2022}
}
```
Limitations
This model only works for English text. Long texts will be truncated to at most 512 tokens.