🚀 E5-RoPE-Base
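E5-RoPE-Base is an embedding model for long-context retrieval. It accompanies the paper LongEmbed: Extending Embedding Models for Long Context Retrieval and exists to compare embedding models that use absolute position embeddings (APE) with those that use rotary position embeddings (RoPE), showing the advantage of RoPE on long contexts.
🚀 Quick Start
The model has 12 layers and an embedding dimension of 768. Its usage is described below.
💻 Usage Examples
Basic Usage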
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
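def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out hidden states at padding positions so they do not affect the mean.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    # Sum over the sequence dimension and divide by the number of real tokens.
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]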
input_texts = ['query: how much protein should a female eat',
'query: summit define',
"passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
"passage: Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."]
tokenizer = AutoTokenizer.from_pretrained('dwzhu/e5rope-base', trust_remote_code=True)
model = AutoModel.from_pretrained('dwzhu/e5rope-base', trust_remote_code=True).cuda()
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt', pad_to_multiple_of=8)
batch_dict = {k: v.cuda() for k, v in batch_dict.items()}
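outputs = model(**batch_dict)
# Mean-pool token states into one vector per text, then L2-normalize.
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
# Cosine similarity of the two queries against the two passages, scaled by 100.
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())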
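The pooling and scoring steps in the snippet above can be traced by hand on toy data. The following is a minimal, dependency-free sketch that handles one sequence at a time rather than a batch; the helper names `l2_normalize` and `score` and all toy values are made up for illustration and are not part of the model's code.

```python
import math

def average_pool(hidden, mask):
    """Mean of the token vectors whose mask entry is 1; padding is ignored."""
    dim = len(hidden[0])
    kept = [vec for vec, m in zip(hidden, mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def score(a, b):
    """Dot product of two unit vectors (cosine similarity), scaled by 100."""
    return 100 * sum(x * y for x, y in zip(a, b))

# Toy data: 3 token slots with 2-dimensional hidden states.
# The last slot of the query is padding, so its values must not matter.
query_hidden = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]
query_mask = [1, 1, 0]
passage_hidden = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0]]
passage_mask = [1, 1, 1]

q = l2_normalize(average_pool(query_hidden, query_mask))
p = l2_normalize(average_pool(passage_hidden, passage_mask))
toy_score = score(q, p)
print(toy_score)
```

Because the padded slot is masked out, the query pools to the same direction as the passage, so the two vectors reach the maximum possible score. Without the mask, the padding values would shift the mean and lower the score.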
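📚 Documentation
Training Details
For details on how the model was trained, see our paper: https://arxiv.org/abs/2404.12096.pdf
Benchmark Evaluation
To reproduce the evaluation results on the BEIR and MTEB benchmarks, see unilm/e5.
Note that E5-RoPE-Base was not trained specifically to maximize performance. Its purpose is to compare embedding models that use absolute position embeddings (APE) with those that use rotary position embeddings (RoPE). By comparing E5-Base with E5-RoPE-Base, we show that RoPE-based embedding models handle long contexts better. For more details, see our paper LongEmbed: Extending Embedding Models for Long Context Retrieval.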
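For readers unfamiliar with RoPE, the sketch below illustrates the general idea: each pair of vector components is rotated by an angle proportional to the token position, so the dot product between a rotated query and a rotated key depends only on their relative offset. This is a generic illustration and not the implementation used by E5-RoPE-Base; the interleaved pairing, the base of 10000, and the toy vectors are assumptions chosen for the example.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle position * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (made-up values, dimension 4).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same relative offset (3) at very different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 1005), rope(k, 1002))
print(near, far)
```

The two printed values agree up to floating-point error, which shows that the attention score under this scheme is a function of relative position only. An absolute position embedding, by contrast, assigns a separate learned vector to each position index.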
📄 License
This project is released under the MIT License.
📖 Citation
If you find our paper or model helpful, please cite it as follows:
@article{zhu2024longembed,
  title={LongEmbed: Extending Embedding Models for Long Context Retrieval},
  author={Zhu, Dawei and Wang, Liang and Yang, Nan and Song, Yifan and Wu, Wenhao and Wei, Furu and Li, Sujian},
  journal={arXiv preprint arXiv:2404.12096},
  year={2024}
}