🚀 E5-RoPE-Base
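E5-RoPE-Base is an embedding model for long-context retrieval. It accompanies the paper LongEmbed: Extending Embedding Models for Long Context Retrieval and exists to compare embedding models that use absolute position embeddings (APE) with those that use rotary position embeddings (RoPE), showing the advantages of RoPE on long contexts.
🚀 Quick Start
The model has 12 layers and an embedding size of 768. Its usage is described below.
💻 Usage Example
Basic Usage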
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average over the real tokens only.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Each input text carries a "query: " or "passage: " prefix.
input_texts = ['query: how much protein should a female eat',
               'query: summit define',
               "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
               "passage: Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."]

tokenizer = AutoTokenizer.from_pretrained('dwzhu/e5rope-base', trust_remote_code=True)
model = AutoModel.from_pretrained('dwzhu/e5rope-base', trust_remote_code=True).cuda()

# Tokenize the inputs and move them to the GPU (this example requires CUDA).
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt', pad_to_multiple_of=8)
batch_dict = {k: v.cuda() for k, v in batch_dict.items()}

outputs = model(**batch_dict)
# Mean-pool the token states, then L2-normalize so the dot product is a cosine similarity.
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
# Similarity of each of the two queries to each of the two passages, scaled by 100.
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
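To make the pooling and scoring steps of the example above easier to follow, here is a minimal sketch of the same arithmetic on toy data. It is written in plain Python so it runs without downloading the model; the token vectors, masks, and the helper names `l2_normalize` and `score` are made up for illustration and are not part of the model's code.

```python
import math


def average_pool(token_vectors, attention_mask):
    """Mean of the token vectors whose mask value is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def score(a, b):
    """Dot product of two unit vectors (cosine similarity), scaled by 100."""
    return 100 * sum(x * y for x, y in zip(a, b))


# Toy "hidden states": 3 token positions, 2 dimensions; the last position is padding.
query_tokens = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]
query_mask = [1, 1, 0]
passage_tokens = [[0.0, 2.0], [4.0, 2.0], [0.0, 0.0]]
passage_mask = [1, 1, 0]

query_vec = l2_normalize(average_pool(query_tokens, query_mask))
passage_vec = l2_normalize(average_pool(passage_tokens, passage_mask))
similarity = score(query_vec, passage_vec)
print(query_vec, passage_vec, similarity)
```

Note that the padded position in `query_tokens` holds large values yet has no effect on the result, which is the purpose of the attention mask in `average_pool`.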
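📚 Documentation
Training Details
For details on how the model was trained, please refer to our paper: https://arxiv.org/abs/2404.12096.pdf
Benchmark Evaluation
To reproduce this model's evaluation results on the BEIR and MTEB benchmarks, see unilm/e5.
Note that E5-RoPE-Base was not trained specifically to optimize performance. Its purpose is to compare embedding models that use absolute position embeddings (APE) with those that use rotary position embeddings (RoPE). By comparing E5-Base with E5-RoPE-Base, we show that RoPE-based embedding models are superior at handling long contexts. For more details, please refer to our paper, LongEmbed: Extending Embedding Models for Long Context Retrieval.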
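As background for the APE versus RoPE comparison, the following is a minimal sketch of rotary position embedding in its commonly described form, where each pair of vector components is rotated by an angle proportional to the token position. It is not taken from this model's implementation: the function `rope`, the frequency base of 10000, and the toy vectors are assumptions for illustration. The sketch shows one frequently cited property of RoPE, namely that the dot product of two rotated vectors depends only on the offset between their positions, not on the absolute positions.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]), i even, by position * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query and key vectors (dimension must be even).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

near = dot(rope(q, 5), rope(k, 2))          # positions 5 and 2, offset 3
far = dot(rope(q, 1005), rope(k, 1002))     # positions 1005 and 1002, offset 3
other = dot(rope(q, 5), rope(k, 4))         # positions 5 and 4, offset 1

print(near, far, other)
```

Here `near` and `far` agree up to floating-point error because both pairs of positions are 3 apart, while `other` differs because its offset is 1.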
📄 License
This project is released under the MIT License.
📖 Citation
If you find our paper or model helpful, please cite it as follows:
@article{zhu2024longembed,
  title={LongEmbed: Extending Embedding Models for Long Context Retrieval},
  author={Zhu, Dawei and Wang, Liang and Yang, Nan and Song, Yifan and Wu, Wenhao and Wei, Furu and Li, Sujian},
  journal={arXiv preprint arXiv:2404.12096},
  year={2024}
}