E5rope-base: A Free, Open-Source Embedding Model for Long-Context Retrieval

E5rope Base

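Developed by dwzhu

E5-RoPE-Base is an embedding model built on rotary position embedding (RoPE) and designed to support long-context retrieval tasks.

Text Embedding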

Safetensors

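English · License: MIT · #long-context retrieval #rotary position embedding #sentence similarity

Downloads: 129

Released: April 18, 2024

Model Overview

The model is mainly used for sentence similarity and long-context retrieval. It uses rotary position embedding (RoPE) to handle long texts better.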

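Model Features

Rotary position embedding (RoPE)

Encodes token positions with rotary position embedding, which suits long-context retrieval.

Efficient retrieval

Improves the retrieval performance of the embedding model on long contexts.

Multi-task support

Supports several tasks, including sentence similarity and long-context retrieval.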

Model Capabilities

Sentence similarity

Long-context retrieval

Text embedding generation

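Use Cases

Information retrieval

Query-passage matching

Matches queries to relevant passages to improve the accuracy of a retrieval system.

Performs well on the BEIR and MTEB benchmarks, although the model was trained for an APE-versus-RoPE comparison rather than for peak scores.

Semantic similarity

Sentence similarity

Computes the semantic similarity between two sentences.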
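As a rough illustration of what "semantic similarity" means once two sentences have been embedded, the sketch below computes cosine similarity between two vectors in plain Python. The helper name `cosine_similarity` and the toy vectors are assumptions made for this example; they are not model output, and real embeddings from this model have 768 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for sentence embeddings (hypothetical values)
sentence_a = [1.0, 2.0, 3.0]
sentence_b = [2.0, 4.0, 6.0]   # same direction as sentence_a
sentence_c = [-1.0, -2.0, -3.0]  # opposite direction

print(cosine_similarity(sentence_a, sentence_b))
print(cosine_similarity(sentence_a, sentence_c))
```

Vectors pointing the same way score close to 1 and opposite vectors close to -1, regardless of their lengths.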

🚀 E5-RoPE-Base

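E5-RoPE-Base is an embedding model for long-context retrieval. It comes from the paper LongEmbed: Extending Embedding Models for Long Context Retrieval, and it exists to compare embedding models that use absolute position embedding (APE) with those that use rotary position embedding (RoPE), showing the advantage of RoPE on long contexts.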
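To make the APE-versus-RoPE contrast more concrete, here is a minimal sketch of the rotation that RoPE applies, written in plain Python. It assumes the interleaved-pair formulation and the base of 10000 from the RoFormer paper; the names `rope_rotate` and `dot` and the toy vectors are illustrative, and the model's own remote code may lay out the dimensions differently. What it shows is the property that makes RoPE relative: the score between a rotated query and a rotated key depends only on the offset between their positions.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle position * base**(-i/d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (hypothetical values)
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same offset of 3 between query and key, at very different absolute positions
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 1005), rope_rotate(k, 1002))
print(abs(score_near - score_far) < 1e-9)
```

An absolute position embedding has no such guarantee, since it adds a separate learned vector for each position index.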

🚀 Quick Start

The model has 12 layers and an embedding dimension of 768. Its usage is described below.

💻 Usage Example

Basic usage

import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
input_texts = ['query: how much protein should a female eat',
               'query: summit define',
               "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
               "passage: Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."]
# Use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = AutoTokenizer.from_pretrained('dwzhu/e5rope-base', trust_remote_code=True)
model = AutoModel.from_pretrained('dwzhu/e5rope-base', trust_remote_code=True).to(device)
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt', pad_to_multiple_of=8)
batch_dict = {k: v.to(device) for k, v in batch_dict.items()}
# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
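In the example above, `scores` is a 2 x 2 matrix: each row is a query, each column is a passage, and each entry is the cosine similarity scaled by 100. A minimal sketch of turning such a matrix into ranked results follows. The helper `rank_passages`, the passage labels, and the numbers in `example_scores` are all hypothetical and are not real model output.

```python
def rank_passages(score_rows, passages, top_k=1):
    """For each query's row of scores, return the top_k (passage, score) pairs, best first."""
    ranked = []
    for row in score_rows:
        if len(row) != len(passages):
            raise ValueError("each score row needs one score per passage")
        # Passage indices sorted by descending score
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append([(passages[j], row[j]) for j in order[:top_k]])
    return ranked

# Made-up scores in the same layout as scores.tolist() (rows: queries, columns: passages)
example_scores = [[90.0, 70.0], [68.0, 91.0]]
example_passages = ["protein passage", "summit passage"]

best = rank_passages(example_scores, example_passages)
for query_index, hits in enumerate(best):
    print(query_index, hits)
```

With real output, `score_rows` would be `scores.tolist()` and `passages` would be the passage texts in the order they were embedded.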

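📚 Documentation

Training details

For the model's training details, see our paper (https://arxiv.org/abs/2404.12096).

Benchmark evaluation

To reproduce the model's evaluation results on the BEIR and MTEB benchmarks, refer to unilm/e5.

Note that E5-RoPE-Base was not trained specifically to maximize performance. Its purpose is to compare embedding models that use absolute position embedding (APE) with those that use rotary position embedding (RoPE). By comparing E5-Base and E5-RoPE-Base, we show the advantage of RoPE-based embedding models on long contexts. For more details, see our paper LongEmbed: Extending Embedding Models for Long Context Retrieval.

📄 License

This project is released under the MIT license.

📖 Citation

If you find our paper or models helpful, please cite them as follows: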

@article{zhu2024longembed,
  title={LongEmbed: Extending Embedding Models for Long Context Retrieval},
  author={Zhu, Dawei and Wang, Liang and Yang, Nan and Song, Yifan and Wu, Wenhao and Wei, Furu and Li, Sujian},
  journal={arXiv preprint arXiv:2404.12096},
  year={2024}
}