E5rope-base: an open-source embedding model with free support for long-context retrieval tasks

E5rope Base

Developed by dwzhu

E5-RoPE-Base is an embedding model built on rotary position embedding (RoPE) and designed to support long-context retrieval tasks.

Text embedding

Safetensors

Language: English
Open-source license: MIT
Tags: #long-context-retrieval #rotary-position-embedding #sentence-similarity

Downloads: 129

Release date: 4/18/2024

Model overview

This model is mainly used for sentence similarity and long-context retrieval. It uses rotary position embedding (RoPE) to improve its handling of long texts.

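Model features

Rotary position embedding (RoPE)

Uses rotary position embedding to handle long-context retrieval tasks effectively.

Efficient retrieval

Aimed at improving the retrieval performance of embedding models on long contexts.

Multi-task support

Supports multiple tasks, including sentence similarity and long-context retrieval.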
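To make the rotary position embedding feature above more concrete, here is a minimal pure-Python sketch of the general RoPE idea. It is not taken from this model's code: the function names, the adjacent-pair layout, and the `base=10000.0` default are assumptions for illustration. It rotates each pair of dimensions by an angle proportional to the token position, then checks that the dot product of a rotated query and key depends only on the distance between their positions.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (RoPE-style)."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, dim, 2):
        # The pair starting at index i uses frequency base^(-i/dim):
        # early pairs rotate quickly with position, later pairs slowly.
        theta = position * base ** (-i / dim)
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy 4-dimensional query and key vectors (illustrative values only).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same offset (3 tokens) at two very different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(abs(score_near - score_far) < 1e-9)
```

Because the score depends on the relative offset rather than on absolute positions, a model using this scheme has no fixed table of learned positions to run out of, which is one reason RoPE is attractive for long inputs.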

Model capabilities

Sentence similarity

Long-context retrieval

Text embedding generation

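Use cases

Information retrieval

Query-passage matching

Matches queries with relevant passages to improve the accuracy of a search system.

Reported to perform well on the BEIR and MTEB benchmarks.

Semantic similarity

Sentence similarity

Computes the semantic similarity between two sentences.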
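Semantic similarity between two sentences is usually measured as the cosine similarity of their embedding vectors. The sketch below shows that calculation in plain Python. The vectors are made-up 3-dimensional stand-ins, not real model output, and the function and variable names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real sentence embeddings (illustrative values).
emb_sentence_1 = [0.2, 0.9, -0.1]
emb_sentence_2 = [0.25, 0.8, 0.0]
emb_unrelated = [-0.7, 0.1, 0.6]

print(cosine_similarity(emb_sentence_1, emb_sentence_2))  # close to 1
print(cosine_similarity(emb_sentence_1, emb_unrelated))   # much lower
```

When the embeddings are already L2-normalized, the cosine similarity reduces to a plain dot product.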

🚀 E5-RoPE-Base

An embedding model for long-context retrieval, proposed in the paper "LongEmbed: Extending Embedding Models for Long Context Retrieval".

The model has 12 layers and an embedding size of 768.

🚀 Quick start

The model can be used to encode queries and passages from the MS-MARCO passage ranking dataset.
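The usage example on this page notes that every input text should start with "query: " or "passage: ". A small helper along the following lines can add those prefixes consistently. `add_prefix` is a hypothetical name introduced here for illustration, not part of the model or of any library.

```python
def add_prefix(texts, role):
    """Prepend the "query: " or "passage: " marker to each text."""
    if role not in ("query", "passage"):
        raise ValueError(f"role must be 'query' or 'passage', got {role!r}")
    return [f"{role}: {text}" for text in texts]

queries = add_prefix(["summit define"], "query")
passages = add_prefix(["Definition of summit for English Language Learners."], "passage")

# Queries first, then passages, matching the layout the usage example relies on.
input_texts = queries + passages
print(input_texts)
```

Rejecting unknown roles early avoids silently encoding text with a prefix the model was never trained on.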

💻 Usage example

Basic usage

import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average over the real tokens only.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
input_texts = ['query: how much protein should a female eat',
               'query: summit define',
               "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
               "passage: Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."]
tokenizer = AutoTokenizer.from_pretrained('dwzhu/e5rope-base', trust_remote_code=True)
# Use a GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = AutoModel.from_pretrained('dwzhu/e5rope-base', trust_remote_code=True).to(device)
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt', pad_to_multiple_of=8)
batch_dict = {k: v.to(device) for k, v in batch_dict.items()}
# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
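# Rows are the two queries, columns are the two passages;
# values are cosine similarities scaled by 100.
scores = (embeddings[:2] @ embeddings[2:].T) * 100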
print(scores.tolist())
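The printed `scores` form a matrix with one row per query and one column per passage. One way to turn such a matrix into a ranking is sketched below. `rank_passages` is a hypothetical helper, and the numbers in `example_scores` are invented for illustration, not actual model output.

```python
def rank_passages(scores):
    """For each query row, return passage indices sorted from best to worst score."""
    return [
        sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        for row in scores
    ]

# Made-up values shaped like `scores.tolist()`: 2 queries x 2 passages.
example_scores = [[90.0, 70.0], [68.0, 91.0]]

ranking = rank_passages(example_scores)
print(ranking)  # [[0, 1], [1, 0]]
```

Here the first query ranks passage 0 highest and the second query ranks passage 1 highest. In a real retrieval system the same sort would be applied over many more passages, typically keeping only the top few.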

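📚 Documentation

Training details

For training details, see the paper at https://arxiv.org/abs/2404.12096.pdf.

Benchmark evaluation

To reproduce the evaluation results on the BEIR and MTEB benchmarks, see unilm/e5.

Note that E5-RoPE-Base was not specifically trained for optimal performance. Its purpose is to compare embedding models that use absolute position embedding (APE) with embedding models that use rotary position embedding (RoPE). Comparing E5-Base with E5-RoPE-Base shows the advantage of RoPE-based embedding models in handling long contexts effectively. For details, see the paper "LongEmbed: Extending Embedding Models for Long Context Retrieval".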

📄 License

This model is released under the MIT license.

Citation

If you find this paper or model useful, please cite it as follows.

@article{zhu2024longembed,
  title={LongEmbed: Extending Embedding Models for Long Context Retrieval},
  author={Zhu, Dawei and Wang, Liang and Yang, Nan and Song, Yifan and Wu, Wenhao and Wei, Furu and Li, Sujian},
  journal={arXiv preprint arXiv:2404.12096},
  year={2024}
}