multilingual-e5-small-4096: an open-source multilingual text embedding model that supports about 4k tokens

Multilingual E5 Small 4096

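Developed by efederici

A Local-Sparse-Global version of intfloat/multilingual-e5-small: a multilingual text embedding model that supports about 4k tokens

Text Embedding

Transformers

Multilingual #multilingual-text-embedding #long-text-processing #weakly-supervised-contrastive-learning

Downloads: 16

Release date: 8/7/2023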

Model Overview

This is a multilingual text embedding model that accepts inputs of up to 4096 tokens and is suited to tasks such as sentence similarity computation.

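Model Features

Multilingual support

Supports text embeddings for more than 100 languages

Long-text processing

Can process long inputs of about 4096 tokens

Local-Sparse-Global architecture

Adopts the Local-Sparse-Global technique to optimize the model for long inputs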

Model Capabilities

Multilingual text embedding

Sentence similarity computation

Cross-language retrieval

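Use Cases

Information retrieval

Cross-language document retrieval

Retrieve relevant content from document collections written in different languages

Semantic similarity

Multilingual sentence similarity computation

Compute the semantic similarity between sentences in different languages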
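
As a rough illustration of the retrieval use case above, the sketch below ranks documents against a query by cosine similarity. The three-dimensional vectors are made-up stand-ins for real embeddings (the model's actual vectors are larger), and `cosine` and `rank_documents` are hypothetical helper names, not part of any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs, top_k=2):
    """Return (index, score) pairs for the top_k most similar documents."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy stand-ins for the embeddings of one query and three documents
query = [1.0, 0.0, 0.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # in between
]
ranking = rank_documents(query, docs)
print(ranking)  # document 1 ranks first, then document 2
```

With real embeddings, the query and the documents may be written in different languages; the ranking logic stays the same.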

🚀 multilingual-e5-small-4096

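A Local-Sparse-Global version of intfloat/multilingual-e5-small. It can process up to about 4k tokens.

🚀 Quick Start

This model is intended for computing sentence similarity. The example below encodes queries and passages from the MS-MARCO passage ranking dataset.

💻 Usage Examples

Basic usage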

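from sentence_transformers import SentenceTransformer

# trust_remote_code must be passed as a keyword argument; the second positional
# parameter of SentenceTransformer is `modules`, not a dict of options.
model = SentenceTransformer('efederici/multilingual-e5-small-4096', trust_remote_code=True)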
input_texts = [
    'query: how much protein should a female eat',
    'query: summit define',
    "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "passage: Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]
# L2-normalize the embeddings so that dot products equal cosine similarities
embeddings = model.encode(input_texts, normalize_embeddings=True)
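
Each input text above starts with either `query: ` or `passage: `. A small helper can apply those prefixes consistently. This is only a sketch: `add_prefix` is a hypothetical name chosen for illustration, and the prefix strings are copied from the example inputs.

```python
def add_prefix(texts, kind):
    """Prepend 'query: ' or 'passage: ' to each text, as in the example inputs."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {text.strip()}" for text in texts]

queries = add_prefix(["how much protein should a female eat", "summit define"], "query")
passages = add_prefix(["Definition of summit for English Language Learners."], "passage")
input_texts = queries + passages
print(input_texts)
```

The resulting list can be passed to `model.encode(...)` in place of the hand-written `input_texts`.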

Advanced usage

import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Zero out padded positions, then average over the real tokens only
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Each input text starts with "query: " or "passage: "
input_texts = [
    'query: how much protein should a female eat',
    'query: summit define',
    "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "passage: Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]

tokenizer = AutoTokenizer.from_pretrained('efederici/multilingual-e5-small-4096')
model = AutoModel.from_pretrained('efederici/multilingual-e5-small-4096', trust_remote_code=True)

# Tokenize the input texts, truncating anything beyond 4096 tokens
batch_dict = tokenizer(input_texts, max_length=4096, padding=True, truncation=True, return_tensors='pt')

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings; with L2 normalization the dot product below is a cosine similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
# Similarity of each of the 2 queries to each of the 2 passages, scaled by 100
scores = (embeddings[:2] @ embeddings[2:].T) * 100

print(scores.tolist())
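
To make the pooling step concrete, here is a dependency-free sketch of what `average_pool` computes for a single sequence: padded positions (mask value 0) are ignored and the remaining token vectors are averaged. The numbers are invented for illustration, and `masked_mean` is a hypothetical name, not part of the model code.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            for j in range(dim):
                total[j] += vec[j]
    n_real = sum(attention_mask)  # number of non-padding tokens
    return [value / n_real for value in total]

# Three token positions with 2-dimensional hidden states; the last one is padding
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(hidden, mask)
print(pooled)  # [2.0, 3.0]
```

Because the padded position is excluded, its large values have no effect on the pooled vector, which is why the attention mask is passed alongside the hidden states.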

Citation

@article{wang2022text,
  title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2212.03533},
  year={2022}
}