gte-small: a free, open-source general-purpose text embedding model for information retrieval and semantic similarity analysis

Gte Small

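Developed by Supabase

GTE-small is a general-purpose text embedding model trained by Alibaba DAMO Academy. It is based on the BERT framework and is suited to tasks such as information retrieval and semantic textual similarity.

Text embedding

Transformers

Language: English

Open-source license: MIT

Tags: #English text embedding #Information retrieval optimization #Semantic similarity computation

Downloads: 481.27k

Release date: 8/1/2023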

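Model overview

The GTE model series is trained on a large corpus of relevant text pairs covering multiple domains and scenarios, and can be applied to downstream tasks such as information retrieval, semantic textual similarity, and text reranking.

Model features

Multi-domain applicability

Training on a large corpus of relevant text pairs lets the model cover scenarios across multiple domains.

High performance

Performs well on the MTEB benchmark, with an average score of 61.36 across 56 tasks.

Lightweight

The model is only 0.07 GB in size, which suits resource-constrained environments.

Model capabilities

Text feature extraction

Semantic textual similarity computation

Information retrieval

Text reranking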

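Use cases

Information retrieval

Document search

Used to build efficient document search engines.

Improves the relevance of search results.

Semantic analysis

Text similarity computation

Computes the semantic similarity between two texts.

Scores 82.07 on the MTEB STS tasks (average over 10 datasets).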
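
Both use cases above reduce to comparing embedding vectors. The following is a minimal sketch of that comparison, assuming cosine similarity as the measure and using toy 3-dimensional vectors in place of real gte-small embeddings (which have 384 dimensions). The helper names `cosine_similarity` and `rank_documents` are hypothetical and are not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted by descending similarity."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real embeddings (assumption for illustration).
query = [1.0, 0.0, 1.0]
docs = [
    [0.9, 0.1, 0.8],  # points in nearly the same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.0, 0.0],  # partially aligned with the query
]

ranking = rank_documents(query, docs)
for index, score in ranking:
    print(f"doc {index}: {score:.4f}")
```

In a real search engine the document vectors would be computed once and stored, and only the query would be embedded at request time.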

🚀 gte-small

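The General Text Embeddings (GTE) models are text embedding models developed by Alibaba DAMO Academy. They are built on the BERT framework and trained on a large corpus of relevant text pairs. This training lets them cover a wide range of domains and scenarios, so they can be applied to downstream tasks such as information retrieval, semantic textual similarity, and text reranking.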

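✨ Key features

A general-purpose text embedding model trained on a large corpus.
Applicable to a variety of downstream tasks.
Usable from both Python and JavaScript.

📦 Installation

This model can be used through the transformers library. In a Python environment, install the dependencies with the following command.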

pip install transformers torch

In a JavaScript environment, use @xenova/transformers.

npm install @xenova/transformers

💻 Usage examples

Basic usage

Usage in Python

import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

tokenizer = AutoTokenizer.from_pretrained("Supabase/gte-small")
model = AutoModel.from_pretrained("Supabase/gte-small")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
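
To make the pooling step concrete, here is a dependency-free sketch of what `average_pool` above computes: the mean of the token vectors in each sequence, skipping positions where the attention mask is 0 (padding). It uses nested Python lists instead of tensors, and the name `masked_mean` and the toy values are assumptions for illustration only.

```python
def masked_mean(hidden_states, attention_mask):
    """Masked mean over the token axis.

    hidden_states: nested lists shaped [batch][tokens][dim]
    attention_mask: nested lists shaped [batch][tokens], 1 = real token, 0 = padding
    """
    pooled = []
    for token_vecs, mask in zip(hidden_states, attention_mask):
        dim = len(token_vecs[0])
        total = [0.0] * dim
        for vec, keep in zip(token_vecs, mask):
            if keep:  # padded positions contribute nothing to the sum
                for d in range(dim):
                    total[d] += vec[d]
        count = sum(mask)  # number of real tokens in this sequence
        pooled.append([t / count for t in total])
    return pooled

hidden = [
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],      # no padding
]
mask = [[1, 1, 0], [1, 1, 1]]

pooled = masked_mean(hidden, mask)
print(pooled)  # [[2.0, 3.0], [4.0, 2.0]]
```

The padded vector `[100.0, 100.0]` has no effect on the first result, which is why the mask matters when inputs of different lengths are batched with `padding=True`.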

Usage in JavaScript

Example for Deno or Supabase Edge Functions:

import { serve } from 'https://deno.land/std@0.168.0/http/server.ts'
import { env, pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.5.0'

// Configuration for Deno runtime
env.useBrowserCache = false;
env.allowLocalModels = false;

const pipe = await pipeline(
  'feature-extraction',
  'Supabase/gte-small',
);

serve(async (req) => {
  // Extract input string from JSON body
  const { input } = await req.json();

  // Generate the embedding from the user input
  const output = await pipe(input, {
    pooling: 'mean',
    normalize: true,
  });

  // Extract the embedding output
  const embedding = Array.from(output.data);

  // Return the embedding
  return new Response(
    JSON.stringify({ embedding }),
    { headers: { 'Content-Type': 'application/json' } }
  );
});

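Advanced usage

Example using `sentence-transformers` in Python (requires the separate sentence-transformers package: pip install sentence-transformers)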

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['That is a happy person', 'That is a very happy person']

model = SentenceTransformer('Supabase/gte-small')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))

Example of using JavaScript in the browser

<script type="module">

import { pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.5.0';

const pipe = await pipeline(
  'feature-extraction',
  'Supabase/gte-small',
);

// Generate the embedding from text
const output = await pipe('Hello world', {
  pooling: 'mean',
  normalize: true,
});

// Extract the embedding output
const embedding = Array.from(output.data);

console.log(embedding);

</script>

Example for Node.js or Webpack

import { pipeline } from '@xenova/transformers';

const pipe = await pipeline(
  'feature-extraction',
  'Supabase/gte-small',
);

// Generate the embedding from text
const output = await pipe('Hello world', {
  pooling: 'mean',
  normalize: true,
});

// Extract the embedding output
const embedding = Array.from(output.data);

console.log(embedding);

📚 Documentation

Metrics

The performance of the GTE models is compared with other popular text embedding models on the MTEB benchmark. For detailed comparison results, see the MTEB leaderboard.

Model name	Model size (GB)	Dimensions	Sequence length	Average (56)	Clustering (11)	Pair classification (3)	Reranking (4)	Retrieval (15)	STS (10)	Summarization (1)	Classification (12)
gte-large	0.67	1024	512	63.13	46.84	85.00	59.13	52.22	83.35	31.66	73.33
gte-base	0.22	768	512	62.39	46.2	84.57	58.61	51.14	82.3	31.17	73.01
e5-large-v2	1.34	1024	512	62.25	44.49	86.03	56.61	50.56	82.05	30.19	75.24
e5-base-v2	0.44	768	512	61.5	43.80	85.73	55.91	50.29	81.05	30.28	73.84
gte-small	0.07	384	512	61.36	44.89	83.54	57.7	49.46	82.07	30.42	72.31
text-embedding-ada-002	-	1536	8192	60.99	45.9	84.89	56.32	49.25	80.97	30.8	70.93
e5-small-v2	0.13	384	512	59.93	39.92	84.67	54.32	49.04	80.39	31.16	72.94
sentence-t5-xxl	9.73	768	512	59.51	43.72	85.06	56.42	42.24	82.63	30.08	73.42
all-mpnet-base-v2	0.44	768	514	57.78	43.69	83.04	59.36	43.81	80.28	27.49	65.07
sgpt-bloom-7b1-msmarco	28.27	4096	2048	57.59	38.93	81.9	55.65	48.22	77.74	33.6	66.19
all-MiniLM-L12-v2	0.13	384	512	56.53	41.81	82.41	58.44	42.69	79.8	27.9	63.21
all-MiniLM-L6-v2	0.09	384	512	56.26	42.35	82.37	58.04	41.95	78.9	30.81	63.05
contriever-base-msmarco	0.44	768	512	56.00	41.1	82.54	53.14	41.88	76.51	30.36	66.68
sentence-t5-base	0.22	768	512	55.27	40.21	85.18	53.09	33.63	81.14	31.39	69.81
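
The dimensions column also determines how much space stored embeddings take. The sketch below is a rough estimate under stated assumptions: each value is stored as a 4-byte float32, index overhead is ignored, and the corpus size of one million documents is hypothetical. The dimension counts are taken from the table above.

```python
BYTES_PER_FLOAT32 = 4  # assumption: embeddings stored as 32-bit floats

def index_size_bytes(num_vectors, dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage for num_vectors embeddings, excluding any index overhead."""
    return num_vectors * dimensions * bytes_per_value

# Dimension counts from the comparison table.
dimensions_by_model = {
    "gte-small": 384,
    "gte-base": 768,
    "gte-large": 1024,
    "text-embedding-ada-002": 1536,
}

num_docs = 1_000_000  # hypothetical corpus size

for name, dims in dimensions_by_model.items():
    size_gb = index_size_bytes(num_docs, dims) / 1e9
    print(f"{name}: {size_gb:.2f} GB")
```

Under these assumptions, a 384-dimensional vector occupies 1,536 bytes, a quarter of the raw storage needed for a 1,536-dimensional one.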