Multilingual-Text-Semantic-Search-Siamese-BERT-V1 open-source model: a practical choice for multilingual semantic text search

Multilingual Text Semantic Search Siamese BERT V1

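Developed by SeyedAli

A multilingual semantic text search model based on the Siamese-BERT architecture. It was trained on 215 million (question, answer) pairs and produces normalized 384-dimensional embedding vectors.

Text embedding #Multilingual semantic search #Question-answer matching #High-dimensional dense vectors

Downloads: 166

Released: 9/26/2023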

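Model overview

This model is designed for semantic search. It maps sentences and paragraphs into a 384-dimensional dense vector space and supports computing semantic similarity between multilingual texts.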

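Model features

Large-scale training data

Trained on 215 million (question, answer) pairs from 11 different data sources

Efficient semantic search

Optimized for semantic search scenarios and supports fast text similarity computation

Normalized embeddings

Produces normalized 384-dimensional embedding vectors, so dot product and cosine similarity give the same scores

Multilingual support

Trained mainly on English data, but can handle multilingual semantic text search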

Model capabilities

Semantic text encoding

Semantic similarity computation

Question-answer matching

Information retrieval

Multilingual text processing

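Use cases

Information retrieval

Question answering systems

Match user questions against candidate answers in a knowledge base

Finds the answers most relevant to the meaning of the query

Document retrieval

Retrieve relevant document passages based on the meaning of the query

Returns more relevant results than keyword search

Content recommendation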

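🚀 Multilingual Text Semantic Search Siamese BERT

This model is built on sentence-transformers. It maps sentences and paragraphs into a 384-dimensional dense vector space and is optimized for semantic search. It was trained on 215 million (question, answer) pairs collected from diverse sources. For more on semantic search, see SBERT.net - Semantic Search.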

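🚀 Quick start

✨ Key features

Maps sentences and paragraphs into a 384-dimensional dense vector space.
Trained on 215 million (question, answer) pairs collected from diverse sources.
Optimized for semantic search.

📦 Installation

Installing sentence-transformers makes this model easy to use.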

pip install -U sentence-transformers

💻 Usage examples

Basic usage

from sentence_transformers import SentenceTransformer, util

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

#Load the model
model = SentenceTransformer('SeyedAli/Multilingual-Text-Semantic-Search-Siamese-BERT-V1')

#Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)

#Compute dot score between query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

#Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

#Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

#Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)

Advanced usage

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

#Mean Pooling - Average the token embeddings, using the attention mask to ignore padding
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


#Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    embeddings = F.normalize(embeddings, p=2, dim=1)

    return embeddings


# Sentences we want sentence embeddings for
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("SeyedAli/Multilingual-Text-Semantic-Search-Siamese-BERT-V1")
model = AutoModel.from_pretrained("SeyedAli/Multilingual-Text-Semantic-Search-Siamese-BERT-V1")

#Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)

#Compute dot score between query and all document embeddings
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

#Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

#Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

#Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)

# TensorFlow version of the same pipeline
from transformers import AutoTokenizer, TFAutoModel
import tensorflow as tf

#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = tf.cast(tf.tile(tf.expand_dims(attention_mask, -1), [1, 1, token_embeddings.shape[-1]]), tf.float32)
    return tf.math.reduce_sum(token_embeddings * input_mask_expanded, 1) / tf.math.maximum(tf.math.reduce_sum(input_mask_expanded, 1), 1e-9)


#Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='tf')

    # Compute token embeddings
    model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    embeddings = tf.math.l2_normalize(embeddings, axis=1)

    return embeddings


# Sentences we want sentence embeddings for
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("SeyedAli/Multilingual-Text-Semantic-Search-Siamese-BERT-V1")
model = TFAutoModel.from_pretrained("SeyedAli/Multilingual-Text-Semantic-Search-Siamese-BERT-V1")

#Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)

#Compute dot score between query and all document embeddings
scores = (query_emb @ tf.transpose(doc_emb))[0].numpy().tolist()

#Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

#Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

#Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)

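🔧 Technical details

Some technical details on how to use this model:

Setting	Value
Dimensions	384
Produces normalized embeddings	Yes
Pooling method	Mean pooling
Suitable score functions	Dot product (`util.dot_score`), cosine similarity (`util.cos_sim`), or Euclidean distance

Note: When loaded with sentence-transformers, this model produces normalized embeddings of length 1. In that case, dot product and cosine similarity are equivalent. Dot product is preferred because it is faster. Euclidean distance can also be used: for unit-length vectors the squared distance equals 2 - 2 * (dot product), so it decreases monotonically as the dot product increases and yields the same ranking.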
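The relationship between the three score functions can be checked numerically. The sketch below uses two made-up 3-dimensional vectors rather than real model output (an assumption, so that the example runs without downloading the model), and the helper names are illustrative rather than part of sentence-transformers.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cos_sim(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up vectors standing in for embeddings (not real model output).
a = l2_normalize([1.0, 2.0, 3.0])
b = l2_normalize([2.0, 0.5, 1.0])

dot_ab = dot(a, b)
cos_ab = cos_sim(a, b)
dist_ab = euclidean(a, b)

# For unit vectors: dot == cosine, and ||a - b||^2 == 2 - 2 * dot.
print(abs(dot_ab - cos_ab) < 1e-9)
print(abs(dist_ab ** 2 - (2 - 2 * dot_ab)) < 1e-9)
```

Because the squared distance is a decreasing function of the dot product, sorting by ascending Euclidean distance gives the same order as sorting by descending dot score.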

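📚 Detailed documentation

Background

The goal of this project is to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective. Given one sentence of a pair, the model is trained to predict which sentence, out of a set of randomly sampled other sentences, was actually paired with it in the dataset.

This model was developed during the Community week using JAX/Flax for NLP & CV organized by Hugging Face, as part of the project Train the Best Sentence Embedding Model Ever with 1B Training Pairs. The project had access to efficient hardware infrastructure (7 TPU v3-8s) and received advice on efficient deep learning frameworks from members of Google's Flax, JAX, and Cloud teams.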

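Intended use

This model is intended for semantic search. It encodes queries/questions and text paragraphs into a dense vector space and finds the documents relevant to a given passage.

Note that there is a limit of 512 word pieces; longer text is truncated. The model was also trained only on input text of up to 250 word pieces, so it may not work well on longer text.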
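One common way to work within these limits is to split long documents into shorter passages before encoding them. The following is a minimal sketch, assuming whitespace-separated words as a rough stand-in for word pieces (a real word-piece count is usually higher, so the hypothetical `max_words` budget is kept well below 250). `split_into_passages` is an illustrative helper, not part of sentence-transformers.

```python
def split_into_passages(text, max_words=150, overlap=30):
    """Split text into overlapping windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    passages = []
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        passages.append(" ".join(window))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return passages

# Synthetic 400-word document used only for illustration.
long_text = " ".join(f"w{i}" for i in range(400))
passages = split_into_passages(long_text)
print(len(passages), [len(p.split()) for p in passages])
```

Each passage can then be encoded and scored on its own, for example keeping the best passage score per document.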

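Training procedure

The full training script is available in this repository as train_script.py.

Pre-training

The model uses the pretrained nreimers/MiniLM-L6-H384-uncased model. See its model card for details of the pre-training procedure.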

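Training

The model is fine-tuned on a concatenation of multiple datasets, about 215 million (question, answer) pairs in total. Each dataset is sampled with a weighted probability; the weights are detailed in the data_config.json file.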
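Weighted sampling of this kind can be sketched with the standard library. The dataset names and weights below are hypothetical placeholders, not the values from data_config.json, and the actual training script may implement sampling differently.

```python
import random
from collections import Counter

# Hypothetical weights; the actual values live in data_config.json.
dataset_weights = {
    "dataset_a": 5.0,
    "dataset_b": 3.0,
    "dataset_c": 1.0,
}

def sample_datasets(weights, num_batches, seed=0):
    """Pick the source dataset for each batch in proportion to its weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_batches)

schedule = sample_datasets(dataset_weights, num_batches=9000)
counts = Counter(schedule)
for name in dataset_weights:
    print(name, counts[name])
```

Over many batches, each dataset is drawn roughly in proportion to its weight, so larger or more valuable sources contribute more training examples.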

The model was trained with MultipleNegativesRankingLoss, using mean pooling, cosine similarity as the similarity function, and a scale of 20.
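The loss can be illustrated with a small pure-Python sketch. It follows the usual definition of a multiple-negatives ranking loss (cross-entropy over scaled cosine similarities, with the other answers in the batch acting as negatives). It is not the sentence-transformers implementation, and the toy vectors are made up.

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def multiple_negatives_ranking_loss(questions, answers, scale=20.0):
    """Mean cross-entropy where answers[i] is the positive for questions[i]
    and every other answer in the batch is a negative."""
    losses = []
    for i, q in enumerate(questions):
        logits = [scale * cos_sim(q, a) for a in answers]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D "embeddings" (made up): each question points near its own answer.
questions = [[1.0, 0.1], [0.1, 1.0]]
answers = [[0.9, 0.2], [0.2, 0.9]]

aligned = multiple_negatives_ranking_loss(questions, answers)
shuffled = multiple_negatives_ranking_loss(questions, answers[::-1])
print(aligned < shuffled)
```

The loss is small when each question is closest to its own answer and grows when the pairing is broken; the scale factor sharpens the softmax so that small cosine differences matter.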

Dataset	Number of training tuples
WikiAnswers: duplicate question pairs	77,427,422
PAQ: automatically generated (question, paragraph) pairs for each paragraph in Wikipedia	64,371,441
Stack Exchange: (title, body) pairs from all StackExchanges	25,316,456
Stack Exchange: (title, answer) pairs from all StackExchanges	21,396,559
MS MARCO: triplets (query, answer, hard negative) for 500k queries from the Bing search engine	17,579,773
GOOAQ: Open Question Answering with Diverse Answer Types: (query, answer) pairs for 3M Google queries and Google featured snippets	3,012,496
Amazon-QA: (question, answer) pairs from Amazon product pages	2,448,839
Yahoo Answers: (title, answer) pairs	1,198,260
Yahoo Answers: (question, answer) pairs	681,164
Yahoo Answers: (title, question) pairs	659,896
SearchQA: (question, answer) pairs for 140k questions, each with the top 5 Google snippets	582,261
ELI5: (question, answer) pairs from Reddit ELI5 (explainlikeimfive)	325,475
Stack Exchange: duplicate question pairs (titles)	304,525
Quora Question Triplets: triplets (question, duplicate question, hard negative) from the Quora Question Pairs dataset	103,663
Natural Questions (NQ): (question, paragraph) pairs for 100k real Google queries with a relevant Wikipedia paragraph	100,231
SQuAD2.0: (question, paragraph) pairs from the SQuAD2.0 dataset	87,599
TriviaQA: (question, evidence) pairs	73,346
Total	214,988,242