paraphrase-spanish-distilroberta: an open-source bilingual model for semantic search and clustering tasks, free to deploy


Paraphrase Spanish Distilroberta

Developed by somosnlp-hackathon-2022

A Spanish-English bilingual model based on sentence-transformers. It maps text to a 768-dimensional vector space and is suited to semantic search and clustering tasks.

Text Embedding

Transformers

Spanish #Spanish semantic encoding #Bilingual parallel training #Teacher-student architecture

Downloads: 17.25k

Release date: 3/30/2022

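Model Overview

This model was trained with a teacher-student transfer learning method. It converts Spanish sentences and paragraphs into dense vectors that carry semantic information, and is particularly suited to cross-lingual or monolingual text similarity tasks.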

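Model Features

Bilingual vector representation

Encodes Spanish and English text into a shared vector space, enabling cross-lingual semantic matching.

Efficient distilled architecture

A lightweight design based on DistilRoBERTa that improves inference efficiency while maintaining performance.

Transfer learning optimization

Uses a teacher-student training paradigm that transfers knowledge through a parallel corpus.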

Model Capabilities

Sentence vectorization

Cross-lingual semantic search

Text clustering analysis

Semantic similarity computation

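Use Cases

Information retrieval

Cross-lingual document retrieval

Search a mixed collection of Spanish and English documents through a unified vector space.

Text analysis

Similar question identification

Automatically identify semantically similar customer inquiries in a customer support system.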

🚀 paraphrase-spanish-distilroberta

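This is a sentence-transformers model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering and semantic search.

We trained the bertin-roberta-base-spanish model with a teacher-student transfer learning approach, using parallel English-Spanish sentence pairs.

🚀 Quick Start

📦 Installation

Using this model is straightforward once sentence-transformers is installed.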

pip install -U sentence-transformers

💻 Usage Examples

Basic usage

from sentence_transformers import SentenceTransformer
sentences = ["Este es un ejemplo", "Cada oración es transformada"]

model = SentenceTransformer('hackathon-pln-es/paraphrase-spanish-distilroberta')
embeddings = model.encode(sentences)
print(embeddings)

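Advanced usage

Without sentence-transformers, you can use the model as follows. First pass the input through the transformer model, then apply the appropriate pooling operation on top of the contextualized word embeddings.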

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["Este es un ejemplo", "Cada oración es transformada"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('hackathon-pln-es/paraphrase-spanish-distilroberta')
model = AutoModel.from_pretrained('hackathon-pln-es/paraphrase-spanish-distilroberta')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
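
To make the role of the attention mask in mean_pooling concrete, here is a minimal pure-Python sketch of the same idea on toy data. The helper name masked_mean and the 2-dimensional toy vectors are illustrative assumptions, not part of the model or of any library; real token embeddings are 768-dimensional tensors.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                totals[k] += vector[k]
    # Mirror the clamp in mean_pooling: avoid dividing by zero for an all-padding row
    denom = max(count, 1e-9)
    return [t / denom for t in totals]


# Toy sentence with two real tokens and one padding position (hypothetical values)
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
# A naive average that ignores the mask lets the padding row distort the result
naive = [sum(column) / len(tokens) for column in zip(*tokens)]

print(pooled)
print(naive)
```

The masked average uses only the two real tokens, while the naive average is pulled toward the padding row. That difference is why the pooling step above multiplies by the expanded attention mask before summing.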

📚 Detailed Documentation

Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

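Evaluation Results

Similarity was evaluated on STS-2017.es-en.txt and STS-2017.es-es.txt (translated manually for evaluation purposes). We measured the semantic textual similarity (STS) between sentence pairs in different languages.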

ES - ES

cosine_pearson	cosine_spearman	manhattan_pearson	manhattan_spearman	euclidean_pearson	euclidean_spearman	dot_pearson	dot_spearman
0.8495	0.8579	0.8675	0.8474	0.8676	0.8478	0.8277	0.8258

ES - EN

cosine_pearson	cosine_spearman	manhattan_pearson	manhattan_spearman	euclidean_pearson	euclidean_spearman	dot_pearson	dot_spearman
0.8344	0.8448	0.8279	0.8168	0.8282	0.8159	0.8083	0.8145
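
The column names pair a similarity function (cosine, Manhattan, Euclidean, dot product) with a correlation measure (Pearson or Spearman) computed against the gold STS scores. The sketch below shows how the two correlation measures differ, using hypothetical gold scores and hypothetical cosine similarities; it is an illustration of the metrics, not the evaluation script used for the tables above.

```python
import math


def pearson(x, y):
    """Pearson correlation: measures linear agreement between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: human similarity judgments and model cosine similarities
gold = [0.0, 1.0, 2.0, 3.0, 5.0]
predicted = [0.10, 0.20, 0.30, 0.40, 0.95]

cosine_pearson = pearson(gold, predicted)
cosine_spearman = spearman(gold, predicted)
print(round(cosine_pearson, 4), round(cosine_spearman, 4))
```

In this toy data the predictions rank the pairs in exactly the same order as the gold scores, so the Spearman value is 1, while the Pearson value is lower because the relationship is not perfectly linear.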

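Intended Uses

This model is intended to be used as an encoder for sentences and short paragraphs. Given an input text, it outputs a vector that captures the semantic information. The sentence vector can be used for information retrieval, clustering, or sentence similarity tasks.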
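
As a rough illustration of the retrieval use, the sketch below ranks corpus vectors by cosine similarity to a query vector. The function names normalize and top_k and the 3-dimensional toy vectors are assumptions made for the example; in practice the vectors would come from model.encode(...) as in the basic usage section.

```python
import math


def normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def top_k(query_embedding, corpus_embeddings, k=2):
    """Return the k (index, cosine score) pairs most similar to the query."""
    q = normalize(query_embedding)
    scored = []
    for idx, doc in enumerate(corpus_embeddings):
        d = normalize(doc)
        # The dot product of unit vectors equals their cosine similarity
        scored.append((idx, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for 768-dimensional sentence embeddings
corpus = [
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
]
query = [1.0, 0.1, 0.0]

hits = top_k(query, corpus, k=2)
for idx, score in hits:
    print(idx, round(score, 3))
```

Because Spanish and English sentences share one vector space, the same ranking works when the query is in one language and the corpus in the other. For larger corpora, the sentence-transformers package ships a util.semantic_search helper that serves the same purpose on tensors.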

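Background

This is a bilingual (Spanish-English) model trained following the instructions in the paper Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation and the documentation accompanying its Python package. We used the strongest pretrained English bi-encoder (paraphrase-mpnet-base-v2) as the teacher model and the pretrained Spanish BERTIN as the student model.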
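
The training objective described in that paper can be sketched as a mean squared error that pulls the student's vectors for both the English sentence and its Spanish translation toward the teacher's vector for the English sentence. The code below is a toy illustration on 2-dimensional vectors with hypothetical names (mse, distillation_loss); it is not the training code used for this model.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def distillation_loss(teacher_en, student_en, student_es):
    """Sum of the MSE terms for the English source and the Spanish translation.

    teacher_en: teacher embedding of the English sentence
    student_en: student embedding of the same English sentence
    student_es: student embedding of the Spanish translation
    """
    return mse(teacher_en, student_en) + mse(teacher_en, student_es)


# Toy vectors (hypothetical values)
teacher_en = [1.0, 0.0]
student_en = [1.0, 0.0]   # already matches the teacher
student_es = [0.0, 0.0]   # Spanish side has not been aligned yet

loss = distillation_loss(teacher_en, student_en, student_es)
aligned_loss = distillation_loss(teacher_en, teacher_en, teacher_en)
print(loss, aligned_loss)
```

Minimizing this quantity over many parallel sentence pairs is what places a Spanish sentence and its English translation close together in the student's vector space.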

This model was developed during the Hackathon 2022 NLP - Spanish, organized by the hackathon-pln-es Organization.