The ukr-paraphrase multilingual model is available as open source. It is optimized for Ukrainian and can be used for semantic similarity and feature extraction.

Ukr Paraphrase Multilingual Mpnet Base

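Developed by lang-uk

A sentence-embedding model optimized for Ukrainian. It is based on the multilingual MPNet architecture and is suited to semantic similarity and feature extraction tasks.

Text Embedding · Open Source · License: Apache-2.0 · #Ukrainian-optimized #Multilingual semantic matching #768-dimensional dense vectors

Downloads: 1,110

Release date: 3/23/2024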

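Model Overview

This model maps Ukrainian sentences and paragraphs into a 768-dimensional dense vector space and supports natural language processing tasks such as clustering and semantic search.

Model Features

Ukrainian optimization

Fine-tuned specifically on Ukrainian, with the aim of producing more accurate semantic representations of Ukrainian text.

Multilingual support

Built on a multilingual model architecture, so it supports sentence embeddings for multiple languages.

Efficient semantic encoding

Converts text into 768-dimensional dense vectors that retain rich semantic information.

Model Capabilities

Sentence vectorization

Semantic similarity computation

Text clustering

Cross-lingual feature extraction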

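Use Cases

Information retrieval

Semantic search

Build search systems that match on meaning rather than on keywords

Improve search relevance and precision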
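The semantic search use case above comes down to ranking documents by how close their embeddings are to the query embedding. The following is a minimal sketch of that ranking step, assuming cosine similarity as the measure. The helper names `cosine_similarity` and `rank_documents` are hypothetical, and the toy 3-dimensional vectors stand in for the 768-dimensional vectors the model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (index, score) pairs for the top_k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (assumption): placeholders for real 768-dimensional embeddings.
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query_vec = [1.0, 0.0, 0.0]

results = rank_documents(query_vec, doc_vecs, top_k=2)
print(results)
```

In a real system, `query_vec` and `doc_vecs` would come from encoding the query and the documents with the model. Plain Python is used here only so the sketch runs without extra dependencies; a vectorized library would be the usual choice for large collections.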

Text analysis

Document clustering

Automatically group similar documents

Enable unsupervised document organization
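To make the document clustering use case concrete, here is a small sketch that groups embedding vectors with a greedy similarity threshold. This is an illustrative stand-in rather than a method described by the model authors. The function name `greedy_cluster`, the threshold of 0.8, and the toy 2-dimensional vectors are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_cluster(vectors, threshold=0.8):
    """Put each vector into the first cluster whose seed vector is similar enough."""
    clusters = []  # each cluster is a list of indices; the first index is the seed
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([i])
    return clusters


# Toy vectors (assumption): placeholders for real document embeddings.
vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.95, 0.05],
    [0.1, 0.9],
]

clusters = greedy_cluster(vectors, threshold=0.8)
print(clusters)
```

The result depends on input order and on the chosen threshold, so this is best read as a way to see the idea. For production use, an established algorithm such as k-means or agglomerative clustering is probably a better fit.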

🚀 lang-uk/ukr-paraphrase-multilingual-mpnet-base

This is a sentence-transformers model fine-tuned for Ukrainian. It maps sentences and paragraphs into a 768-dimensional dense vector space and can be used for tasks such as clustering and semantic search.

The original model used for fine-tuning is sentence-transformers/paraphrase-multilingual-mpnet-base-v2. For details, see our paper Contextual Embeddings for Ukrainian: A Large Language Model Approach to Word Sense Disambiguation.

🚀 Quick Start

📦 Installation

The model is easiest to use with sentence-transformers installed.

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('lang-uk/ukr-paraphrase-multilingual-mpnet-base')
embeddings = model.encode(sentences)
print(embeddings)

Advanced Usage

You can also use this model without sentence-transformers. First pass the input through the transformer model, then apply an appropriate pooling operation to the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('lang-uk/ukr-paraphrase-multilingual-mpnet-base')
model = AutoModel.from_pretrained('lang-uk/ukr-paraphrase-multilingual-mpnet-base')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, average pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
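To see why the attention mask matters in the pooling step, the sketch below repeats masked mean pooling for a single sentence in plain Python. It is a simplified illustration of the same idea as the `mean_pooling` function above, not a replacement for it. The name `masked_mean_pool` and the toy 2-dimensional token vectors are assumptions.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping padding positions."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:  # mask value 1 marks a real token, 0 marks padding
            for d in range(dim):
                sums[d] += vec[d]
    # Guard against division by zero, in the spirit of torch.clamp(min=1e-9).
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in sums]


# One sentence with two real tokens and one padding token (toy values).
token_embeddings = [
    [1.0, 2.0],
    [3.0, 4.0],
    [100.0, 100.0],  # padding position; its values must not leak into the result
]
attention_mask = [1, 1, 0]

pooled = masked_mean_pool(token_embeddings, attention_mask)
print(pooled)
```

Without the mask, the padding vector would pull the average far away from the two real tokens. The tensor version above applies the same rule to a whole batch at once.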

📚 Detailed Documentation

Citation and Authors

If you find this model useful, please cite our publication Contextual Embeddings for Ukrainian: A Large Language Model Approach to Word Sense Disambiguation.

@inproceedings{laba-etal-2023-contextual,
    title = "Contextual Embeddings for {U}krainian: A Large Language Model Approach to Word Sense Disambiguation",
    author = "Laba, Yurii  and
      Mudryi, Volodymyr  and
      Chaplynskyi, Dmytro  and
      Romanyshyn, Mariana  and
      Dobosevych, Oles",
    editor = "Romanyshyn, Mariana",
    booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.unlp-1.2",
    doi = "10.18653/v1/2023.unlp-1.2",
    pages = "11--19",
    abstract = "This research proposes a novel approach to the Word Sense Disambiguation (WSD) task in the Ukrainian language based on supervised fine-tuning of a pre-trained Large Language Model (LLM) on the dataset generated in an unsupervised way to obtain better contextual embeddings for words with multiple senses. The paper presents a method for generating a new dataset for WSD evaluation in the Ukrainian language based on the SUM dictionary. We developed a comprehensive framework that facilitates the generation of WSD evaluation datasets, enables the use of different prediction strategies, LLMs, and pooling strategies, and generates multiple performance reports. Our approach shows 77,9{\%} accuracy for lexical meaning prediction for homonyms.",
}

Copyright: Yurii Laba, Volodymyr Mudryi, Dmytro Chaplynskyi, Mariana Romanyshyn, Oles Dobosevych, Ukrainian Catholic University, lang-uk project, 2023

The original model used for fine-tuning was trained by sentence-transformers.