Bert-MLM_arXiv-MP-class_zbMath open-source model: compute the similarity of short mathematical texts for free

Bert MLM Arxiv MP Class Zbmath

Developed by math-similarity

This is a sentence-transformers model designed specifically for computing the similarity of short mathematical texts. It maps sentences and paragraphs to a 768-dimensional dense vector space.

Text Embedding

Transformers

#math text similarity #short text vectorization #academic paper matching

Downloads: 415

Release date: 5/18/2024

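Model Overview

The model is designed for text from the mathematical domain. It is particularly suited to computing the semantic similarity of short texts such as abstracts of mathematics papers and theorem statements, and it can be used for tasks such as clustering and semantic search.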

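Model Features

Specialized for mathematical text

Optimized for text from the mathematical domain; handles short texts that contain formulas and technical terminology.

High-dimensional semantic encoding

Maps text to a 768-dimensional dense vector space to capture deeper semantic relationships.

Compatible with sentence-transformers

Built on the sentence-transformers framework, so it integrates easily into existing NLP pipelines.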

Model Capabilities

Similarity computation for mathematical text

Semantic vector generation

Short-text clustering

Academic literature search

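Use Cases

Academic research

Similarity search for mathematics papers

Search a mathematical literature database for papers similar to a given abstract

Improve the precision of related-literature search

Theorem classification

Automatically classify theorems by the semantic similarity of their statements

Support the construction of mathematical knowledge bases

Educational technology

Problem similarity matching

Match similar mathematics problems within an educational platform

Support personalized learning recommendations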

🚀 Bert-MLM_arXiv-MP-class_zbMath

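This is a sentence-transformers model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering and semantic search. The model is specifically designed to compute the similarity of short mathematical texts.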

🚀 Quick Start

📦 Installation

Using this model is straightforward once sentence-transformers is installed:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer
sentences = ["In this paper we show how to compute the $\\Lambda_{\\alpha}$ norm, $\\alpha\\ge 0$, using the dyadic grid. This result is a consequence of the description of the Hardy spaces $H^p(R^N)$ in terms of dyadic and special atoms.",
             "We show that a determinant of Stirling cycle numbers counts unlabeled acyclic single-source automata. The proof involves a bijection from these automata to certain marked lattice paths and a sign-reversing involution to evaluate the determinant."]

model = SentenceTransformer('math-similarity/Bert-MLM_arXiv-MP-class_zbMath')
embeddings = model.encode(sentences)
print(embeddings)
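Since the model is meant for similarity computation, the usual next step is to compare two embeddings with cosine similarity. The following is a minimal, dependency-free sketch of that comparison. The three-dimensional vectors `emb_a`, `emb_b`, and `emb_c` are made-up stand-ins for rows of `embeddings`, not real model output, and the function name `cosine_similarity` is chosen here for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy stand-ins for embedding rows (assumption: real rows have 768 entries).
emb_a = [1.0, 2.0, 2.0]
emb_b = [2.0, 4.0, 4.0]   # same direction as emb_a
emb_c = [2.0, -1.0, 0.0]  # orthogonal to emb_a

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
print(sim_ab, sim_ac)
```

On real output you would pass `embeddings[0]` and `embeddings[1]` instead of the toy vectors. With sentence-transformers installed, `sentence_transformers.util.cos_sim` computes the same quantity directly on the arrays returned by `model.encode`.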

Advanced Usage

To use the model without sentence-transformers, first pass the input through the transformer model, then apply the appropriate pooling operation to the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["In this paper we show how to compute the $\\Lambda_{\\alpha}$ norm, $\\alpha\\ge 0$, using the dyadic grid. This result is a consequence of the description of the Hardy spaces $H^p(R^N)$ in terms of dyadic and special atoms.",
             "We show that a determinant of Stirling cycle numbers counts unlabeled acyclic single-source automata. The proof involves a bijection from these automata to certain marked lattice paths and a sign-reversing involution to evaluate the determinant."]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('math-similarity/Bert-MLM_arXiv-MP-class_zbMath')
model = AutoModel.from_pretrained('math-similarity/Bert-MLM_arXiv-MP-class_zbMath')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
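To see why `mean_pooling` multiplies by the attention mask, it can help to work through the same computation on a tiny hand-made example. The sketch below uses plain Python lists rather than tensors; `masked_mean` is an illustrative name, and the token vectors are invented values for one sentence with two real tokens and one padding position.

```python
def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    # Guard against division by zero for an all-padding row,
    # in the same spirit as torch.clamp(..., min=1e-9) in mean_pooling.
    divisor = max(count, 1e-9)
    return [t / divisor for t in total]


# Two real tokens followed by one padding position with arbitrary values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
naive = [sum(col) / len(tokens) for col in zip(*tokens)]
print(pooled)  # average over the two real tokens only
print(naive)   # unmasked average, distorted by the padding vector
```

The unmasked average is pulled toward the padding vector, which is why batches padded to a common length need the mask for correct sentence embeddings.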

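📚 Documentation

Intended Use

Our model is intended to be used as an encoder for sentences and short paragraphs of mathematical text. Given an input text, it outputs a vector that captures the semantic information. The sentence vector can be used for information retrieval, clustering, or sentence-similarity tasks. By default, input text longer than 256 word pieces is truncated.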
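For the information-retrieval use mentioned above, one common pattern is to normalize all vectors once and rank documents by dot product with the query. The sketch below illustrates that pattern under stated assumptions: `normalize` and `rank_by_similarity` are hypothetical helper names, and the two-dimensional vectors stand in for 768-dimensional embeddings of a query and a small corpus.

```python
import math


def normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)


def rank_by_similarity(query_vec, corpus_vecs, top_k=3):
    """Return (corpus index, cosine score) pairs, best match first."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(corpus_vecs):
        d = normalize(vec)
        # For unit vectors the dot product equals the cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy stand-ins for embeddings of three documents and one query.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.1]

hits = rank_by_similarity(query, corpus, top_k=2)
for idx, score in hits:
    print(idx, round(score, 3))
```

In practice the corpus vectors would come from encoding abstracts or titles once and storing them, so that only the query needs to be encoded at search time.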

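Training Procedure

Domain Adaptation

We use the domain-adapted math-similarity/Bert-MLM_arXiv model. See its model card for details of the domain-adaptation procedure.

Pooling

We add a mean-pooling layer on top of the domain-adapted model.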

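Fine-tuning

We fine-tune the model using a cosine-similarity objective. Formally, it computes the vectors u = model(sentence_A) and v = model(sentence_B) and measures the cosine similarity between the two. By default, it minimizes the following loss: ||input_label - cos_score_transformation(cosine_sim(u,v))||_2. We use MSE as the loss function.

We use title pairs from zbMath as the fine-tuning dataset and model semantic similarity with their MSC codes. Two titles are defined as similar if they share their primary MSC₅ and another secondary MSC₅. Otherwise, they are defined as semantically different. The training set contains 351,472 title pairs and the evaluation set contains 43,935 pairs. See the training notebook for details.

Unfortunately, we cannot include a dataset with titles because of licensing issues. However, we have created a dataset that contains only the respective zbMath identifiers (also called an) with their primary and secondary MSC classifications, but without titles. It is available as datasets/math-similarity/class-zbmath-identifier.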
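The labeling rule and the loss described above can be sketched in a few lines. This is an illustration under assumptions, not the training code: the record layout (one primary code plus a list of secondary codes), the label values 1.0 and 0.0, the sample MSC codes, and the helper names are all chosen here for the example, and `cos_score_transformation` is taken to be the identity.

```python
import math


def is_similar(primary_a, secondary_a, primary_b, secondary_b):
    """Rule from the text: same primary MSC code and at least one shared secondary code."""
    return primary_a == primary_b and bool(set(secondary_a) & set(secondary_b))


def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mse_cosine_loss(pairs, labels):
    """Mean squared error between labels and cosine similarities (identity transformation assumed)."""
    errors = [(label - cosine_sim(u, v)) ** 2 for (u, v), label in zip(pairs, labels)]
    return sum(errors) / len(errors)


# Illustrative MSC codes only; labels 1.0 / 0.0 are an assumption of this sketch.
label_pos = 1.0 if is_similar("05A15", ["05A19", "15A15"], "05A15", ["15A15", "68Q45"]) else 0.0
label_neg = 1.0 if is_similar("05A15", ["05A19"], "42B30", ["42B35"]) else 0.0

# Toy embedding pairs: identical direction, then orthogonal.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
loss = mse_cosine_loss(pairs, [label_pos, label_neg])
print(label_pos, label_neg, loss)
```

The loss is zero here because the toy embeddings already match their labels. Labeling the orthogonal pair as similar instead would give a squared error of 1.0 for that pair, which is the signal that pushes such embeddings closer together during fine-tuning.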