RhetoriBERT open-source model - free to use for analyzing the rhetorical functions of academic text (such as summarizing results)

RhetoriBERT

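Developed by KaiserML

This sentence-transformers model was fine-tuned from nomic-ai/nomic-embed-text-v1.5 on a scientific literature dataset and is designed specifically for analyzing the rhetorical functions of academic text, such as summarizing results or stating limitations.

Text Embedding

Safetensors

English

Open-source license: Apache-2.0

#academic-text-similarity #rhetorical-function-encoding #long-text-embedding

Downloads: 70

Release date: 1/24/2025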

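Model overview

Maps sentences from academic text into a 768-dimensional vector space and encodes them according to their rhetorical function. Suited to tasks such as functional text similarity, limitation analysis, and rhetorical function classification.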

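Model features

Long-text processing

Supports sequences of up to 8192 tokens, which suits long passages from academic literature

Rhetorical function encoding

Optimized specifically for the rhetorical functions of academic text (stating research aims, describing methods, and so on)

Multi-dimensional similarity

Trained with MatryoshkaLoss, so similarity can be computed at embedding sizes of 64, 128, 256, 384, and 768 dimensions

Efficient retrieval

Reaches an nDCG@10 of 94.15% on a scientific literature retrieval task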
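Because the model was trained with MatryoshkaLoss, a leading slice of each embedding can serve as a smaller embedding. The dependency-free sketch below illustrates that idea on toy vectors. The helper names `truncate_and_normalize` and `cosine` are illustrative, and the 8-dimensional vectors stand in for real 768-dimensional model outputs. When using the library itself, the `truncate_dim` argument of `SentenceTransformer` is the usual way to get the same effect.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 8-dimensional "embeddings" (hypothetical values, not model outputs).
a = [0.9, 0.1, 0.3, 0.2, 0.05, 0.01, 0.02, 0.03]
b = [0.8, 0.2, 0.4, 0.1, 0.50, 0.40, 0.30, 0.20]

full = cosine(a, b)
short = cosine(truncate_and_normalize(a, 4), truncate_and_normalize(b, 4))
print(f"full={full:.3f} truncated={short:.3f}")
```

Smaller slices trade some accuracy for less storage and faster comparison; how much accuracy is lost at each size is not reported in this card.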

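Model capabilities

Academic text embedding generation

Functional text similarity computation

Scientific literature retrieval

Rhetorical function classification

Academic text clustering analysis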

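Use cases

Academic research

Literature retrieval systems

Match related research literature based on rhetorical function

Achieves 90% accuracy@1 on the test set

Paper writing assistance

Identify reference sentences whose rhetorical function is similar to the text currently being written

Educational technology

Academic writing assessment

Analyze whether each part of a student paper covers the expected rhetorical functions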

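🚀 KaiserML/RhetoriBERT

This is a sentence-transformers model fine-tuned from nomic-ai/nomic-embed-text-v1.5 on the sci_gen_colbert_triplets dataset. It maps sentences from academic text into a 768-dimensional dense vector space according to their rhetorical function (for example, summarizing results or stating limitations) and can be used for functional text similarity, limitation analysis, rhetorical function classification, clustering, and similar tasks.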

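🚀 Quick start

Model details

Model description

Attribute	Details
Model type	Sentence Transformer
Base model	nomic-ai/nomic-embed-text-v1.5
Maximum sequence length	8192 tokens
Output dimensionality	768 dimensions
Similarity function	Cosine similarity
Training dataset	sci_gen_colbert_triplets
Language	en
License	apache-2.0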

Model sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full model architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NomicBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
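In the Pooling module above, only `pooling_mode_mean_tokens` is enabled, so a sentence embedding is the average of its token vectors. A rough sketch of masked mean pooling in plain Python follows. It assumes padding positions are marked with 0 in an attention mask, and the function name `mean_pool` is illustrative rather than part of the library.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions whose mask value is 0."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("one mask value is required per token")
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("at least one unmasked token is required")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Three real tokens followed by one padding token (toy 2-dimensional vectors).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]
print(mean_pool(tokens, mask))  # [3.0, 4.0]
```

The padding vector is ignored, so it does not distort the average.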

Usage

Direct usage (Sentence Transformers)

First, install the Sentence Transformers library.

pip install -U sentence-transformers

Then load the model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub.
# Assumption: the base model (NomicBertModel) relies on custom modeling code,
# so trust_remote_code=True is passed here; the original call omitted it.
model = SentenceTransformer("KaiserML/RhetoriBERT", trust_remote_code=True)

# Run inference: one rhetorical-function label followed by two candidate sentences
sentences = [
    'Surveys and interviews: Introducing excerpts from interview data',
    "Through surveys and interviews, multiliterate teachers expressed a shared belief in the importance of fostering students' ability to navigate multiple discourse communities.",
    'The authors employ a constructivist approach to learning, where students build knowledge through active engagement with multimedia texts and collaborative discussions.',
]
embeddings = model.encode(sentences)  # NumPy array, one row per sentence
print(embeddings.shape)
# (3, 768)

# Get the pairwise similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)  # torch.Tensor
print(similarities.shape)
# torch.Size([3, 3])
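The training queries are short rhetorical-function labels (for example `Previous research: Approaches taken`), which suggests one way to use the embeddings for classification: embed a set of candidate labels and assign each sentence the label with the highest cosine similarity. The sketch below shows only that selection step on toy vectors. The function `nearest_label` and the 3-dimensional vectors are hypothetical; in practice the vectors would come from `model.encode`.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_label(sentence_vec, label_vecs):
    """Return (label, score) for the label vector most similar to the sentence."""
    scored = [(label, cosine(sentence_vec, vec)) for label, vec in label_vecs.items()]
    return max(scored, key=lambda pair: pair[1])

# Toy label embeddings (made-up values standing in for model.encode output).
label_vecs = {
    "Previous research: Approaches taken": [0.9, 0.1, 0.0],
    "Surveys and interviews: Introducing excerpts from interview data": [0.1, 0.9, 0.2],
}
sentence_vec = [0.2, 0.8, 0.1]

label, score = nearest_label(sentence_vec, label_vecs)
print(label, round(score, 3))
```

A similarity threshold could be added so that sentences matching no label well are left unclassified; a suitable value would have to be chosen empirically.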

Evaluation

Metrics

Information retrieval

Dataset: SciGen-Eval-Set
Evaluated with InformationRetrievalEvaluator

Metric	Value
cosine_accuracy@1	0.9
cosine_accuracy@3	0.9452
cosine_accuracy@5	0.9642
cosine_accuracy@10	0.9853
cosine_precision@1	0.9
cosine_precision@3	0.3151
cosine_precision@5	0.1928
cosine_precision@10	0.0985
cosine_recall@1	0.9
cosine_recall@3	0.9452
cosine_recall@5	0.9642
cosine_recall@10	0.9853
cosine_ndcg@10	0.9415
cosine_mrr@10	0.9276
cosine_map@100	0.9284
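In the table, each precision value equals the matching accuracy value divided by k (0.9452 / 3 ≈ 0.3151, 0.9853 / 10 ≈ 0.0985), and recall@k equals accuracy@k. That pattern is what one would expect if every query has exactly one relevant document. Under that assumption, which is inferred from the numbers rather than stated in the source, the per-query metrics reduce to simple functions of the rank of the single relevant document. A sketch with hypothetical helper names:

```python
import math

def metrics_for_rank(rank, k):
    """Per-query metrics at cutoff k when exactly one document is relevant.

    `rank` is the 1-based position of the relevant document in the results,
    or None if it was not retrieved.
    """
    hit = rank is not None and rank <= k
    return {
        "accuracy": 1.0 if hit else 0.0,
        "precision": (1.0 / k) if hit else 0.0,
        "recall": 1.0 if hit else 0.0,
        "mrr": (1.0 / rank) if hit else 0.0,
        # With one relevant document the ideal DCG is 1,
        # so nDCG is 1 / log2(rank + 1).
        "ndcg": (1.0 / math.log2(rank + 1)) if hit else 0.0,
    }

def average(ranks, k):
    """Average each metric over all queries."""
    per_query = [metrics_for_rank(rank, k) for rank in ranks]
    return {
        name: sum(m[name] for m in per_query) / len(per_query)
        for name in per_query[0]
    }

# Ranks of the relevant document for four toy queries (None = not retrieved).
print(average([1, 1, 2, None], k=3))
```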

Training details

Training dataset

sci_gen_colbert_triplets

Dataset: sci_gen_colbert_triplets at 44071bd
Size: 35,934 training samples
Columns: query, positive, negative

Approximate statistics based on the first 1000 samples:

	query	positive	negative
Type	string	string	string
Details	min: 5 tokens, mean: 10.24 tokens, max: 23 tokens	min: 2 tokens, mean: 39.86 tokens, max: 80 tokens	min: 18 tokens, mean: 40.41 tokens, max: 88 tokens

Samples:

query	positive	negative
`Previous research: highlighting negative outcomes`	`Despite the widespread use of seniority-based wage systems in labor contracts, previous research has highlighted their negative outcomes, such as inefficiencies and demotivating effects on workers.`	`This paper, published in 1974, was among the first to establish the importance of rank-order tournaments as optimal labor contracts in microeconomics.`
`Synthesising sources: contrasting evidence or ideas`	`Despite the observed chronic enterocolitis in Interleukin-10-deficient mice, some studies suggest that this cytokine plays a protective role in intestinal inflammation in humans (Kurimoto et al., 2001).`	`Chronic enterocolitis developed in Interleukin-10-deficient mice, characterized by inflammatory cell infiltration, epithelial damage, and increased production of pro-inflammatory cytokines.`
`Previous research: Approaches taken`	`Previous research on measuring patient-relevant outcomes in osteoarthritis has primarily relied on self-reported measures, such as the Western Ontario and McMaster Universities Arthritis Index (WOMAC) (Bellamy et al., 1988).`	`The WOMAC (Western Ontario and McMaster Universities Osteoarthritis Index) questionnaire has been widely used in physical therapy research to assess the impact of antirheumatic drug therapy on patient-reported outcomes in individuals with hip or knee osteoarthritis.`

Loss function: MatryoshkaLoss with the following parameters:

{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [
        768,
        384,
        256,
        128,
        64
    ],
    "matryoshka_weights": [
        1,
        1,
        1,
        1,
        1
    ],
    "n_dims_per_step": -1
}
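Read together, these parameters say that MultipleNegativesRankingLoss is evaluated at each of the five embedding sizes and the results are summed with equal weights. The plain-Python sketch below illustrates that computation on toy vectors. It is not the library's implementation: the function names are made up, the similarity scale of 20 is an assumption, and only (query, positive) pairs are used for brevity, whereas the training data also carries an explicit negative per row.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def ranking_loss(queries, positives, scale=20.0):
    """In-batch negatives: each query should score its own positive highest."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, pos) for pos in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_sum_exp - logits[i]  # cross-entropy with target index i
    return total / len(queries)

def matryoshka_loss(queries, positives, dims, weights):
    """Weighted sum of the ranking loss over truncated embeddings."""
    return sum(
        weight * ranking_loss([q[:dim] for q in queries],
                              [p[:dim] for p in positives])
        for dim, weight in zip(dims, weights)
    )

# Toy 4-dimensional batch; dims=[4, 2] plays the role of [768, ..., 64].
queries = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
positives = [[0.9, 0.1, 0.3, 0.0], [0.1, 0.8, 0.0, 0.4]]
loss = matryoshka_loss(queries, positives, dims=[4, 2], weights=[1, 1])
print(loss)
```

Because every truncated size contributes to the loss, the leading dimensions are pushed to carry most of the useful signal, which is what makes shortened embeddings usable at inference time.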

Evaluation dataset

sci_gen_colbert_triplets

Dataset: sci_gen_colbert_triplets at 44071bd
Size: 4,492 evaluation samples
Columns: query, positive, negative

Approximate statistics based on the first 1000 samples:

	query	positive	negative
Type	string	string	string
Details	min: 5 tokens, mean: 10.23 tokens, max: 23 tokens	min: 18 tokens, mean: 39.83 tokens, max: 84 tokens	min: 8 tokens, mean: 39.89 tokens, max: 84 tokens

Samples:

query	positive	negative
`Providing background information: reference to the purpose of the study`	`This study aimed to investigate the impact of socioeconomic status on child development, specifically focusing on cognitive, language, and social-emotional domains.`	`Children from high socioeconomic status families showed significantly higher IQ scores (M = 112.5, SD = 5.6) compared to children from low socioeconomic status families (M = 104.3, SD = 6.2) in the verbal IQ subtest.`
`Providing background information: reference to the literature`	`According to previous studies using WinGX suite for small-molecule single-crystal crystallography, the optimization of crystal structures leads to improved accuracy in determining atomic coordinates.`	`This paper describes the WinGX suite, a powerful tool for small-molecule single-crystal crystallography that significantly advances the field of crystallography by streamlining data collection and analysis.`
`General comments on the relevant literature`	`Polymer brushes have gained significant attention in the field of polymer science due to their unique properties, such as controlled thickness, high surface density, and tunable interfacial properties.`	`Despite previous reports suggesting that polymer brushes with short grafting densities exhibit poorer performance in terms of adhesion and stability compared to those with higher grafting densities (Liu et al., 2010), our results indicate that the oppos`