bge-base-financial-matryoshka open-source model: free to deploy for financial text analysis

Bge Base Financial Matryoshka

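Developed by philschmid

A sentence embedding model fine-tuned from BAAI/bge-base-en-v1.5 for financial text. It maps sentences and paragraphs to a 768-dimensional vector space.

Text embedding | English | Open-source license: Apache-2.0 | #financial-semantic-search #high-dimensional-vector-representation #multi-task-support

Downloads: 1,138

Release date: 6/3/2024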

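Model overview

Built on the sentence-transformers framework, this model suits natural language processing tasks such as semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering.

Model features

Financial-domain optimization

Fine-tuned on financial text to better capture finance-related semantics.

High-dimensional vector representation

Maps text to a 768-dimensional dense vector space that captures semantic information.

Multi-task support

Supports a range of NLP tasks, including semantic similarity, search, and classification.

Long-text processing

Supports sequences of up to 512 tokens, which suits paragraph-level text.

Model capabilities

Semantic textual similarity

Semantic search

Paraphrase mining

Text classification

Text clustering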

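Use cases

Financial information retrieval

Financial report lookup

Quickly retrieves key information from company financial reports.

On the `basline_768` evaluation, cosine MAP@100 reaches 0.7907.

Financial question answering

Builds financial question-answering systems on top of semantic matching.

On the `basline_768` evaluation, cosine accuracy@1 reaches 0.7086.

Financial text analysis

Key information extraction from financial reports

Automatically identifies and classifies key data points in financial reports.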

🚀 BGE base Financial Matryoshka

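This model was fine-tuned from BAAI/bge-base-en-v1.5 using the sentence-transformers framework. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.

🚀 Quick start

To use this model, follow the steps below.

Install the sentence-transformers library.

pip install -U sentence-transformers

Load the model and run inference.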

from sentence_transformers import SentenceTransformer

# Download the model from the Hugging Face Hub
model = SentenceTransformer("philschmid/bge-base-financial-matryoshka")
# Run inference
sentences = [
    "What was Gilead's total revenue in 2023?",
    'What was the total revenue for the year ended December 31, 2023?',
    'How much was the impairment related to the CAT loan receivable in 2023?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Compute pairwise similarity scores between the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
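
Once sentences are embedded, semantic search reduces to ranking corpus vectors by similarity to a query vector. The following is a minimal sketch of that ranking step, not part of the model card: the `top_k` helper and the toy 4-dimensional vectors are hypothetical stand-ins for real `model.encode(...)` output, and the code assumes the embeddings are already L2-normalized.

```python
import numpy as np

def top_k(query_emb: np.ndarray, corpus_emb: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k corpus rows most similar to the query.

    Assumes both inputs are L2-normalized, so the dot product equals
    the cosine similarity.
    """
    scores = corpus_emb @ query_emb      # one score per corpus row
    order = np.argsort(-scores)[:k]      # highest score first
    return order, scores[order]

# Hypothetical unit-length vectors standing in for model.encode(...) output.
corpus = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.0],
])
query = np.array([0.8, 0.6, 0.0, 0.0])

idx, sims = top_k(query, corpus, k=2)
print(idx.tolist())  # [2, 0]
```

With the real model, `corpus` would be the encoded passages and `query` the encoded question.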

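✨ Key features

Multi-task support: usable for a range of natural language processing tasks, including semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering.
High-dimensional vector representation: maps sentences and paragraphs to a 768-dimensional dense vector space that captures semantic information.
Cosine similarity: uses cosine similarity as the similarity measure, which makes semantic matching straightforward.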
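
To make the similarity measure concrete, here is a small sketch of a pairwise cosine similarity matrix in plain NumPy. The `cosine_similarity_matrix` helper and the 2-dimensional vectors are illustrative assumptions, not part of the library or the model card.

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)  # scale rows to unit length
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_n @ b_n.T                                  # dot product of unit vectors

# Toy vectors (hypothetical): the third points opposite to the first.
vecs = np.array([[3.0, 4.0], [4.0, 3.0], [-3.0, -4.0]])
sim = cosine_similarity_matrix(vecs, vecs)
print(sim.shape)  # (3, 3)
```

Each row is compared with itself on the diagonal, so the diagonal is 1; opposite vectors score -1.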

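📚 Documentation

Model details

Model description

Attribute	Details
Model type	Sentence Transformer
Base model	BAAI/bge-base-en-v1.5
Maximum sequence length	512 tokens
Output dimensionality	768 dimensions
Similarity function	Cosine similarity
Language	English
License	apache-2.0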

Model sources

Documentation: Sentence Transformers documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full model architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
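
The architecture above uses CLS-token pooling (`pooling_mode_cls_token: True`) followed by `Normalize()`. As a rough illustration of what those two stages do to the transformer output, here is a NumPy sketch; the `cls_pool_and_normalize` helper and the random hidden states are assumptions for demonstration, not the library's implementation.

```python
import numpy as np

def cls_pool_and_normalize(token_embeddings: np.ndarray) -> np.ndarray:
    """Take the first ([CLS]) token vector per sequence and L2-normalize it.

    token_embeddings has shape (batch, seq_len, hidden).
    """
    cls = token_embeddings[:, 0, :]                      # CLS pooling
    norms = np.linalg.norm(cls, axis=1, keepdims=True)   # per-row L2 norm
    return cls / norms                                   # unit-length sentence vectors

# Random stand-in for BertModel hidden states: 2 sequences, 5 tokens, 768 dims.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2, 5, 768))

sentence_emb = cls_pool_and_normalize(hidden_states)
print(sentence_emb.shape)  # (2, 768)
```

Because the output is unit-length, a dot product between two sentence vectors is already their cosine similarity.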

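Evaluation metrics

Information retrieval

The evaluation results below are reported per dataset name. The numeric suffix of each name (768, 512, 256, 128, 64) appears to denote the embedding dimension used for that run.

`basline_768` dataset

Evaluated with InformationRetrievalEvaluator.

Metric	Value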
cosine_accuracy@1	0.7086
cosine_accuracy@3	0.8514
cosine_accuracy@5	0.8843
cosine_accuracy@10	0.9271
cosine_precision@1	0.7086
cosine_precision@3	0.2838
cosine_precision@5	0.1769
cosine_precision@10	0.0927
cosine_recall@1	0.7086
cosine_recall@3	0.8514
cosine_recall@5	0.8843
cosine_recall@10	0.9271
cosine_ndcg@10	0.8215
cosine_mrr@10	0.7874
cosine_map@100	0.7907
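
In this table accuracy@k equals recall@k, and precision@k is close to accuracy@k divided by k (for example 0.8514 / 3 ≈ 0.2838). That pattern is what one would expect if every query has exactly one relevant passage; this is an inference from the numbers, not something the card states. Under that assumption, the metrics can be sketched as follows, where `retrieval_metrics` and the toy ranks are hypothetical.

```python
def retrieval_metrics(ranks, k):
    """Compute accuracy@k, precision@k, recall@k, and MRR@k.

    ranks holds, per query, the 1-based rank of its single relevant
    document, or None if that document was not retrieved at all.
    """
    n = len(ranks)
    hit_ranks = [r for r in ranks if r is not None and r <= k]
    hits = len(hit_ranks)
    return {
        "accuracy": hits / n,          # share of queries with a hit in the top k
        "precision": hits / (n * k),   # one relevant doc among k retrieved
        "recall": hits / n,            # one relevant doc per query, so same as accuracy
        "mrr": sum(1.0 / r for r in hit_ranks) / n,
    }

# Toy example: five queries, relevant doc found at ranks 1, 2, 1, 4, and never.
ranks = [1, 2, 1, 4, None]
m = retrieval_metrics(ranks, k=3)
print(m)
```

The helper covers only the single-relevant-document case; NDCG and MAP would need the full ranked lists.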

`basline_512` dataset

Evaluated with InformationRetrievalEvaluator.

Metric	Value
cosine_accuracy@1	0.7114
cosine_accuracy@3	0.85
cosine_accuracy@5	0.8829
cosine_accuracy@10	0.9229
cosine_precision@1	0.7114
cosine_precision@3	0.2833
cosine_precision@5	0.1766
cosine_precision@10	0.0923
cosine_recall@1	0.7114
cosine_recall@3	0.85
cosine_recall@5	0.8829
cosine_recall@10	0.9229
cosine_ndcg@10	0.8209
cosine_mrr@10	0.7879
cosine_map@100	0.7916

`basline_256` dataset

Evaluated with InformationRetrievalEvaluator.

Metric	Value
cosine_accuracy@1	0.7057
cosine_accuracy@3	0.8414
cosine_accuracy@5	0.88
cosine_accuracy@10	0.9229
cosine_precision@1	0.7057
cosine_precision@3	0.2805
cosine_precision@5	0.176
cosine_precision@10	0.0923
cosine_recall@1	0.7057
cosine_recall@3	0.8414
cosine_recall@5	0.88
cosine_recall@10	0.9229
cosine_ndcg@10	0.8162
cosine_mrr@10	0.7818
cosine_map@100	0.7854

`basline_128` dataset

Evaluated with InformationRetrievalEvaluator.

Metric	Value
cosine_accuracy@1	0.7029
cosine_accuracy@3	0.8343
cosine_accuracy@5	0.8743
cosine_accuracy@10	0.9171
cosine_precision@1	0.7029
cosine_precision@3	0.2781
cosine_precision@5	0.1749
cosine_precision@10	0.0917
cosine_recall@1	0.7029
cosine_recall@3	0.8343
cosine_recall@5	0.8743
cosine_recall@10	0.9171
cosine_ndcg@10	0.8109
cosine_mrr@10	0.7769
cosine_map@100	0.7803

`basline_64` dataset

Evaluated with InformationRetrievalEvaluator.

Metric	Value
cosine_accuracy@1	0.6729
cosine_accuracy@3	0.8171
cosine_accuracy@5	0.8614
cosine_accuracy@10	0.9014
cosine_precision@1	0.6729
cosine_precision@3	0.2724
cosine_precision@5	0.1723
cosine_precision@10	0.0901
cosine_recall@1	0.6729
cosine_recall@3	0.8171
cosine_recall@5	0.8614
cosine_recall@10	0.9014
cosine_ndcg@10	0.79
cosine_mrr@10	0.754
cosine_map@100	0.7582
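
Across these tables, cosine accuracy@1 moves from 0.7086 (768) to 0.6729 (64). Assuming the suffix is the embedding size, the smaller sizes correspond to Matryoshka-style truncation: keep the first `dim` components of each embedding and re-normalize. A minimal NumPy sketch, with `truncate_and_renormalize` and the random vectors as illustrative assumptions:

```python
import numpy as np

def truncate_and_renormalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and rescale to unit length."""
    truncated = embeddings[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

# Random unit vectors standing in for 768-dimensional model output.
rng = np.random.default_rng(42)
full = rng.normal(size=(3, 768))
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_and_renormalize(full, 256)
print(small.shape)  # (3, 256)
```

Re-normalizing matters because a truncated vector is no longer unit-length, so dot products would stop being cosine similarities. Recent sentence-transformers releases also accept a `truncate_dim` argument on `SentenceTransformer`, which may be a more convenient route than truncating by hand.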

Training details

Training dataset

Unnamed dataset

Size: 6,300 training samples
Columns: positive and anchor

Approximate statistics based on the first 1,000 samples:

	positive	anchor
Type	string	string
Details	min: 10 tokens; mean: 46.11 tokens; max: 289 tokens	min: 7 tokens; mean: 20.26 tokens; max: 43 tokens

Samples:

positive	anchor
`Fiscal 2023 total gross profit margin of 35.1% represents an increase of 1.7 percentage points as compared to the respective prior year period.`	`What was the total gross profit margin for Hewlett Packard Enterprise in fiscal 2023?`
`Noninterest expense increased to $65.8 billion in 2023, primarily due to higher investments in people and technology and higher FDIC expense, including $2.1 billion for the estimated special assessment amount arising from the closure of Silicon Valley Bank and Signature Bank.`	`What was the total noninterest expense for the company in 2023?`
`As of May 31, 2022, FedEx Office had approximately 12,000 employees.`