financial-rag-matryoshka: an open-source embedding model tailored to financial document retrieval

Financial Rag Matryoshka

Developed by rbhatia46

A sentence-transformer model for finance, fine-tuned from Alibaba-NLP/gte-large-en-v1.5 and focused on financial document retrieval

Text Embedding

Safetensors

Language: English | License: Apache-2.0 | Tags: financial document retrieval, high-dimensional semantic embeddings, long-text processing

Downloads: 17.08k

Released: 7/8/2024

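Model Overview

The model maps sentences and paragraphs to a 1024-dimensional dense vector space. The embeddings can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering, with fine-tuning aimed at financial text.

Model Features

Financial-domain tuning

Fine-tuned for financial document retrieval while retaining general-purpose performance

High-dimensional vector space

Maps text to 1024-dimensional dense vectors that capture fine-grained semantic information

Long-text support

Supports sequences of up to 8192 tokens, which suits long documents

Matryoshka loss

Trained with MatryoshkaLoss wrapping MultipleNegativesRankingLoss, so the embeddings are also trained at 768, 512, 256, 128, and 64 dimensions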

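Model Capabilities

Semantic textual similarity

Semantic search

Paraphrase mining

Text classification

Text clustering

Financial document retrieval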

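Use Cases

Financial information retrieval

Financial institution report retrieval

Quickly locate key information in reports from financial institutions

In the retrieval evaluation reported below, the 1024-dimensional embeddings reach a Cosine NDCG@10 of 0.9427

Financial question answering

Build financial question-answering systems based on semantic matching

In the same evaluation, the 1024-dimensional embeddings reach a Cosine Accuracy@1 of 0.88

General text processing

Document similarity

Compute the semantic similarity between documents

Text clustering

Automatically group and categorize large collections of text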

🚀 financial-rag-matryoshka

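This model is a fine-tune of Alibaba-NLP/gte-large-en-v1.5 for financial use cases. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks. It is tuned for financial document retrieval while retaining general-purpose performance.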

🚀 Quick Start

Direct usage (Sentence Transformers)

First, install the Sentence Transformers library:

pip install -U sentence-transformers

Then load the model and run inference:

from sentence_transformers import SentenceTransformer

# Download from the Hugging Face Hub.
# trust_remote_code=True is likely required here, because the base model
# ships custom modeling code (the NewModel class shown in the architecture).
model = SentenceTransformer(
    "rbhatia46/gte-large-en-v1.5-financial-rag-matryoshka",
    trust_remote_code=True,
)
# Run inference
sentences = [
    'JP Morgan reported total deposits of $2.6 trillion in the year ending December 31, 2023.',
    "What were JP Morgan's total deposits in 2023?",
    'What is the primary source of revenue for the software company, Microsoft?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the pairwise similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
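For retrieval, the similarity scores are typically used to rank candidate passages against a query. The following is a minimal sketch of that ranking step in NumPy, using toy 3-dimensional vectors in place of real 1024-dimensional embeddings. The helper name `rank_by_cosine` and the toy data are illustrative assumptions, not part of the model card.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs):
    """Return document indices from most to least similar, plus their scores."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    # Normalize so that the dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores, kind="stable")
    return order, scores[order]

# Toy vectors standing in for model.encode(...) outputs (hypothetical data).
toy_query = [1.0, 0.0, 0.0]
toy_docs = [[0.0, 1.0, 0.0], [2.0, 0.1, 0.0], [1.0, 1.0, 0.0]]
order, scores = rank_by_cosine(toy_query, toy_docs)
print(order.tolist())  # [1, 2, 0]
```

With the real model, the query vector and the document matrix would come from `model.encode`, and the top-ranked indices would point at the passages to hand to a downstream RAG step.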

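✨ Key Features

Multi-task: usable for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.
Financial-domain tuning: fine-tuned on financial use cases and evaluated on financial document retrieval.
1024-dimensional embeddings: maps sentences and paragraphs to a 1024-dimensional dense vector space.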

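📚 Documentation

Model Details

Model Description

Property	Details
Model type	Sentence Transformer
Base model	Alibaba-NLP/gte-large-en-v1.5
Maximum sequence length	8192 tokens
Output dimensionality	1024 dimensions
Similarity function	Cosine similarity
Language	English
License	apache-2.0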

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
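The Pooling module has `pooling_mode_cls_token` set to True, so the sentence embedding is the hidden state of the first token rather than an average over all tokens. A minimal NumPy sketch of that selection, on a toy array shaped (batch, tokens, hidden), is shown here. The helper `cls_pool` is a hypothetical name used only for illustration.

```python
import numpy as np

def cls_pool(token_embeddings):
    """Return the first-token ([CLS]) vector of every sequence in the batch."""
    token_embeddings = np.asarray(token_embeddings)
    return token_embeddings[:, 0, :]

# Toy batch: 2 sequences, 4 tokens each, hidden size 3 (the real model uses 1024).
batch = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
pooled = cls_pool(batch)
print(pooled.shape)  # (2, 3)
```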

Evaluation

Information Retrieval Metrics

Evaluation metrics per evaluator. The names dim_1024 through dim_64 match the Matryoshka dimensions used in training:

Dataset	Cosine Accuracy@1	Cosine Accuracy@3	Cosine Accuracy@5	Cosine Accuracy@10	Cosine Precision@1	Cosine Precision@3	Cosine Precision@5	Cosine Precision@10	Cosine Recall@1	Cosine Recall@3	Cosine Recall@5	Cosine Recall@10	Cosine NDCG@10	Cosine MRR@10	Cosine MAP@100
dim_1024	0.88	0.96	0.9867	0.9956	0.88	0.32	0.1973	0.0996	0.88	0.96	0.9867	0.9956	0.9427	0.9252	0.9254
dim_768	0.88	0.96	0.9867	0.9911	0.88	0.32	0.1973	0.0991	0.88	0.96	0.9867	0.9911	0.9408	0.924	0.9245
dim_512	0.8711	0.96	0.9867	0.9911	0.8711	0.32	0.1973	0.0991	0.8711	0.96	0.9867	0.9911	0.9381	0.9203	0.9207
dim_256	0.8756	0.96	0.9867	0.9911	0.8756	0.32	0.1973	0.0991	0.8756	0.96	0.9867	0.9911	0.9396	0.9223	0.9228
dim_128	0.8667	0.9556	0.9867	0.9911	0.8667	0.3185	0.1973	0.0991	0.8667	0.9556	0.9867	0.9911	0.9346	0.9157	0.916
dim_64	0.8311	0.96	0.9733	0.9911	0.8311	0.32	0.1947	0.0991	0.8311	0.96	0.9733	0.9911	0.9208	0.8972	0.8975
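The numbers in the table are consistent with every query having exactly one relevant document: Recall@k equals Accuracy@k, and Precision@k equals Accuracy@k divided by k (for example, 0.96 / 3 = 0.32). Under that assumption, which is inferred from the figures rather than stated in the source, the metrics can be sketched from the rank at which the relevant document appears. The function `ir_metrics` and the toy ranks are hypothetical.

```python
import math

def ir_metrics(first_relevant_ranks, k):
    """Metrics at cutoff k, assuming exactly one relevant document per query.

    `first_relevant_ranks` holds the 1-based rank of the relevant document
    for each query, or None when it was not retrieved at all.
    """
    n = len(first_relevant_ranks)
    hits = [r for r in first_relevant_ranks if r is not None and r <= k]
    accuracy = len(hits) / n
    return {
        "accuracy": accuracy,
        "precision": accuracy / k,  # at most one of the k results is relevant
        "recall": accuracy,         # the single relevant document is found or not
        "mrr": sum(1.0 / r for r in hits) / n,
        "ndcg": sum(1.0 / math.log2(r + 1) for r in hits) / n,  # ideal DCG is 1
    }

# Toy example: five queries, relevant document found at these ranks.
toy_ranks = [1, 1, 2, 4, None]
m = ir_metrics(toy_ranks, k=3)
print(m)
```

In this toy run, three of five queries have their relevant document in the top 3, so Accuracy@3 and Recall@3 are 0.6 and Precision@3 is 0.2.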

Training Details

Training Dataset

Unnamed dataset

Size: 4,275 training samples
Columns: positive and anchor
Approximate statistics (based on the first 1,000 samples):

	positive	anchor
Type	string	string
Details	min: 15 tokens, mean: 44.74 tokens, max: 114 tokens	min: 9 tokens, mean: 18.12 tokens, max: 32 tokens

Samples:

positive	anchor
`At the end of fiscal year 2023, Exxon Mobil reported a debt-to-equity ratio of 0.32, implying that the company used more equity than debt in its capital structure.`	`What was the debt-to-equity ratio for Exxon Mobil at the end of fiscal year 2023?`
`Amazon Web Services (AWS) generated $12.7 billion in net sales in the fourth quarter of 2020, up 28% from the same period of the previous year. It accounted for about 10% of Amazon’s total net sales for the quarter.`	`How did Amazon's AWS segment perform in the fourth quarter of 2020?`
`JPMorgan Chase generates revenues by providing a wide range of banking and financial services. These include investment banking (M&As, advisory), consumer and community banking (home mortgages, auto loans), commercial banking, and asset and wealth management.`	`What are the key revenue sources for JPMorgan Chase?`

Loss: MatryoshkaLoss with these parameters:

{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [
        1024,
        768,
        512,
        256,
        128,
        64
    ],
    "matryoshka_weights": [
        1,
        1,
        1,
        1,
        1,
        1
    ],
    "n_dims_per_step": -1
}
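To make the configuration concrete, here is a rough NumPy sketch of the idea: an in-batch-negatives ranking loss is computed on embeddings truncated to each listed dimension, and the per-dimension losses are combined with the given weights. This is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, the similarity scale of 20.0 is assumed, and the random data stands in for real anchor and positive embeddings.

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities with in-batch negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)  # row i should score column i highest
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the inner loss over each truncated embedding size."""
    return sum(
        w * mnr_loss(anchors[:, :d], positives[:, :d])
        for d, w in zip(dims, weights)
    )

# Hypothetical batch of 8 anchor/positive pairs with 1024-dimensional embeddings.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 1024))
positives = anchors + 0.1 * rng.normal(size=(8, 1024))

dims = [1024, 768, 512, 256, 128, 64]
weights = [1, 1, 1, 1, 1, 1]
total = matryoshka_loss(anchors, positives, dims, weights)
print(total)
```

Because every dimension contributes to the loss with equal weight, the leading components of the vector are pushed to carry most of the retrieval signal, which is what allows truncated embeddings to stay close to full-size quality in the evaluation table.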

Training Hyperparameters

Non-default hyperparameters

eval_strategy: epoch
per_device_train_batch_size: 32
per_device_eval_batch_size: 16
gradient_accumulation_steps: 16
learning_rate: 2e-05
num_train_epochs: 10
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: True
tf32: True
load_best_model_at_end: True
optim: adamw_torch_fused
batch_sampler: no_duplicates
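These settings determine the effective batch size and how optimizer steps map to epochs. The short calculation below works this through, assuming training ran on a single device (the source does not say). Under that assumption the result reproduces the fractional values in the Epoch column of the training log.

```python
import math

train_samples = 4275
per_device_train_batch_size = 32
gradient_accumulation_steps = 16
num_devices = 1  # assumption: the device count is not stated in the source

# Samples contributing to one optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Mini-batches per epoch, then optimizer steps per epoch (fractional).
batches_per_epoch = math.ceil(train_samples / (per_device_train_batch_size * num_devices))
steps_per_epoch = batches_per_epoch / gradient_accumulation_steps

def epoch_at_step(step):
    """Fractional epoch reached after `step` optimizer steps."""
    return step / steps_per_epoch

print(effective_batch)               # 512
print(batches_per_epoch)             # 134
print(round(epoch_at_step(8), 4))    # 0.9552
print(round(epoch_at_step(80), 4))   # 9.5522
```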

Training Logs

Epoch	Step	Training Loss	dim_1024_cosine_map@100	dim_128_cosine_map@100	dim_256_cosine_map@100	dim_512_cosine_map@100	dim_64_cosine_map@100	dim_768_cosine_map@100
0.9552	8	-	0.9090	0.8848	0.8992	0.9052	0.8775	0.9030
1.1940	10	0.4749	-	-	-	-	-	-
1.9104	16	-	0.9170	0.9095	0.9109	0.9201	0.8961	0.9212
2.3881	20	0.0862	-	-	-	-	-	-
2.9851	25	-	0.9190	0.9071	0.9160	0.9278	0.8998	0.9234
3.5821	30	0.0315	-	-	-	-	-	-
3.9403	33	-	0.9183	0.9053	0.9122	0.9287	0.8998	0.9183
4.7761	40	0.0184	-	-	-	-	-	-
4.8955	41	-	0.9225	0.9125	0.9164	0.9260	0.8985	0.9220
5.9701	50	0.0135	0.9268	0.9132	0.9208	0.9257	0.8961	0.9271
6.9254	58	-	0.9254	0.9158	0.9202	0.9212	0.8938	0.9213
7.1642	60	0.0123	-	-	-	-	-	-
8.0	67	-	0.9253	0.916	0.9228	0.9207	0.8972	0.9243
8.3582	70	0.01	-	-	-	-	-	-
8.9552	75	-	0.9254	0.9160	0.9213	0.9207	0.9005	0.9245
9.5522	80	0.0088	0.9254	0.9160	0.9228	0.9207	0.8975	0.9245

Note: the bold row denotes the saved checkpoint.

Framework Versions

Python: 3.10.6
Sentence Transformers: 3.0.1
Transformers: 4.41.2
PyTorch: 2.1.2+cu121
Accelerate: 0.32.1
Datasets: 2.19.1
Tokenizers: 0.19.1

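🔧 Technical Details

The model is fine-tuned from Alibaba-NLP/gte-large-en-v1.5 using MatryoshkaLoss wrapped around MultipleNegativesRankingLoss. Training applies the ranking loss to the embedding at 1024, 768, 512, 256, 128, and 64 dimensions, so that shorter prefixes of the vector remain useful for financial retrieval. Optimization used a cosine learning-rate schedule with a 10% warmup, a learning rate of 2e-05, and a per-device batch size of 32 with 16 gradient accumulation steps over 10 epochs.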
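Because the loss is applied at several embedding sizes, a stored 1024-dimensional vector can be shortened by keeping its leading components and rescaling to unit length before computing cosine similarity. The sketch below shows that step in NumPy. The helper name `truncate_embeddings` and the random stand-in data are assumptions for illustration.

```python
import numpy as np

def truncate_embeddings(embeddings, dim):
    """Keep the first `dim` components of each row and rescale rows to unit length."""
    emb = np.asarray(embeddings, dtype=float)
    if not 0 < dim <= emb.shape[1]:
        raise ValueError(f"dim must be between 1 and {emb.shape[1]}")
    cut = emb[:, :dim]
    norms = np.linalg.norm(cut, axis=1, keepdims=True)
    return cut / np.clip(norms, 1e-12, None)

# Random vectors standing in for model.encode(...) outputs (hypothetical data).
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 1024))
small = truncate_embeddings(full, 256)
print(small.shape)  # (3, 256)
```

Sentence Transformers also accepts a `truncate_dim` argument when constructing a `SentenceTransformer`, which may be more convenient than truncating by hand. The evaluation table suggests the cost is small: MAP@100 goes from 0.9254 at 1024 dimensions to 0.9228 at 256.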

📄 License

The model is released under the apache-2.0 license.

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning}, 
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply}, 
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}