financial-rag-matryoshka: an open-source financial embedding model built for financial document retrieval

Financial Rag Matryoshka

Developed by rbhatia46

A finance-specific sentence-transformer model fine-tuned from Alibaba-NLP/gte-large-en-v1.5, focused on financial document retrieval

Text embedding

Safetensors

License: Apache-2.0 #financial-document-retrieval #high-dimensional-semantic-embeddings #long-text-processing

Downloads: 17.08k

Released: 7/8/2024

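Model overview

This model maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks, with fine-tuning aimed at the financial domain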

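Model features

Financial-domain optimization

Fine-tuned for financial document retrieval while retaining general-purpose performance

High-dimensional vector space

Maps text to a 1024-dimensional dense vector space that captures rich semantic information

Long-text processing

Supports sequences of up to 8192 tokens, making it suitable for long documents

Matryoshka loss

Trained with MatryoshkaLoss wrapped around MultipleNegativesRankingLoss, so embeddings are also trained at 768, 512, 256, 128, and 64 dimensions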

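Model capabilities

Semantic textual similarity

Semantic search

Paraphrase mining

Text classification

Text clustering

Financial document retrieval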

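Use cases

Financial information retrieval

Financial institution report retrieval

Quickly retrieve key information from financial institutions' reports

Performs well on financial document retrieval (Cosine Accuracy@1 of 0.88 at 1024 dimensions in the evaluation reported on this page)

Financial question answering

Build financial question-answering systems based on semantic matching

Accurate semantic matching between questions and candidate passages

General text processing

Document similarity

Compute the semantic similarity between documents

Text clustering

Automatically group large volumes of text into clusters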

🚀 financial-rag-matryoshka

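This model is fine-tuned from Alibaba-NLP/gte-large-en-v1.5 for financial use cases. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. It performs well on financial document retrieval while retaining strong general-purpose performance.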

🚀 Quick start

Direct usage (Sentence Transformers)

First, install the Sentence Transformers library:

pip install -U sentence-transformers

Then load the model and run inference:

from sentence_transformers import SentenceTransformer

# Download from the Hugging Face Hub.
# The architecture listed on this page is a custom one (NewModel); if loading
# fails, passing trust_remote_code=True to SentenceTransformer may be needed.
model = SentenceTransformer("rbhatia46/gte-large-en-v1.5-financial-rag-matryoshka")
# Run inference
sentences = [
    'JP Morgan reported total deposits of $2.6 trillion in the year ending December 31, 2023.',
    "What were JP Morgan's total deposits in 2023?",
    'What is the primary source of revenue for the software company, Microsoft?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the pairwise similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
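
For retrieval, similarity scores like these are typically used to rank candidate passages against a query. The sketch below shows that ranking step in plain Python on tiny made-up vectors standing in for real embeddings; `cosine` and `rank_passages` are hypothetical helper names, not part of Sentence Transformers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_passages(query_vec, passage_vecs, top_k=3):
    """Return (index, score) pairs for the top_k passages, best first."""
    scored = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 4-dimensional vectors (assumed values, for illustration only).
query = [0.9, 0.1, 0.0, 0.2]
passages = [
    [0.1, 0.8, 0.3, 0.0],   # points in a different direction
    [0.8, 0.2, 0.1, 0.3],   # close to the query
    [0.0, 0.0, 1.0, 0.0],   # orthogonal to the query
]
ranking = rank_passages(query, passages, top_k=2)
print(ranking[0][0])  # index of the best-matching passage
```

With real embeddings, `query` would be the encoded question and `passages` the encoded document chunks.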

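✨ Key features

Multi-task: usable for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Finance-optimized: fine-tuned for financial use cases and performs well on financial document retrieval.
High-dimensional embeddings: maps sentences and paragraphs to a 1024-dimensional dense vector space.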
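
Because the model was trained with Matryoshka dimensions of 1024, 768, 512, 256, 128, and 64, an embedding can be shortened by keeping only its leading components and re-normalizing. Below is a minimal sketch of that step, assuming plain Python lists in place of real model output; `truncate_and_normalize` is a hypothetical helper. Sentence Transformers also accepts a `truncate_dim` argument when constructing `SentenceTransformer`, which performs the truncation at encode time.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

# Toy 8-dimensional vector (assumed values); a real embedding has 1024.
full = [3.0, 4.0, 12.0, 0.5, -1.0, 2.0, 0.0, 1.0]
short = truncate_and_normalize(full, 2)
print(short)  # [0.6, 0.8]
```

Smaller dimensions trade some retrieval quality for lower storage and faster search, as the evaluation table on this page indicates.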

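📚 Detailed documentation

Model details

Model description

Property	Details
Model type	Sentence Transformer
Base model	Alibaba-NLP/gte-large-en-v1.5
Maximum sequence length	8192 tokens
Output dimensionality	1024 dimensions
Similarity function	Cosine similarity
Language	English
License	apache-2.0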

Model sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full model architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
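
The Pooling module above has `pooling_mode_cls_token` set to True, which means the sentence embedding is the hidden state of the first token rather than an average over all tokens. A toy sketch contrasting the two, with made-up token vectors and hypothetical helper names:

```python
def cls_pool(token_embeddings):
    """CLS pooling: use the first token's vector as the sentence embedding."""
    return list(token_embeddings[0])

def mean_pool(token_embeddings):
    """Mean pooling, shown only for contrast (disabled in this model).

    This simplified version ignores padding masks.
    """
    n = len(token_embeddings)
    return [sum(col) / n for col in zip(*token_embeddings)]

# Three tokens with 2-dimensional hidden states (assumed toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(cls_pool(tokens))   # [1.0, 2.0]
print(mean_pool(tokens))  # [3.0, 4.0]
```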

Evaluation

Information retrieval metrics

Metrics for each evaluation set; the set names (dim_1024 through dim_64) give the embedding dimension evaluated:

Dataset	Cosine Accuracy@1	Cosine Accuracy@3	Cosine Accuracy@5	Cosine Accuracy@10	Cosine Precision@1	Cosine Precision@3	Cosine Precision@5	Cosine Precision@10	Cosine Recall@1	Cosine Recall@3	Cosine Recall@5	Cosine Recall@10	Cosine NDCG@10	Cosine MRR@10	Cosine MAP@100
dim_1024	0.88	0.96	0.9867	0.9956	0.88	0.32	0.1973	0.0996	0.88	0.96	0.9867	0.9956	0.9427	0.9252	0.9254
dim_768	0.88	0.96	0.9867	0.9911	0.88	0.32	0.1973	0.0991	0.88	0.96	0.9867	0.9911	0.9408	0.924	0.9245
dim_512	0.8711	0.96	0.9867	0.9911	0.8711	0.32	0.1973	0.0991	0.8711	0.96	0.9867	0.9911	0.9381	0.9203	0.9207
dim_256	0.8756	0.96	0.9867	0.9911	0.8756	0.32	0.1973	0.0991	0.8756	0.96	0.9867	0.9911	0.9396	0.9223	0.9228
dim_128	0.8667	0.9556	0.9867	0.9911	0.8667	0.3185	0.1973	0.0991	0.8667	0.9556	0.9867	0.9911	0.9346	0.9157	0.916
dim_64	0.8311	0.96	0.9733	0.9911	0.8311	0.32	0.1947	0.0991	0.8311	0.96	0.9733	0.9911	0.9208	0.8972	0.8975
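
Each query in this evaluation appears to have a single relevant passage, which would explain why Recall@k equals Accuracy@k and Precision@k equals Accuracy@k divided by k in the table (for example, 0.96 / 3 = 0.32). The sketch below computes these quantities from the rank of the relevant passage. The single-relevant-passage setup is an assumption inferred from the numbers, and `metrics_at_k` is a hypothetical helper.

```python
import math

def metrics_at_k(ranks, k):
    """IR metrics when every query has exactly one relevant passage.

    `ranks` holds the 1-based rank of the relevant passage per query,
    or None when it was not retrieved at all.
    """
    n = len(ranks)
    hits = [r for r in ranks if r is not None and r <= k]
    accuracy = len(hits) / n              # relevant passage is in the top k
    precision = accuracy / k              # one relevant item among k slots
    recall = accuracy                     # there is only one item to recall
    mrr = sum(1.0 / r for r in hits) / n
    ndcg = sum(1.0 / math.log2(r + 1) for r in hits) / n  # ideal DCG is 1
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "mrr": mrr, "ndcg": ndcg}

# Four toy queries: relevant passage ranked 1st, 1st, 2nd, and 4th.
result = metrics_at_k([1, 1, 2, 4], k=3)
print(result["accuracy"], result["precision"])  # 0.75 0.25
```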

Training details

Training dataset

Unnamed dataset

Size: 4275 training samples
Columns: positive and anchor
Approximate statistics (based on the first 1000 samples):

	positive	anchor
Type	string	string
Details	min: 15 tokens, mean: 44.74 tokens, max: 114 tokens	min: 9 tokens, mean: 18.12 tokens, max: 32 tokens

Samples:

positive	anchor
`At the end of fiscal year 2023, Exxon Mobil reported a debt-to-equity ratio of 0.32, implying that the company used more equity than debt in its capital structure.`	`What was the debt-to-equity ratio for Exxon Mobil at the end of fiscal year 2023?`
`Amazon Web Services (AWS) generated $12.7 billion in net sales in the fourth quarter of 2020, up 28% from the same period of the previous year. It accounted for about 10% of Amazon’s total net sales for the quarter.`	`How did Amazon's AWS segment perform in the fourth quarter of 2020?`
`JPMorgan Chase generates revenues by providing a wide range of banking and financial services. These include investment banking (M&As, advisory), consumer and community banking (home mortgages, auto loans), commercial banking, and asset and wealth management.`	`What are the key revenue sources for JPMorgan Chase?`

Loss: MatryoshkaLoss with these parameters:

{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [
        1024,
        768,
        512,
        256,
        128,
        64
    ],
    "matryoshka_weights": [
        1,
        1,
        1,
        1,
        1,
        1
    ],
    "n_dims_per_step": -1
}
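
Conceptually, this configuration evaluates MultipleNegativesRankingLoss once per Matryoshka dimension, on embeddings truncated to that dimension, and adds up the weighted results. Below is a simplified pure-Python sketch of that idea, not the Sentence Transformers implementation. The scale factor of 20, the toy embeddings, and the function names are assumptions made for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch negatives: anchor i should score highest against positive i."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum - logits[i]  # cross-entropy with target index i
    return total / len(anchors)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the base loss over truncated embeddings."""
    loss = 0.0
    for dim, weight in zip(dims, weights):
        a = [v[:dim] for v in anchors]
        p = [v[:dim] for v in positives]
        loss += weight * mnr_loss(a, p)
    return loss

# Toy 4-dimensional batch of two (anchor, positive) pairs.
anchors = [[1.0, 0.1, 0.0, 0.2], [0.0, 1.0, 0.3, 0.1]]
positives = [[0.9, 0.0, 0.1, 0.1], [0.1, 0.9, 0.2, 0.0]]
aligned = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
swapped = matryoshka_loss(anchors, positives[::-1], dims=[4, 2], weights=[1, 1])
print(aligned < swapped)  # True
```

In Sentence Transformers itself, the equivalent setup wraps a `MultipleNegativesRankingLoss` instance in `MatryoshkaLoss` with the `matryoshka_dims` listed above.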

Training hyperparameters

Non-default hyperparameters

eval_strategy: epoch
per_device_train_batch_size: 32
per_device_eval_batch_size: 16
gradient_accumulation_steps: 16
learning_rate: 2e-05
num_train_epochs: 10
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: True
tf32: True
load_best_model_at_end: True
optim: adamw_torch_fused
batch_sampler: no_duplicates
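
To make the schedule concrete, here is a sketch of a cosine learning-rate curve with linear warmup using the values above (peak 2e-05, warmup ratio 0.1). The total of 80 optimizer steps is an assumption based on the final step in the training log, and the formula is the common linear-warmup-then-cosine-decay shape; the trainer's scheduler may differ in details. `lr_at_step` is a hypothetical helper.

```python
import math

PEAK_LR = 2e-05       # learning_rate
WARMUP_RATIO = 0.1    # warmup_ratio
TOTAL_STEPS = 80      # assumed: last step shown in the training log

def lr_at_step(step, peak=PEAK_LR, total=TOTAL_STEPS, warmup_ratio=WARMUP_RATIO):
    """Linear warmup to `peak`, then cosine decay to zero."""
    warmup_steps = math.ceil(total * warmup_ratio)
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total - warmup_steps)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))

# Print the learning rate at a few points along the run.
for step in (0, 4, 8, 44, 80):
    print(step, lr_at_step(step))
```

Under these assumptions the rate climbs for the first 8 steps, peaks at 2e-05, and decays to zero by step 80.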

Training logs

Epoch	Step	Training Loss	dim_1024_cosine_map@100	dim_128_cosine_map@100	dim_256_cosine_map@100	dim_512_cosine_map@100	dim_64_cosine_map@100	dim_768_cosine_map@100
0.9552	8	-	0.9090	0.8848	0.8992	0.9052	0.8775	0.9030
1.1940	10	0.4749	-	-	-	-	-	-
1.9104	16	-	0.9170	0.9095	0.9109	0.9201	0.8961	0.9212
2.3881	20	0.0862	-	-	-	-	-	-
2.9851	25	-	0.9190	0.9071	0.9160	0.9278	0.8998	0.9234
3.5821	30	0.0315	-	-	-	-	-	-
3.9403	33	-	0.9183	0.9053	0.9122	0.9287	0.8998	0.9183
4.7761	40	0.0184	-	-	-	-	-	-
4.8955	41	-	0.9225	0.9125	0.9164	0.9260	0.8985	0.9220
5.9701	50	0.0135	0.9268	0.9132	0.9208	0.9257	0.8961	0.9271
6.9254	58	-	0.9254	0.9158	0.9202	0.9212	0.8938	0.9213
7.1642	60	0.0123	-	-	-	-	-	-
8.0	67	-	0.9253	0.916	0.9228	0.9207	0.8972	0.9243
8.3582	70	0.01	-	-	-	-	-	-
8.9552	75	-	0.9254	0.9160	0.9213	0.9207	0.9005	0.9245
9.5522	80	0.0088	0.9254	0.9160	0.9228	0.9207	0.8975	0.9245

Note: the bold row denotes the saved checkpoint.
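
The Epoch column in the log can be reproduced from the dataset size and batch settings, assuming a single device: 4275 samples at a batch size of 32 give 134 batches per epoch, and with 16 gradient-accumulation steps each optimizer step consumes 16 of them (512 samples). A short check of that arithmetic follows; the single-device assumption and the helper name are not stated in the model card.

```python
import math

TRAIN_SAMPLES = 4275
BATCH_SIZE = 32          # per_device_train_batch_size
GRAD_ACCUM = 16          # gradient_accumulation_steps

# Assumes one device, so the per-device batch size is the global one.
batches_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)

def epoch_at_step(step):
    """Fractional epoch reached after `step` optimizer updates."""
    return step * GRAD_ACCUM / batches_per_epoch

# Compare with the Epoch column for steps 8, 10, 67, and 80.
for step in (8, 10, 67, 80):
    print(step, round(epoch_at_step(step), 4))
```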

Framework versions

Python: 3.10.6
Sentence Transformers: 3.0.1
Transformers: 4.41.2
PyTorch: 2.1.2+cu121
Accelerate: 0.32.1
Datasets: 2.19.1
Tokenizers: 0.19.1

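🔧 Technical details

The model is fine-tuned from Alibaba-NLP/gte-large-en-v1.5 using MatryoshkaLoss wrapped around MultipleNegativesRankingLoss. During training, the loss is computed on embeddings truncated to 1024, 768, 512, 256, 128, and 64 dimensions, each with weight 1, so that the model learns representations at several sizes for financial retrieval. Training ran for 10 epochs with a cosine learning-rate schedule (peak 2e-05, warmup ratio 0.1), a per-device batch size of 32, and 16 gradient-accumulation steps.

📄 License

The model is released under the apache-2.0 license.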

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning}, 
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply}, 
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}