stella_en_400M_v5-FinanceRAG-v2 open-source finance model - semantic retrieval and passage matching for financial documents

Stella En 400M V5 FinanceRAG V2

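Developed by thomaskim1130

A retrieval model for financial-domain retrieval-augmented generation (RAG), optimized from the stella_en_400M_v5 architecture. It supports semantic retrieval and passage matching over financial documents.

Large language model

Safetensors

Other #financial text retrieval #negative-sample ranking optimization #domain-specific RAG

Downloads: 555

Released: 11/29/2024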

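Model overview

This model is optimized for financial document retrieval: it interprets complex financial queries and matches them to relevant text passages. It was trained with Multiple Negatives Ranking Loss and is suited to question-answering systems and financial information retrieval.

Model features

Financial-domain optimization

Trained specifically on specialized content such as financial statements and financial terminology to improve comprehension of financial documents

Efficient passage retrieval

Pinpoints the key passages relevant to a query within long financial documents

Multiple-negatives training

Uses Multiple Negatives Ranking Loss to improve the ability to tell similar passages apart

Model capabilities

Semantic retrieval over financial documents

Query-passage similarity scoring

Support for financial question-answering systems

Locating key information in long texts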

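Use cases

Financial information retrieval

Financial statement lookup

Retrieve the statement passages relevant to a specific financial metric

Accurately retrieves tables and notes that contain the requested financial data

Regulatory filing analysis

Locate specific policy descriptions in SEC filings or annual reports

Quickly surfaces key compliance-related passages

Investment research

Company financial data extraction

Retrieve financial performance data for a specific quarter or year

Precisely matches the financial tables and context that contain the queried metric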

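🚀 Sentence transformer based on thomaskim1130/stella_en_400M_v5-FinanceRAG

This is a sentence-transformers model fine-tuned from thomaskim1130/stella_en_400M_v5-FinanceRAG. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.

✨ Key features

Semantic understanding: maps the meaning of sentences and paragraphs into a 1024-dimensional vector space.
Multi-task support: applicable to semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other natural language processing tasks.
Fine-tuning: further trained on 2,256 financial query-passage pairs to improve retrieval performance on financial documents.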

📦 Installation

First install the Sentence Transformers library:

pip install -U sentence-transformers

💻 Usage

Basic usage

from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")  # placeholder: replace with this model's Hub ID
# Run inference
sentences = [
    "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: Title: \nText: In the year with lowest amount of Deposits with banks Average volume, what's the increasing rate of Deposits with banks Average volume?",
    'Title: \nText: Additional Interest Rate Details Average Balances and Interest Ratesé\x88¥æ\x93\x9cssets(1)(2)(3)(4)\n|  | Average volume | Interest revenue | % Average rate |\n| In millions of dollars, except rates | 2015 | 2014 | 2013 | 2015 | 2014 | 2013 | 2015 | 2014 | 2013 |\n| Assets |  |  |  |  |  |  |  |  |  |\n| Deposits with banks-5 | $133,790 | $161,359 | $144,904 | $727 | $959 | $1,026 | 0.54% | 0.59% | 0.71% |\n| Federal funds sold and securities borrowed or purchased under agreements to resell-6 |  |  |  |  |  |  |  |  |  |\n| In U.S. offices | $150,359 | $153,688 | $158,237 | $1,211 | $1,034 | $1,133 | 0.81% | 0.67% | 0.72% |\n| In offices outside the U.S.-5 | 84,006 | 101,177 | 109,233 | 1,305 | 1,332 | 1,433 | 1.55 | 1.32 | 1.31 |\n| Total | $234,365 | $254,865 | $267,470 | $2,516 | $2,366 | $2,566 | 1.07% | 0.93% | 0.96% |\n| Trading account assets-7(8) |  |  |  |  |  |  |  |  |  |\n| In U.S. offices | $114,639 | $114,910 | $126,123 | $3,945 | $3,472 | $3,728 | 3.44% | 3.02% | 2.96% |\n| In offices outside the U.S.-5 | 103,348 | 119,801 | 127,291 | 2,141 | 2,538 | 2,683 | 2.07 | 2.12 | 2.11 |\n| Total | $217,987 | $234,711 | $253,414 | $6,086 | $6,010 | $6,411 | 2.79% | 2.56% | 2.53% |\n| Investments |  |  |  |  |  |  |  |  |  |\n| In U.S. offices |  |  |  |  |  |  |  |  |  |\n| Taxable | $214,714 | $188,910 | $174,084 | $3,812 | $3,286 | $2,713 | 1.78% | 1.74% | 1.56% |\n| Exempt from U.S. income tax | 20,034 | 20,386 | 18,075 | 443 | 626 | 811 | 2.21 | 3.07 | 4.49 |\n| In offices outside the U.S.-5 | 102,376 | 113,163 | 114,122 | 3,071 | 3,627 | 3,761 | 3.00 | 3.21 | 3.30 |\n| Total | $337,124 | $322,459 | $306,281 | $7,326 | $7,539 | $7,285 | 2.17% | 2.34% | 2.38% |\n| Loans (net of unearned income)(9) |  |  |  |  |  |  |  |  |  |\n| In U.S. 
offices | $354,439 | $361,769 | $354,707 | $24,558 | $26,076 | $25,941 | 6.93% | 7.21% | 7.31% |\n| In offices outside the U.S.-5 | 273,072 | 296,656 | 292,852 | 15,988 | 18,723 | 19,660 | 5.85 | 6.31 | 6.71 |\n| Total | $627,511 | $658,425 | $647,559 | $40,546 | $44,799 | $45,601 | 6.46% | 6.80% | 7.04% |\n| Other interest-earning assets-10 | $55,060 | $40,375 | $38,233 | $1,839 | $507 | $602 | 3.34% | 1.26% | 1.57% |\n| Total interest-earning assets | $1,605,837 | $1,672,194 | $1,657,861 | $59,040 | $62,180 | $63,491 | 3.68% | 3.72% | 3.83% |\n| Non-interest-earning assets-7 | $218,000 | $224,721 | $222,526 |  |  |  |  |  |  |\n| Total assets from discontinued operations | — | — | 2,909 |  |  |  |  |  |  |\n| Total assets | $1,823,837 | $1,896,915 | $1,883,296 |  |  |  |  |  |  |\nNet interest revenue includes the taxable equivalent adjustments related to the tax-exempt bond portfolio (based on the U. S.  federal statutory tax rate of 35%) of $487 million, $498 million and $521 million for 2015, 2014 and 2013, respectively.\nInterest rates and amounts include the effects of risk management activities associated with the respective asset categories.\nMonthly or quarterly averages have been used by certain subsidiaries where daily averages are unavailable.\nDetailed average volume, Interest revenue and Interest expense exclude Discontinued operations.\nSee Note 2 to the Consolidated Financial Statements.\nAverage rates reflect prevailing local interest rates, including inflationary effects and monetary corrections in certain countries.\nAverage volumes of securities borrowed or purchased under agreements to resell are reported net pursuant to ASC 210-20-45.\nHowever, Interest revenue excludes the impact of ASC 210-20-45.\nThe fair value carrying amounts of derivative contracts are reported net, pursuant to ASC 815-10-45, in Non-interest-earning assets and Other non-interest bearing liabilities.\nInterest expense on Trading account liabilities of ICG is reported as 
a reduction of Interest revenue.\nInterest revenue and Interest expense on cash collateral positions are reported in interest on Trading account assets and Trading account liabilities, respectively.\nIncludes cash-basis loans.\nIncludes brokerage receivables.\nDuring 2015, continued management actions, primarily the sale or transfer to held-for-sale of approximately $1.5 billion of delinquent residential first mortgages, including $0.9 billion in the fourth quarter largely associated with the transfer of CitiFinancial loans to held-for-sale referenced above, were the primary driver of the overall improvement in delinquencies within Citi Holdings\x80\x99 residential first mortgage portfolio.\nCredit performance from quarter to quarter could continue to be impacted by the amount of delinquent loan sales or transfers to held-for-sale, as well as overall trends in HPI and interest rates.\nNorth America Residential First Mortgages\x80\x94State Delinquency Trends The following tables set forth the six U. S.  
states and/or regions with the highest concentration of Citi\x80\x99s residential first mortgages.\n| In billions of dollars | December 31, 2015 | December 31, 2014 |\n| State-1 | ENR-2 | ENRDistribution | 90+DPD% | %LTV >100%-3 | RefreshedFICO | ENR-2 | ENRDistribution | 90+DPD% | %LTV >100%-3 | RefreshedFICO |\n| CA | $19.2 | 37% | 0.2% | 1% | 754 | $18.9 | 31% | 0.6% | 2% | 745 |\n| NY/NJ/CT-4 | 12.7 | 25 | 0.8 | 1 | 751 | 12.2 | 20 | 1.9 | 2 | 740 |\n| VA/MD | 2.2 | 4 | 1.2 | 2 | 719 | 3.0 | 5 | 3.0 | 8 | 695 |\n| IL-4 | 2.2 | 4 | 1.0 | 3 | 735 | 2.5 | 4 | 2.5 | 9 | 713 |\n| FL-4 | 2.2 | 4 | 1.1 | 4 | 723 | 2.8 | 5 | 3.0 | 14 | 700 |\n| TX | 1.9 | 4 | 1.0 | — | 711 | 2.5 | 4 | 2.7 | — | 680 |\n| Other | 11.0 | 21 | 1.3 | 2 | 710 | 18.2 | 30 | 3.3 | 7 | 677 |\n| Total-5 | $51.5 | 100% | 0.7% | 1% | 738 | $60.1 | 100% | 2.1% | 4% | 715 |\nNote: Totals may not sum due to rounding.\n(1) Certain of the states are included as part of a region based on Citi\x80\x99s view of similar HPI within the region.\n(2) Ending net receivables.\nExcludes loans in Canada and Puerto Rico, loans guaranteed by U. S.  
government agencies, loans recorded at fair value and loans subject to long term standby commitments (LTSCs).\nExcludes balances for which FICO or LTV data are unavailable.\n(3) LTV ratios (loan balance divided by appraised value) are calculated at origination and updated by applying market price data.\n(4) New York, New Jersey, Connecticut, Florida and Illinois are judicial states.\n(5) Improvement in state trends during 2015 was primarily due to the sale or transfer to held-for-sale of residential first mortgages, including the transfer of CitiFinancial residential first mortgages to held-for-sale in the fourth quarter of 2015.\nForeclosures A substantial majority of Citi\x80\x99s foreclosure inventory consists of residential first mortgages.\nAt December 31, 2015, Citi\x80\x99s foreclosure inventory included approximately $0.1 billion, or 0.2%, of the total residential first mortgage portfolio, compared to $0.6 billion, or 0.9%, at December 31, 2014, based on the dollar amount of ending net receivables of loans in foreclosure inventory, excluding loans that are guaranteed by U. S.  
government agencies and loans subject to LTSCs.\nNorth America Consumer Mortgage Quarterly Credit Trends \x80\x94Net Credit Losses and Delinquencies\x80\x94Home Equity Loans Citi\x80\x99s home equity loan portfolio consists of both fixed-rate home equity loans and loans extended under home equity lines of credit.\nFixed-rate home equity loans are fully amortizing.\nHome equity lines of credit allow for amounts to be drawn for a period of time with the payment of interest only and then, at the end of the draw period, the then-outstanding amount is converted to an amortizing loan (the interest-only payment feature during the revolving period is standard for this product across the industry).\nAfter conversion, the home equity loans typically have a 20-year amortization period.\nAs of December 31, 2015, Citi\x80\x99s home equity loan portfolio of $22.8 billion consisted of $6.3 billion of fixed-rate home equity loans and $16.5 billion of loans extended under home equity lines of credit (Revolving HELOCs).',
    'Title: \nText: Issuer Purchases of Equity Securities Repurchases of common stock are made to support the Company\x80\x99s stock-based employee compensation plans and for other corporate purposes.\nOn February 13, 2006, the Board of Directors authorized the purchase of $2.0 billion of the Company\x80\x99s common stock between February 13, 2006 and February 28, 2007.\nIn August 2006, 3M\x80\x99s Board of Directors authorized the repurchase of an additional $1.0 billion in share repurchases, raising the total authorization to $3.0 billion for the period from February 13, 2006 to February 28, 2007.\nIn February 2007, 3M\x80\x99s Board of Directors authorized a twoyear share repurchase of up to $7.0 billion for the period from February 12, 2007 to February 28, 2009.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the pairwise similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
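The similarity matrix above can be turned into a ranking of passages for a single query. The following is a minimal sketch, assuming the first row of the embeddings is the query and the remaining rows are candidate passages. The helper name `rank_passages` and the toy 3-dimensional vectors are illustrative stand-ins for real `model.encode(...)` output, not part of the model card.

```python
import numpy as np

def rank_passages(query_vec, passage_vecs):
    """Return passage indices sorted by cosine similarity to the query, best first."""
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    scores = p @ q                      # cosine similarity per passage
    order = np.argsort(-scores)         # descending by score
    return order, scores[order]

# Toy stand-ins for the output of model.encode(sentences) (hypothetical values).
embeddings = np.array([
    [1.0, 0.0, 0.0],   # query
    [0.9, 0.1, 0.0],   # passage 0: close to the query
    [0.0, 1.0, 0.0],   # passage 1: orthogonal to the query
])

order, scores = rank_passages(embeddings[0], embeddings[1:])
print(order)   # passage indices, most similar first
print(scores)  # matching cosine scores
```

With real embeddings, the same call would take `embeddings[0]` as the query vector and `embeddings[1:]` as the passages.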

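📚 Documentation

Model details

Model description

Property	Details
Model type	Sentence Transformer
Base model	thomaskim1130/stella_en_400M_v5-FinanceRAG
Maximum sequence length	512 tokens
Output dimensionality	1024 dimensions
Similarity function	Cosine similarity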

Model sources

Documentation: Sentence Transformers documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full model architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 1024, 'out_features': 1024, 'bias': True, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
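The architecture printout lists mean-token pooling followed by a Dense layer with an identity activation. A minimal NumPy sketch of those two steps is shown below, assuming padding positions are marked with 0 in an attention mask. The function names, the toy hidden size of 2, and the identity weight matrix are illustrative assumptions; the real model uses 1024 features and learned weights.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padding (mask == 0)."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

def dense_identity(x, weight, bias):
    """Linear projection with identity activation: x @ W^T + b."""
    return x @ weight.T + bias

# Toy batch: 1 sequence, 3 token slots (the last is padding), hidden size 2.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = mean_pool(tokens, mask)                      # padding is excluded
out = dense_identity(pooled, np.eye(2), np.zeros(2))  # hypothetical weights
print(pooled, out)
```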

Evaluation

Metrics

Information retrieval

Dataset: Evaluate
Evaluation method: evaluated with InformationRetrievalEvaluator

Metric	Value
cosine_accuracy@1	0.4636
cosine_accuracy@3	0.682
cosine_accuracy@5	0.7597
cosine_accuracy@10	0.8519
cosine_precision@1	0.4636
cosine_precision@3	0.2565
cosine_precision@5	0.1777
cosine_precision@10	0.1024
cosine_recall@1	0.4095
cosine_recall@3	0.6424
cosine_recall@5	0.7299
cosine_recall@10	0.8398
cosine_ndcg@10	0.6409
cosine_mrr@10	0.5902
cosine_map@100	0.5753
dot_accuracy@1	0.4393
dot_accuracy@3	0.6748
dot_accuracy@5	0.7354
dot_accuracy@10	0.8422
dot_precision@1	0.4393
dot_precision@3	0.25
dot_precision@5	0.1709
dot_precision@10	0.0998
dot_recall@1	0.3828
dot_recall@3	0.6338
dot_recall@5	0.7005
dot_recall@10	0.8224
dot_ndcg@10	0.6195
dot_mrr@10	0.5712
dot_map@100	0.5528
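To make the metric names in the table concrete, here is a small per-query sketch of accuracy@k, precision@k, recall@k, and MRR@k as they are commonly defined for retrieval. The function name and the document IDs are hypothetical, and the reported values would be averages of such per-query numbers over the evaluation set. Under these definitions accuracy@1 and precision@1 coincide, which matches the table, while recall@1 can be lower when a query has more than one relevant passage.

```python
def ir_metrics_at_k(ranked_ids, relevant_ids, k):
    """Per-query retrieval metrics for the top-k results of one ranked list."""
    top_k = ranked_ids[:k]
    hits = [doc_id for doc_id in top_k if doc_id in relevant_ids]

    # Reciprocal rank of the first relevant result within the top k (0 if none).
    mrr = 0.0
    for rank, doc_id in enumerate(top_k, start=1):
        if doc_id in relevant_ids:
            mrr = 1.0 / rank
            break

    return {
        "accuracy": 1.0 if hits else 0.0,          # at least one hit in top k
        "precision": len(hits) / k,                # share of top k that is relevant
        "recall": len(hits) / len(relevant_ids),   # share of relevant docs found
        "mrr": mrr,
    }

# Hypothetical ranking for one query with two relevant passages.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d9"}
metrics = ir_metrics_at_k(ranked, relevant, k=3)
print(metrics)
```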

Training details

Training dataset

Unnamed dataset

Size: 2,256 training samples
Columns: sentence_0 and sentence_1
Approximate statistics (based on the first 1000 samples):
	sentence_0	sentence_1
Type	string	string
Details	min: 28 tokens, mean: 45.02 tokens, max: 114 tokens	min: 23 tokens, mean: 406.36 tokens, max: 512 tokens

Samples:

sentence_0	sentence_1
`Instruct: Given a web search query, retrieve relevant passages that answer the query. Query: Title: Text: What do all Notional sum up, excluding those negative ones in 2008 for As of December 31, 2008 for Financial assets with interest rate risk? (in million)`	Title: Text: Cash Flows Our estimated future benefit payments for funded and unfunded plans are as follows (in millions): 1 The expected benefit payments for our other postretirement benefit plans are net of estimated federal subsidies expected to be received under the Medicare Prescription Drug, Improvement and Modernization Act of 2003. Federal subsidies are estimated to be $3 million for the period 2019-2023 and $2 million for the period 2024-2028. The Company anticipates making pension contributions in 2019 of $32 million, all of which will be allocated to our international plans. The majority of these contributions are required by funding regulations or law.
`Instruct: Given a web search query, retrieve relevant passages that answer the query. Query: Title: Text: what's the total amount of No surrender charge of 2010 Individual Fixed Annuities, Change in cash of 2008, and Total reserves of 2010 Individual Variable Annuities ?`	Title: Text: 2010 and 2009 Comparison Surrender rates have improved compared to the prior year for group retirement products, individual fixed annuities and individual variable annuities as surrenders have returned to more normal levels. Surrender rates for individual fixed annuities have decreased significantly in 2010 due to the low interest rate environment and the relative competitiveness of interest credited rates on the existing block of fixed annuities versus interest rates on alternative investment options available in the marketplace. Surrender rates for group retirement products are expected to increase in 2011 as certain large group surrenders are anticipated.2009 and 2008 Comparison Surrenders and other withdrawals increased in 2009 for group retirement products primarily due to higher large group surrenders. However, surrender rates and withdrawals have improved for individual fixed annuities and individual variable annuities. The following table presents reserves by surrender charge category and surrender rates:
`Instruct: Given a web search query, retrieve relevant passages that answer the query. Query: Title: Text: What was the total amount of elements for RevPAR excluding those elements greater than 150 in 2016 ?`	`Title: Text: 2016 Compared to 2015 Comparable?Company-Operated North American Properties`
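The sample rows follow a fixed text template: queries carry an `Instruct:` prefix and a `Query:` field, and both sides use `Title:` / `Text:` fields, with the title left empty in these samples. A small sketch of helpers that reproduce that template is given below, using the newline placement shown in the usage example. The helper names `format_query` and `format_passage` are hypothetical and not part of the model or library.

```python
INSTRUCTION = "Given a web search query, retrieve relevant passages that answer the query."

def format_query(text, title=""):
    """Build a query string in the template seen in the training samples."""
    return f"Instruct: {INSTRUCTION}\nQuery: Title: {title}\nText: {text}"

def format_passage(text, title=""):
    """Build a passage string in the template seen in the training samples."""
    return f"Title: {title}\nText: {text}"

query = format_query("What was the total amount of elements for RevPAR in 2016?")
passage = format_passage("2016 Compared to 2015 ...")
print(query)
print(passage)
```

Formatting inference-time inputs the same way as the training data is usually a safe default, although the card does not state whether it is required.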

Loss: MultipleNegativesRankingLoss with the following parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
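For intuition about these two parameters, the sketch below re-implements the idea of the loss in NumPy: every other positive in the batch acts as a negative, cosine similarities are multiplied by `scale`, and a cross-entropy is taken with the matching pair as the target. This is an illustrative reimplementation under those assumptions, not the library's code, and the toy vectors are made up.

```python
import numpy as np

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities; target is the diagonal."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                         # (batch, batch) scores
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy batch: each anchor matches its own positive exactly ...
aligned = np.eye(3)
loss_good = multiple_negatives_ranking_loss(aligned, aligned)
# ... versus positives shifted by one row, so every pairing is wrong.
loss_bad = multiple_negatives_ranking_loss(aligned, np.roll(aligned, 1, axis=0))
print(loss_good, loss_bad)
```

A larger `scale` sharpens the softmax, so mismatched pairs are penalized more strongly.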

Training hyperparameters

Non-default hyperparameters

eval_strategy: steps
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
num_train_epochs: 2
fp16: True
batch_sampler: no_duplicates
multi_dataset_batch_sampler: round_robin

All hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1
num_train_epochs: 2
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.0
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
batch_sampler: no_duplicates
multi_dataset_batch_sampler: round_robin
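The list sets `learning_rate` to 5e-05 with a `linear` scheduler and no warmup. As a rough illustration of what that implies, the sketch below computes a linearly decaying learning rate over 282 optimizer steps, the final step count reported in the training logs. The function `linear_lr` is a hypothetical helper written for this example. It mirrors the usual linear-decay-with-warmup formula rather than calling the training library.

```python
def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 282  # final step count from the training logs
lr_start = linear_lr(0, total_steps)
lr_mid = linear_lr(141, total_steps)
lr_end = linear_lr(total_steps, total_steps)
print(lr_start, lr_mid, lr_end)
```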

Training logs

Epoch	Step	Evaluate_cosine_map@100
0	0	0.4564
1.0	141	0.5233
2.0	282	0.5753
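The step counts in the log are consistent with the dataset size and batch size reported earlier. A short arithmetic check is sketched below, assuming a single device, `gradient_accumulation_steps` of 1, and no dropped last batch, as listed in the hyperparameters. The variable names are illustrative.

```python
import math

train_samples = 2256   # training set size from the card
batch_size = 16        # per_device_train_batch_size
epochs = 2             # num_train_epochs

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Absolute gain in cosine map@100 between step 0 and the final step.
map_gain = round(0.5753 - 0.4564, 4)

print(steps_per_epoch, total_steps, map_gain)
```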

Framework versions

Python: 3.10.12
Sentence Transformers: 3.1.1
Transformers: 4.45.2
PyTorch: 2.5.1+cu121
Accelerate: 1.1.1
Datasets: 3.1.0
Tokenizers: 0.20.3

📄 Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}