Stella En 400M V5 FinanceRAG V2
Developed by thomaskim1130
A finance-domain retrieval-augmented generation (RAG) model optimized on the stella_en_400M_v5 architecture, supporting semantic retrieval and passage matching for financial documents
Downloads: 555
Released: 11/29/2024
Model Overview
This model is optimized for financial document retrieval: it understands complex financial queries and matches them to the relevant text passages. It was trained with Multiple Negatives Ranking Loss and is suited to question answering and financial information retrieval scenarios.
Model Features
Finance-domain optimization
Trained on financial statements, financial terminology, and other specialist content to improve comprehension of financial documents
Efficient passage retrieval
Pinpoints the passages in long financial documents that are most relevant to a query
Multiple-negatives training
Trained with Multiple Negatives Ranking Loss to better discriminate between similar passages
Model Capabilities
Semantic retrieval over financial documents
Query-passage similarity scoring
Support for financial question answering systems
Locating key information in long documents
Use Cases
Financial Information Retrieval
Financial statement queries
Retrieve the statement passages related to a specific financial metric
Accurately retrieves the tables and notes that contain the requested financial figures
Regulatory filing analysis
Locate specific policy language in SEC filings or annual reports
Quickly finds the key passages relevant to compliance
Investment Research
Company financial data extraction
Retrieve financial performance data for a specific quarter or year
Precisely matches the financial tables and surrounding context that contain the queried metric
🚀 SentenceTransformer based on thomaskim1130/stella_en_400M_v5-FinanceRAG
This is a sentence-transformers model fine-tuned from thomaskim1130/stella_en_400M_v5-FinanceRAG. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
✨ Key Features
- Semantic understanding: captures the meaning of sentences and paragraphs and maps them accurately into a 1024-dimensional vector space.
- Multi-task support: applicable to semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other NLP tasks.
- Fine-tuned: adapted on a domain-specific dataset, improving performance on finance retrieval tasks.
📦 Installation
First, install the Sentence Transformers library:
pip install -U sentence-transformers
💻 Usage Examples
Basic usage
from sentence_transformers import SentenceTransformer
# Download the model from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
"Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: Title: \nText: In the year with lowest amount of Deposits with banks Average volume, what's the increasing rate of Deposits with banks Average volume?",
'Title: \nText: Additional Interest Rate Details Average Balances and Interest Rates—Assets(1)(2)(3)(4)\n| | Average volume | Interest revenue | % Average rate |\n| In millions of dollars, except rates | 2015 | 2014 | 2013 | 2015 | 2014 | 2013 | 2015 | 2014 | 2013 |\n| Assets | | | | | | | | | |\n| Deposits with banks-5 | $133,790 | $161,359 | $144,904 | $727 | $959 | $1,026 | 0.54% | 0.59% | 0.71% |\n| Federal funds sold and securities borrowed or purchased under agreements to resell-6 | | | | | | | | | |\n| In U.S. offices | $150,359 | $153,688 | $158,237 | $1,211 | $1,034 | $1,133 | 0.81% | 0.67% | 0.72% |\n| In offices outside the U.S.-5 | 84,006 | 101,177 | 109,233 | 1,305 | 1,332 | 1,433 | 1.55 | 1.32 | 1.31 |\n| Total | $234,365 | $254,865 | $267,470 | $2,516 | $2,366 | $2,566 | 1.07% | 0.93% | 0.96% |\n| Trading account assets-7(8) | | | | | | | | | |\n| In U.S. offices | $114,639 | $114,910 | $126,123 | $3,945 | $3,472 | $3,728 | 3.44% | 3.02% | 2.96% |\n| In offices outside the U.S.-5 | 103,348 | 119,801 | 127,291 | 2,141 | 2,538 | 2,683 | 2.07 | 2.12 | 2.11 |\n| Total | $217,987 | $234,711 | $253,414 | $6,086 | $6,010 | $6,411 | 2.79% | 2.56% | 2.53% |\n| Investments | | | | | | | | | |\n| In U.S. offices | | | | | | | | | |\n| Taxable | $214,714 | $188,910 | $174,084 | $3,812 | $3,286 | $2,713 | 1.78% | 1.74% | 1.56% |\n| Exempt from U.S. income tax | 20,034 | 20,386 | 18,075 | 443 | 626 | 811 | 2.21 | 3.07 | 4.49 |\n| In offices outside the U.S.-5 | 102,376 | 113,163 | 114,122 | 3,071 | 3,627 | 3,761 | 3.00 | 3.21 | 3.30 |\n| Total | $337,124 | $322,459 | $306,281 | $7,326 | $7,539 | $7,285 | 2.17% | 2.34% | 2.38% |\n| Loans (net of unearned income)(9) | | | | | | | | | |\n| In U.S. offices | $354,439 | $361,769 | $354,707 | $24,558 | $26,076 | $25,941 | 6.93% | 7.21% | 7.31% |\n| In offices outside the U.S.-5 | 273,072 | 296,656 | 292,852 | 15,988 | 18,723 | 19,660 | 5.85 | 6.31 | 6.71 |\n| Total | $627,511 | $658,425 | $647,559 | $40,546 | $44,799 | $45,601 | 6.46% | 6.80% | 7.04% |\n| Other interest-earning assets-10 | $55,060 | $40,375 | $38,233 | $1,839 | $507 | $602 | 3.34% | 1.26% | 1.57% |\n| Total interest-earning assets | $1,605,837 | $1,672,194 | $1,657,861 | $59,040 | $62,180 | $63,491 | 3.68% | 3.72% | 3.83% |\n| Non-interest-earning assets-7 | $218,000 | $224,721 | $222,526 | | | | | | |\n| Total assets from discontinued operations | — | — | 2,909 | | | | | | |\n| Total assets | $1,823,837 | $1,896,915 | $1,883,296 | | | | | | |\nNet interest revenue includes the taxable equivalent adjustments related to the tax-exempt bond portfolio (based on the U. S.
federal statutory tax rate of 35%) of $487 million, $498 million and $521 million for 2015, 2014 and 2013, respectively.\nInterest rates and amounts include the effects of risk management activities associated with the respective asset categories.\nMonthly or quarterly averages have been used by certain subsidiaries where daily averages are unavailable.\nDetailed average volume, Interest revenue and Interest expense exclude Discontinued operations.\nSee Note 2 to the Consolidated Financial Statements.\nAverage rates reflect prevailing local interest rates, including inflationary effects and monetary corrections in certain countries.\nAverage volumes of securities borrowed or purchased under agreements to resell are reported net pursuant to ASC 210-20-45.\nHowever, Interest revenue excludes the impact of ASC 210-20-45.\nThe fair value carrying amounts of derivative contracts are reported net, pursuant to ASC 815-10-45, in Non-interest-earning assets and Other non-interest bearing liabilities.\nInterest expense on Trading account liabilities of ICG is reported as a reduction of Interest revenue.\nInterest revenue and Interest expense on cash collateral positions are reported in interest on Trading account assets and Trading account liabilities, respectively.\nIncludes cash-basis loans.\nIncludes brokerage receivables.\nDuring 2015, continued management actions, primarily the sale or transfer to held-for-sale of approximately $1.5 billion of delinquent residential first mortgages, including $0.9 billion in the fourth quarter largely associated with the transfer of CitiFinancial loans to held-for-sale referenced above, were the primary driver of the overall improvement in delinquencies within Citi Holdings' residential first mortgage portfolio.\nCredit performance from quarter to quarter could continue to be impacted by the amount of delinquent loan sales or transfers to held-for-sale, as well as overall trends in HPI and interest rates.\nNorth America Residential First Mortgages—State Delinquency Trends The following tables set forth the six U. S. states and/or regions with the highest concentration of Citi's residential first mortgages.\n| In billions of dollars | December 31, 2015 | December 31, 2014 |\n| State-1 | ENR-2 | ENRDistribution | 90+DPD% | %LTV >100%-3 | RefreshedFICO | ENR-2 | ENRDistribution | 90+DPD% | %LTV >100%-3 | RefreshedFICO |\n| CA | $19.2 | 37% | 0.2% | 1% | 754 | $18.9 | 31% | 0.6% | 2% | 745 |\n| NY/NJ/CT-4 | 12.7 | 25 | 0.8 | 1 | 751 | 12.2 | 20 | 1.9 | 2 | 740 |\n| VA/MD | 2.2 | 4 | 1.2 | 2 | 719 | 3.0 | 5 | 3.0 | 8 | 695 |\n| IL-4 | 2.2 | 4 | 1.0 | 3 | 735 | 2.5 | 4 | 2.5 | 9 | 713 |\n| FL-4 | 2.2 | 4 | 1.1 | 4 | 723 | 2.8 | 5 | 3.0 | 14 | 700 |\n| TX | 1.9 | 4 | 1.0 | — | 711 | 2.5 | 4 | 2.7 | — | 680 |\n| Other | 11.0 | 21 | 1.3 | 2 | 710 | 18.2 | 30 | 3.3 | 7 | 677 |\n| Total-5 | $51.5 | 100% | 0.7% | 1% | 738 | $60.1 | 100% | 2.1% | 4% | 715 |\nNote: Totals may not sum due to rounding.\n(1) Certain of the states are included as part of a region based on Citi's view of similar HPI within the region.\n(2) Ending net receivables.\nExcludes loans in Canada and Puerto Rico, loans guaranteed by U. S.
government agencies, loans recorded at fair value and loans subject to long term standby commitments (LTSCs).\nExcludes balances for which FICO or LTV data are unavailable.\n(3) LTV ratios (loan balance divided by appraised value) are calculated at origination and updated by applying market price data.\n(4) New York, New Jersey, Connecticut, Florida and Illinois are judicial states.\n(5) Improvement in state trends during 2015 was primarily due to the sale or transfer to held-for-sale of residential first mortgages, including the transfer of CitiFinancial residential first mortgages to held-for-sale in the fourth quarter of 2015.\nForeclosures A substantial majority of Citi's foreclosure inventory consists of residential first mortgages.\nAt December 31, 2015, Citi's foreclosure inventory included approximately $0.1 billion, or 0.2%, of the total residential first mortgage portfolio, compared to $0.6 billion, or 0.9%, at December 31, 2014, based on the dollar amount of ending net receivables of loans in foreclosure inventory, excluding loans that are guaranteed by U. S. government agencies and loans subject to LTSCs.\nNorth America Consumer Mortgage Quarterly Credit Trends—Net Credit Losses and Delinquencies—Home Equity Loans Citi's home equity loan portfolio consists of both fixed-rate home equity loans and loans extended under home equity lines of credit.\nFixed-rate home equity loans are fully amortizing.\nHome equity lines of credit allow for amounts to be drawn for a period of time with the payment of interest only and then, at the end of the draw period, the then-outstanding amount is converted to an amortizing loan (the interest-only payment feature during the revolving period is standard for this product across the industry).\nAfter conversion, the home equity loans typically have a 20-year amortization period.\nAs of December 31, 2015, Citi's home equity loan portfolio of $22.8 billion consisted of $6.3 billion of fixed-rate home equity loans and $16.5 billion of loans extended under home equity lines of credit (Revolving HELOCs).',
'Title: \nText: Issuer Purchases of Equity Securities Repurchases of common stock are made to support the Company's stock-based employee compensation plans and for other corporate purposes.\nOn February 13, 2006, the Board of Directors authorized the purchase of $2.0 billion of the Company's common stock between February 13, 2006 and February 28, 2007.\nIn August 2006, 3M's Board of Directors authorized the repurchase of an additional $1.0 billion in share repurchases, raising the total authorization to $3.0 billion for the period from February 13, 2006 to February 28, 2007.\nIn February 2007, 3M's Board of Directors authorized a twoyear share repurchase of up to $7.0 billion for the period from February 12, 2007 to February 28, 2009.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
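Retrieval usage
The snippet above only compares the three sample texts with each other. For retrieval, queries are typically prefixed with the same instruction used in the training data, while passages are encoded as plain "Title/Text" strings. The following is a minimal sketch of that pattern, assuming the standard sentence-transformers util.semantic_search helper; the corpus strings and the model-id placeholder are illustrative only.
from sentence_transformers import SentenceTransformer, util
# Illustrative retrieval sketch; "sentence_transformers_model_id" is the same placeholder as above
model = SentenceTransformer("sentence_transformers_model_id")
# Queries are prefixed with the retrieval instruction seen in the training samples
prompt = "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: "
query = prompt + "Title: \nText: What was the average volume of Deposits with banks in 2015?"
# Toy corpus; real passages would be chunks of financial filings
corpus = [
    "Title: \nText: Deposits with banks had an average volume of $133,790 million in 2015.",
    "Title: \nText: In February 2007, the Board of Directors authorized a two-year share repurchase of up to $7.0 billion.",
]
query_embedding = model.encode(query)
corpus_embeddings = model.encode(corpus)
# Cosine-similarity top-k search over the corpus embeddings
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(round(hit["score"], 4), corpus[hit["corpus_id"]])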
📚 Documentation
Model Details
Model Description
Property | Details |
---|---|
Model type | Sentence Transformer |
Base model | thomaskim1130/stella_en_400M_v5-FinanceRAG |
Maximum sequence length | 512 tokens |
Output dimensionality | 1024 dimensions |
Similarity function | Cosine similarity |
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: NewModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Dense({'in_features': 1024, 'out_features': 1024, 'bias': True, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
Evaluation
Metrics
Information Retrieval
- Dataset: Evaluate
- Evaluated with InformationRetrievalEvaluator (a minimal setup sketch follows the metric table below)
Metric | Value |
---|---|
cosine_accuracy@1 | 0.4636 |
cosine_accuracy@3 | 0.682 |
cosine_accuracy@5 | 0.7597 |
cosine_accuracy@10 | 0.8519 |
cosine_precision@1 | 0.4636 |
cosine_precision@3 | 0.2565 |
cosine_precision@5 | 0.1777 |
cosine_precision@10 | 0.1024 |
cosine_recall@1 | 0.4095 |
cosine_recall@3 | 0.6424 |
cosine_recall@5 | 0.7299 |
cosine_recall@10 | 0.8398 |
cosine_ndcg@10 | 0.6409 |
cosine_mrr@10 | 0.5902 |
cosine_map@100 | 0.5753 |
dot_accuracy@1 | 0.4393 |
dot_accuracy@3 | 0.6748 |
dot_accuracy@5 | 0.7354 |
dot_accuracy@10 | 0.8422 |
dot_precision@1 | 0.4393 |
dot_precision@3 | 0.25 |
dot_precision@5 | 0.1709 |
dot_precision@10 | 0.0998 |
dot_recall@1 | 0.3828 |
dot_recall@3 | 0.6338 |
dot_recall@5 | 0.7005 |
dot_recall@10 | 0.8224 |
dot_ndcg@10 | 0.6195 |
dot_mrr@10 | 0.5712 |
dot_map@100 | 0.5528 |
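The metric names above follow the convention of Sentence Transformers' InformationRetrievalEvaluator. Below is a minimal, hypothetical sketch of how such an evaluation is set up; the query/passage texts and ids are placeholders, not the actual Evaluate split.
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator
model = SentenceTransformer("sentence_transformers_model_id")
# queries and corpus map ids to texts; relevant_docs maps each query id
# to the set of corpus ids that count as relevant
queries = {"q1": "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: Title: \nText: ..."}
corpus = {"d1": "Title: \nText: ...", "d2": "Title: \nText: ..."}
relevant_docs = {"q1": {"d1"}}
evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="Evaluate")
results = evaluator(model)
# results contains cosine/dot accuracy@k, precision@k, recall@k, ndcg@10, mrr@10 and map@100
print(results)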
Training Details
Training Dataset
Unnamed Dataset
- Size: 2,256 training samples
- Columns: sentence_0 and sentence_1
- Approximate statistics (based on the first 1,000 samples):
sentence_0 | sentence_1 |
---|---|
type: string | type: string |
min: 28 tokens, mean: 45.02 tokens, max: 114 tokens | min: 23 tokens, mean: 406.36 tokens, max: 512 tokens |
- Samples:

Example 1
sentence_0: Instruct: Given a web search query, retrieve relevant passages that answer the query.
Query: Title:
Text: What do all Notional sum up, excluding those negative ones in 2008 for As of December 31, 2008 for Financial assets with interest rate risk? (in million)
sentence_1: Title:
Text: Cash Flows Our estimated future benefit payments for funded and unfunded plans are as follows (in millions): 1 The expected benefit payments for our other postretirement benefit plans are net of estimated federal subsidies expected to be received under the Medicare Prescription Drug, Improvement and Modernization Act of 2003. Federal subsidies are estimated to be $3 million for the period 2019-2023 and $2 million for the period 2024-2028. The Company anticipates making pension contributions in 2019 of $32 million, all of which will be allocated to our international plans. The majority of these contributions are required by funding regulations or law.

Example 2
sentence_0: Instruct: Given a web search query, retrieve relevant passages that answer the query.
Query: Title:
Text: what's the total amount of No surrender charge of 2010 Individual Fixed Annuities, Change in cash of 2008, and Total reserves of 2010 Individual Variable Annuities ?
sentence_1: Title:
Text: 2010 and 2009 Comparison Surrender rates have improved compared to the prior year for group retirement products, individual fixed annuities and individual variable annuities as surrenders have returned to more normal levels. Surrender rates for individual fixed annuities have decreased significantly in 2010 due to the low interest rate environment and the relative competitiveness of interest credited rates on the existing block of fixed annuities versus interest rates on alternative investment options available in the marketplace. Surrender rates for group retirement products are expected to increase in 2011 as certain large group surrenders are anticipated. 2009 and 2008 Comparison Surrenders and other withdrawals increased in 2009 for group retirement products primarily due to higher large group surrenders. However, surrender rates and withdrawals have improved for individual fixed annuities and individual variable annuities. The following table presents reserves by surrender charge category and surrender rates:

Example 3
sentence_0: Instruct: Given a web search query, retrieve relevant passages that answer the query.
Query: Title:
Text: What was the total amount of elements for RevPAR excluding those elements greater than 150 in 2016 ?
sentence_1: Title:
Text: 2016 Compared to 2015 Comparable Company-Operated North American Properties -
- Loss: MultipleNegativesRankingLoss with these parameters:
{ "scale": 20.0, "similarity_fct": "cos_sim" }
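As a rough sketch, and assuming the standard Sentence Transformers v3 API, the loss above would be constructed as follows; the base-model checkpoint is taken from the model description, everything else is illustrative.
from sentence_transformers import SentenceTransformer, losses, util
model = SentenceTransformer("thomaskim1130/stella_en_400M_v5-FinanceRAG")
# Every other passage in a batch serves as an in-batch negative for a query,
# which is why the no-duplicates batch sampler listed below matters
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)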
Training Hyperparameters
Non-default hyperparameters (these map onto the trainer configuration sketched after the full hyperparameter list below)
- eval_strategy: steps
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- num_train_epochs: 2
- fp16: True
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: round_robin
All hyperparameters
Click to expand
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 2
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: round_robin
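Assuming the Sentence Transformers v3 trainer API matching the framework versions listed below, the non-default hyperparameters above correspond roughly to the following setup; the dataset contents, output directory, and eval dataset are placeholders, not the actual training data.
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from sentence_transformers.training_args import BatchSamplers, MultiDatasetBatchSamplers

model = SentenceTransformer("thomaskim1130/stella_en_400M_v5-FinanceRAG")

# Placeholder (query, passage) pairs standing in for the 2,256-sample training set
train_dataset = Dataset.from_dict({
    "sentence_0": ["Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: Title: \nText: ..."],
    "sentence_1": ["Title: \nText: ..."],
})

loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

args = SentenceTransformerTrainingArguments(
    output_dir="stella-financerag-v2",  # hypothetical output path
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    fp16=True,
    eval_strategy="steps",
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # "no duplicates" sampling
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,  # "round robin" sampling
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # stand-in; the card reports InformationRetrievalEvaluator results instead
    loss=loss,
)
trainer.train()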
Training Logs
Epoch | Step | Evaluate_cosine_map@100 |
---|---|---|
0 | 0 | 0.4564 |
1.0 | 141 | 0.5233 |
2.0 | 282 | 0.5753 |
Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.1.1
- Transformers: 4.45.2
- PyTorch: 2.5.1+cu121
- Accelerate: 1.1.1
- Datasets: 3.1.0
- Tokenizers: 0.20.3
📄 License
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}