Finance2_embedding_small_en-V1.5 open-source model - for financial semantic similarity and search tasks

Finance2 Embedding Small En V1.5

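Developed by baconnier

A sentence embedding model fine-tuned from BAAI/bge-small-en-v1.5 on a financial dataset, intended for semantic textual similarity, semantic search, and related tasks.

Text Embedding

Safetensors

#FinancialSemanticEmbedding #HighPrecisionSimilarity #MultipleNegativesTraining

Downloads: 2,120

Release date: 6/9/2024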

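Model Overview

The model maps sentences and paragraphs to a 384-dimensional dense vector space. It is aimed at financial text processing tasks such as semantic similarity scoring, text classification, and clustering.

Model Features

Financial domain optimization

Fine-tuned on a specialized financial dataset to better capture financial terms and concepts

Efficient vector representation

Converts text to 384-dimensional dense vectors, which suits large-scale semantic search

Multiple similarity measures

Embeddings can be compared with cosine similarity, dot product, Manhattan distance, or Euclidean distance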
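
To make the four measures concrete, here is a minimal sketch in plain Python. The two toy vectors are hypothetical stand-ins for real 384-dimensional embeddings. Because they have unit length, the dot product and the cosine similarity coincide, while the two distance measures run the other way (smaller means closer).

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def manhattan(u, v):
    """Manhattan (L1) distance."""
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two toy unit-length vectors (hypothetical, not real model output).
u = [0.6, 0.8]
v = [0.8, 0.6]

print(cosine(u, v), dot(u, v))           # both about 0.96, since |u| = |v| = 1
print(manhattan(u, v), euclidean(u, v))  # distances: smaller means closer
```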

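Model Capabilities

Semantic textual similarity

Feature extraction from financial text

Semantic search

Text classification

Clustering

Use Cases

Financial Information Retrieval

Financial Q&A systems

Matches users' financial questions to the most relevant answers in a knowledge base

High-accuracy semantic matching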
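
The matching step in such a Q&A system can be sketched as a nearest-neighbor lookup over answer embeddings. The example below is an illustration only: `top_k` is a hypothetical helper, and the 3-dimensional vectors stand in for what the model would produce for a question and a set of knowledge-base answers.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def top_k(query_vec, answer_vecs, k=2):
    """Return (index, score) pairs for the k answers most similar to the query."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(answer_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical embeddings, not real model output.
query = [1.0, 0.0, 0.0]
answers = [
    [0.0, 1.0, 0.0],  # unrelated answer
    [0.9, 0.1, 0.0],  # close match
    [0.5, 0.5, 0.0],  # partial match
]

best = top_k(query, answers, k=2)
print([index for index, _ in best])  # [1, 2]
```

For a large knowledge base, the exhaustive scan would normally be replaced by an approximate nearest-neighbor index; the ranking logic stays the same.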

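Financial Document Processing

Financial document clustering

Automatically groups and organizes large collections of financial documents

Improves document management efficiency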
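
As a rough illustration of embedding-based grouping, the sketch below assigns each document vector to the first cluster whose seed is similar enough. This greedy threshold scheme is an assumption chosen for brevity, not a method described by the model author; k-means or agglomerative clustering would be more typical in practice. The 2-dimensional vectors and the 0.8 threshold are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def greedy_clusters(vectors, threshold=0.8):
    """Group vectors by similarity to each cluster's first member (its seed)."""
    seeds = []     # index of the first member of each cluster
    clusters = []  # one list of member indices per cluster
    for i, vec in enumerate(vectors):
        for c, seed in enumerate(seeds):
            if cosine(vec, vectors[seed]) >= threshold:
                clusters[c].append(i)
                break
        else:
            # No existing cluster is close enough: start a new one.
            seeds.append(i)
            clusters.append([i])
    return clusters

# Hypothetical document embeddings, not real model output.
docs = [[1.0, 0.0], [0.95, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(greedy_clusters(docs))  # [[0, 1], [2, 3]]
```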

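🚀 Sentence Transformer based on BAAI/bge-small-en-v1.5

This model was fine-tuned from BAAI/bge-small-en-v1.5 on the baconnier/finance_dataset_small_private dataset using the Sentence Transformers framework. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.

✨ Key Features

Fine-tuned from the pretrained model BAAI/bge-small-en-v1.5 on a financial dataset, so it handles finance-related text better.
Maps text to a 384-dimensional dense vector space, which makes tasks such as semantic similarity scoring straightforward.
Supports several similarity functions, such as cosine similarity.

📦 Installation

First, install the Sentence Transformers library:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage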

from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("baconnier/Finance2_embedding_small_en-V1.5")
# Run inference
sentences = [
    'What is industrial production, and how is it measured by the Federal Reserve Board?',
    'Industrial production is a statistic determined by the Federal Reserve Board that measures the total output of all US factories and mines on a monthly basis. The Fed collects data from various government agencies and trade associations to calculate the industrial production index, which serves as an important economic indicator, providing insight into the health of the manufacturing and mining sectors.\nIndustrial production is a monthly statistic calculated by the Federal Reserve Board, measuring the total output of US factories and mines using data from government agencies and trade associations, serving as a key economic indicator for the manufacturing and mining sectors.',
    'Industrial production is a statistic that measures the output of factories and mines in the US. It is released by the Federal Reserve Board every quarter.\nIndustrial production measures factory and mine output, released quarterly by the Fed.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores between the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
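
The similarity matrix is symmetric with self-similarities on the diagonal, so the useful entries are the off-diagonal ones. The sketch below picks the most similar pair from such a matrix. `most_similar_pair` is a hypothetical helper, and the scores are made-up placeholders rather than real model output; with the real tensor, calling `.tolist()` on it first would yield a nested list of this shape.

```python
def most_similar_pair(sim):
    """Return (i, j, score) for the highest off-diagonal entry, with i < j."""
    best = None
    n = len(sim)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only: skips diagonal and duplicates
            if best is None or sim[i][j] > best[2]:
                best = (i, j, sim[i][j])
    return best

# Placeholder scores, not real model output.
sim = [
    [1.00, 0.82, 0.74],
    [0.82, 1.00, 0.69],
    [0.74, 0.69, 1.00],
]
print(most_similar_pair(sim))  # (0, 1, 0.82)
```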

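📚 Detailed Documentation

Model Details

Model Description

Property	Details
Model type	Sentence Transformer
Base model	BAAI/bge-small-en-v1.5
Maximum sequence length	512 tokens
Output dimensionality	384 dimensions
Similarity function	Cosine similarity
Training dataset	baconnier/finance_dataset_small_private

Model Sources

Documentation: Sentence Transformers documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face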

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
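
The architecture printout shows CLS-token pooling (`pooling_mode_cls_token: True`) followed by a `Normalize()` layer. A minimal sketch of those two steps, using hypothetical per-token vectors in place of real BertModel outputs:

```python
import math

def cls_pool(token_vectors):
    """CLS pooling: keep only the first token's vector ([CLS] in BERT)."""
    return token_vectors[0]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Hypothetical token outputs for one sentence: 3 tokens, 2 dimensions each.
tokens = [[3.0, 4.0], [1.0, 1.0], [0.0, 2.0]]

embedding = l2_normalize(cls_pool(tokens))
print(embedding)  # [0.6, 0.8]
```

Since the final embeddings have unit length, the dot product of two embeddings equals their cosine similarity.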

Evaluation

Metrics

Triplet

Dataset: Finance_Embedding_Metric
Evaluated with TripletEvaluator

Metric	Value
Cosine accuracy	0.9791
Dot-product accuracy	0.0209
Manhattan accuracy	0.978
Euclidean accuracy	0.9791
Max accuracy	0.9791
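
Triplet accuracy is the fraction of (anchor, positive, negative) triplets in which the anchor is scored closer to the positive than to the negative. The following sketch computes the cosine variant on four hypothetical triplets; it illustrates the metric's definition and is not the TripletEvaluator implementation itself.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def triplet_accuracy(triplets):
    """Share of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if cosine(anchor, positive) > cosine(anchor, negative)
    )
    return correct / len(triplets)

# Hypothetical embeddings, not real model output.
triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),  # correct
    ([0.0, 1.0], [0.1, 0.9], [1.0, 0.0]),  # correct
    ([1.0, 1.0], [1.0, 0.0], [0.9, 1.0]),  # wrong: the negative is closer
    ([1.0, 0.0], [0.8, 0.2], [0.2, 0.8]),  # correct
]
print(triplet_accuracy(triplets))  # 0.75
```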

Training Details

Training Dataset

Dataset: baconnier/finance_dataset_small_private at revision d7e6492
Size: 15,525 training samples
Columns: anchor, positive, and negative
Loss function: MultipleNegativesRankingLoss with the following parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
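
To show what these two parameters do, here is a simplified plain-Python sketch of a multiple-negatives ranking loss over (anchor, positive, negative) triplets. Each anchor is scored against every positive and every negative in the batch by cosine similarity, the scores are multiplied by `scale`, and cross-entropy is taken with the anchor's own positive as the target. This is an illustration under those assumptions, not the library's implementation, and the vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def mnr_loss(anchors, positives, negatives, scale=20.0):
    """Mean cross-entropy where anchor i must pick positive i among all candidates."""
    candidates = positives + negatives  # the target for anchor i sits at index i
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        log_sum_exp = math.log(sum(math.exp(z) for z in logits))
        total += log_sum_exp - logits[i]  # negative log-softmax of the target
    return total / len(anchors)

# Hypothetical batch of two triplets, not real training data.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
negatives = [[0.5, 0.5], [0.5, 0.5]]

print(mnr_loss(anchors, positives, negatives))        # small: positives rank first
print(mnr_loss(anchors, positives[::-1], negatives))  # large: positives mismatched
```

A larger `scale` sharpens the softmax, so small gaps in cosine similarity translate into larger differences in loss.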

Evaluation Dataset

Dataset: baconnier/finance_dataset_small_private at revision d7e6492
Size: 862 evaluation samples
Columns: anchor, positive, and negative
Loss function: MultipleNegativesRankingLoss with the following parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
num_train_epochs: 1
warmup_ratio: 0.1
bf16: True
batch_sampler: no_duplicates
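
The effect of `warmup_ratio: 0.1` can be sketched as a learning-rate multiplier over the run. The example assumes a linear warmup followed by linear decay (the scheduler type is not listed among the non-default settings, so the default is presumed), takes the 971 total steps from the training log, and rounds the warmup length up; all three are assumptions rather than facts stated by the source.

```python
import math

def lr_multiplier(step, total_steps, warmup_ratio):
    """Linear warmup to 1.0, then linear decay to 0.0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

TOTAL_STEPS = 971  # final step reported in the training log

for step in (0, 49, 98, 971):
    print(step, round(lr_multiplier(step, TOTAL_STEPS, 0.1), 3))
# 0 0.0
# 49 0.5
# 98 1.0
# 971 0.0
```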

Training Log

Epoch	Step	Training Loss	Loss	Finance_Embedding_Metric Max Accuracy
0.0103	10	0.9918	-	-
0.0206	20	0.8866	-	-
...	...	...	...	...
1.0	971	-	-	0.9791
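
The step counts in the log are consistent with the dataset size and batch size reported above. As a quick check, assuming a single device, no gradient accumulation, and a final partial batch that is kept:

```python
import math

TRAIN_SAMPLES = 15_525  # training set size
BATCH_SIZE = 16         # per_device_train_batch_size

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
print(steps_per_epoch)  # 971

# Epoch fraction shown in the log for a given step.
for step in (10, 20):
    print(step, round(step / steps_per_epoch, 4))
# 10 0.0103
# 20 0.0206
```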

Framework Versions

Python: 3.10.12
Sentence Transformers: 3.0.1
Transformers: 4.41.2
PyTorch: 2.3.0+cu121
Accelerate: 0.31.0
Datasets: 2.19.2
Tokenizers: 0.19.1

📄 License

The documentation does not mention any license information.

📖 Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply}, 
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}