Finance2_embedding_small_en-V1.5: an open-source model for financial semantic similarity and search

Finance2 Embedding Small En V1.5

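Developed by baconnier

A sentence embedding model fine-tuned from BAAI/bge-small-en-v1.5 on a financial dataset, intended for semantic textual similarity, semantic search, and related tasks.

Text Embedding

Safetensors

#Financial semantic embeddings #High-accuracy similarity computation #Multiple-negatives training

Downloads: 2,120

Released: 6/9/2024

Model Overview

The model maps sentences and paragraphs to a 384-dimensional dense vector space. It is particularly suited to financial text processing tasks such as semantic similarity computation, text classification, and clustering.

Model Features

Financial domain optimization

Fine-tuned on a specialized financial dataset for a better grasp of financial terminology and concepts.

Efficient vector representation

Converts text into 384-dimensional dense vectors, suitable for large-scale semantic search.

Support for multiple similarity measures

Supports several similarity measures, including cosine, dot product, Manhattan, and Euclidean.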
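
To make the four measures concrete, here is a minimal sketch in plain Python. The function names and the toy 3-dimensional vectors are illustrative assumptions (real embeddings from this model have 384 dimensions); it is not the library's implementation.

```python
import math

def dot(u, v):
    """Dot product: higher means more similar."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: dot product of the length-normalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def manhattan(u, v):
    """Manhattan (L1) distance: lower means more similar."""
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    """Euclidean (L2) distance: lower means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical unit-length toy vectors standing in for real embeddings.
u = [1.0, 0.0, 0.0]
v = [0.6, 0.8, 0.0]

print(cosine(u, v), dot(u, v), manhattan(u, v), euclidean(u, v))
```

For unit-length vectors, cosine and dot product give the same value, and the squared Euclidean distance equals 2 - 2 * cosine, so those three measures rank candidates identically. Manhattan distance can order them slightly differently.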

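Model Capabilities

Semantic textual similarity

Financial text feature extraction

Semantic search

Text classification

Clustering

Use Cases

Financial information retrieval

Financial question answering

Matches a user's financial question to the most relevant answer in a knowledge base.

High-accuracy semantic matching

Financial document processing

Financial document clustering

Automatically groups and organizes large collections of financial documents.

Improves document management efficiency.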

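🚀 Sentence Transformer based on BAAI/bge-small-en-v1.5

This is a Sentence Transformers model fine-tuned from BAAI/bge-small-en-v1.5 on the baconnier/finance_dataset_small_private dataset. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.

✨ Key Features

Fine-tuned from the pretrained model BAAI/bge-small-en-v1.5 on a financial dataset, so it handles finance-related text better.
Maps text to a 384-dimensional dense vector space, which makes tasks such as semantic similarity computation straightforward.
Supports multiple similarity functions, such as cosine similarity.

📦 Installation

First, install the Sentence Transformers library: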

pip install -U sentence-transformers

💻 Usage Example

Basic usage

from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("baconnier/Finance2_embedding_small_en-V1.5")
# Run inference
sentences = [
    'What is industrial production, and how is it measured by the Federal Reserve Board?',
    'Industrial production is a statistic determined by the Federal Reserve Board that measures the total output of all US factories and mines on a monthly basis. The Fed collects data from various government agencies and trade associations to calculate the industrial production index, which serves as an important economic indicator, providing insight into the health of the manufacturing and mining sectors.\nIndustrial production is a monthly statistic calculated by the Federal Reserve Board, measuring the total output of US factories and mines using data from government agencies and trade associations, serving as a key economic indicator for the manufacturing and mining sectors.',
    'Industrial production is a statistic that measures the output of factories and mines in the US. It is released by the Federal Reserve Board every quarter.\nIndustrial production measures factory and mine output, released quarterly by the Fed.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the pairwise similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
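
Once embeddings exist, semantic search reduces to ranking corpus vectors by their similarity to a query vector. The sketch below shows that ranking step in plain Python. The `search` helper, the corpus labels, and the 3-dimensional vectors are hypothetical stand-ins; in practice the vectors would come from `model.encode(...)`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def search(query_vec, corpus_vecs, top_k=2):
    """Return (corpus_index, score) pairs for the top_k most similar entries."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [(idx, score) for score, idx in scored[:top_k]]

# Hypothetical toy data: labels and vectors are illustrative only.
corpus = ["monthly industrial production answer",
          "quarterly industrial production answer",
          "unrelated passage"]
corpus_vecs = [[0.9, 0.1, 0.0], [0.7, 0.6, 0.1], [0.0, 0.2, 0.9]]
query_vec = [1.0, 0.0, 0.0]

hits = search(query_vec, corpus_vecs)
for idx, score in hits:
    print(f"{score:.3f}  {corpus[idx]}")
```

A linear scan like this is fine for small corpora. Larger collections usually call for an approximate nearest-neighbor index, which is outside the scope of this sketch.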

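📚 Detailed Documentation

Model Details

Model Description

Property	Details
Model type	Sentence Transformer
Base model	BAAI/bge-small-en-v1.5
Maximum sequence length	512 tokens
Output dimensionality	384 dimensions
Similarity function	Cosine similarity
Training dataset	baconnier/finance_dataset_small_private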

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
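
The last two modules can be read as: take the first token's vector (`pooling_mode_cls_token: True`), then scale it to unit length (`Normalize()`). A minimal sketch of those two steps follows. The helper names and the 4-dimensional token vectors are hypothetical; the real model produces 384-dimensional vectors per token.

```python
import math

def cls_pool(token_embeddings):
    """Use the first token's vector ([CLS]) as the sentence embedding."""
    return list(token_embeddings[0])

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

# Hypothetical per-token outputs for a 3-token input, 4 dimensions each.
token_embeddings = [
    [3.0, 0.0, 4.0, 0.0],  # [CLS] token
    [1.0, 2.0, 0.5, 0.1],
    [0.2, 0.3, 0.9, 1.1],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)  # [0.6, 0.0, 0.8, 0.0]
```

Because every output vector has unit length, cosine similarity and dot product give the same score for any pair of embeddings from this model.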

Evaluation

Metrics

Triplet

Dataset: Finance_Embedding_Metric
Evaluated with TripletEvaluator

Metric	Value
Cosine accuracy	0.9791
Dot-product accuracy	0.0209
Manhattan accuracy	0.978
Euclidean accuracy	0.9791
Max accuracy	0.9791
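
Triplet accuracy is the fraction of (anchor, positive, negative) triplets in which the anchor is closer to the positive than to the negative. The sketch below computes it on hypothetical unit-length toy triplets. It is an illustration of the metric, not the TripletEvaluator source. The reported dot-product accuracy (0.0209) is exactly 1 - 0.9791. Since the model's outputs are normalized, dot product and cosine similarity coincide, so one plausible reading is that the dot score was compared with the inequality reversed, as if it were a distance. That reading is an assumption, not something the source states. The `reversed_comparison` function below reproduces the effect.

```python
def dot(u, v):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def triplet_accuracy(triplets, is_hit):
    """Fraction of triplets for which is_hit(anchor, positive, negative) holds."""
    return sum(1 for a, p, n in triplets if is_hit(a, p, n)) / len(triplets)

def similarity_comparison(a, p, n):
    """Correct for a similarity score: the positive should score higher."""
    return dot(a, p) > dot(a, n)

def reversed_comparison(a, p, n):
    """Treats the score as a distance (hypothetical illustration of the flip)."""
    return dot(a, p) < dot(a, n)

# Hypothetical unit-length toy triplets: (anchor, positive, negative).
triplets = [
    ([1.0, 0.0], [0.8, 0.6], [0.0, 1.0]),   # positive is closer
    ([0.0, 1.0], [0.6, 0.8], [1.0, 0.0]),   # positive is closer
    ([1.0, 0.0], [0.6, 0.8], [0.8, 0.6]),   # negative is closer (a miss)
    ([0.6, 0.8], [0.8, 0.6], [-1.0, 0.0]),  # positive is closer
]

acc = triplet_accuracy(triplets, similarity_comparison)
acc_reversed = triplet_accuracy(triplets, reversed_comparison)
print(acc, acc_reversed)  # 0.75 0.25
```

When no triplet is an exact tie, the two accuracies always sum to 1, which matches the pattern in the table.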

Training Details

Training Dataset

Dataset: baconnier/finance_dataset_small_private at revision d7e6492
Size: 15,525 training samples
Columns: anchor, positive, and negative
Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
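
To show what these two parameters do, here is a minimal plain-Python sketch of a multiple-negatives ranking objective. Each anchor is scored against every positive and every negative in the batch using cosine similarity multiplied by `scale`, and cross-entropy is taken with the anchor's own positive as the target. The function name `mnr_loss` and the toy 2-dimensional vectors are assumptions for illustration. The library's implementation operates on batched tensors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def mnr_loss(anchors, positives, negatives, scale=20.0):
    """Mean cross-entropy where anchor i must pick positive i among all candidates."""
    # Candidates: every in-batch positive, followed by every explicit negative.
    candidates = positives + negatives
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[i]  # negative log-softmax of the target
    return total / len(anchors)

# Hypothetical toy batch of two (anchor, positive, negative) rows.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
negatives = [[0.5, 0.5], [0.5, 0.5]]

print(mnr_loss(anchors, positives, negatives))
```

A larger `scale` sharpens the softmax, so small differences in cosine similarity translate into larger differences in loss.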

Evaluation Dataset

Dataset: baconnier/finance_dataset_small_private at revision d7e6492
Size: 862 evaluation samples
Columns: anchor, positive, and negative
Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}

Training Hyperparameters

Non-default hyperparameters

eval_strategy: steps
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
num_train_epochs: 1
warmup_ratio: 0.1
bf16: True
batch_sampler: no_duplicates
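
These settings, together with the dataset size, determine the length of the run. The arithmetic below is a rough sketch. It assumes a single device, no gradient accumulation, and a kept final partial batch, none of which the source states. Rounding the warmup step count up is likewise an assumption about how the trainer converts the ratio.

```python
import math

train_samples = 15_525   # training set size reported for the dataset
batch_size = 16          # per_device_train_batch_size
epochs = 1               # num_train_epochs
warmup_ratio = 0.1       # warmup_ratio

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Assumption: the warmup step count is rounded up.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(total_steps)   # 971
print(warmup_steps)  # 98
```

The computed total of 971 matches the final step in the training log, and 10 / 971 ≈ 0.0103 matches the epoch value logged at step 10.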

Training Logs

Epoch	Step	Training Loss	Loss	Finance_Embedding_Metric max accuracy
0.0103	10	0.9918	-	-
0.0206	20	0.8866	-	-
...	...	...	...	...
1.0	971	-	-	0.9791

Framework Versions

Python: 3.10.12
Sentence Transformers: 3.0.1
Transformers: 4.41.2
PyTorch: 2.3.0+cu121
Accelerate: 0.31.0
Datasets: 2.19.2
Tokenizers: 0.19.1

📄 License

No license information is given in the documentation.

📖 Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply}, 
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}