finance-embeddings-investopedia: an open-source finance embedding model, free to deploy for financial semantic search

Finance Embeddings Investopedia

Developed by FinLang

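This is the Investopedia embedding model that the FinLang team developed for finance applications. It is fine-tuned from BAAI/bge-base-en-v1.5 and maps sentences and paragraphs to a 768-dimensional dense vector space, which makes it suitable for tasks such as semantic search in the finance domain.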

Text embedding

Safetensors

#FinancialSemanticEmbedding #RAGOptimization #InvestopediaCorpus

Downloads: 21.25k

Released: 4/22/2024

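Model overview

This embedding model is trained on an Investopedia finance dataset and designed for finance applications. It is suited to clustering and semantic search tasks in RAG applications.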

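Model features

Finance-domain optimization

Fine-tuned specifically on finance-domain data so that it handles financial terminology and concepts better than a general-purpose embedding model

High-dimensional vector space

Maps text to a 768-dimensional dense vector space that captures semantic information

RAG application support

Well suited to semantic search and clustering tasks in retrieval-augmented generation (RAG) applications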

Model capabilities

Text embedding

Semantic similarity computation

Financial text feature extraction

Financial document retrieval

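Use cases

Financial information retrieval

Financial knowledge base search

Implements semantic search over a financial knowledge base, with the aim of improving retrieval accuracy

Matches financial terms and concepts more precisely

Financial question answering

Used to build question-answering systems for the finance domain, improving how well questions are matched to answers

The example test in the quick start reports a similarity score of 0.862

Financial document processing

Financial document clustering

Performs semantic clustering analysis on financial documents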

🚀 FinLang/finance-embeddings-investopedia

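This is the Investopedia embedding model built by the FinLang team for finance applications. It is trained on the finance dataset that the team open-sourced at https://huggingface.co/datasets/FinLang/investopedia-embedding-dataset .

The model is an embedding model fine-tuned from BAAI/bge-base-en-v1.5. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search in RAG applications.

This project is for research purposes only. Third-party datasets may be subject to additional terms and conditions under their associated licenses.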

🚀 Quick start

The model can be used in several ways, described below.

LlamaIndex

When indexing for a finance RAG application, simply specify the FinLang embedding model.

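# llama-index >= 0.10 (needs: pip install llama-index-embeddings-huggingface).
# On older releases the import is: from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="FinLang/finance-embeddings-investopedia")  # repository id from this card's heading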

Sentence-Transformers

Using the model is straightforward once sentence-transformers is installed (see https://huggingface.co/sentence-transformers ).

pip install -U sentence-transformers

You can then use the model as follows:

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

# Repository id as given in this card's heading.
model = SentenceTransformer("FinLang/finance-embeddings-investopedia")
embeddings = model.encode(sentences)  # one 768-dimensional vector per sentence
print(embeddings)
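
Once passages and a query have been encoded, semantic search reduces to ranking the passage vectors by their similarity to the query vector. The following is a minimal sketch of that ranking step in plain Python, not the authors' code. The `cosine` and `top_k` helpers and the toy 3-dimensional vectors are illustrative assumptions standing in for the model's 768-dimensional output.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy vectors; real ones would come from model.encode(...).
doc_vecs = [
    [0.9, 0.1, 0.0],  # passage 0
    [0.0, 1.0, 0.0],  # passage 1
    [0.7, 0.3, 0.1],  # passage 2
]
query_vec = [1.0, 0.0, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
print(results)  # passages 0 and 2 rank highest for this query
```

In practice `doc_vecs` would be `model.encode(passages)` and `query_vec` would be `model.encode(query)`. For larger corpora, `sentence_transformers.util.semantic_search` or a vector index is likely a better fit than this linear scan.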

Similarity test example

from sentence_transformers import SentenceTransformer

# Repository id as given in this card's heading.
model = SentenceTransformer("FinLang/finance-embeddings-investopedia")

# query_1 is a question; query_2 is a matching answer passage.
query_1 = "What is a potential concern with allowing someone else to store your cryptocurrency keys, and is it possible to decrypt a private key?"
query_2 = "A potential concern is that the entity holding your keys has control over your cryptocurrency in a custodial relationship. While it is theoretically possible to decrypt a private key, with current technology, it would take centuries or millennia for the 115 quattuorvigintillion possibilities. Most hacks and thefts occur in wallets, where private keys are stored."

embedding_1 = model.encode(query_1)
embedding_2 = model.encode(query_2)
# Element-wise product summed over all dimensions, i.e. the dot product of the two embeddings.
scores = (embedding_1*embedding_2).sum()
print(scores) # the model card reports 0.862
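
The score in the test above is a dot product. A dot product equals cosine similarity only when both vectors have unit length, so reading it as a bounded similarity score relies on the embeddings being L2-normalized. Whether this model's pipeline normalizes its output is an assumption here, not something the card states; passing `normalize_embeddings=True` to `model.encode` removes the doubt. The sketch below uses toy vectors and illustrative helper names to show the relationship.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity: dot product divided by both vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Toy vectors with lengths 5 and 3, chosen so the arithmetic is easy to follow.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

raw_dot = dot(a, b)                               # depends on vector lengths
unit_dot = dot(l2_normalize(a), l2_normalize(b))  # lengths divided out
cos = cosine(a, b)
print(raw_dot, unit_dot, cos)
```

Here the raw dot product is 11, while both the normalized dot product and the cosine similarity equal 11/15.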

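✨ Key features

Trained on an open dataset: the model is trained on the finance dataset open-sourced at https://huggingface.co/datasets/FinLang/investopedia-embedding-dataset , so the training data is accessible and transparent.
Fine-tuned: starting from BAAI/bge-base-en-v1.5, the model is fine-tuned to fit finance-domain applications better.
Multi-task use: it maps sentences and paragraphs to a 768-dimensional dense vector space and is suitable for tasks such as clustering and semantic search.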
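
Clustering with embeddings means grouping texts whose vectors point in similar directions. The sketch below uses a simple greedy threshold grouping, chosen only because it fits in a few lines; it is not k-means and not a method the card attributes to the authors. The function name, the 0.8 threshold, and the toy vectors are all illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_clusters(vectors, threshold=0.8):
    """Put each vector in the first cluster whose seed vector is similar enough."""
    clusters = []  # each cluster is a list of indices; the first index is its seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            # No existing seed was close enough, so this vector starts a new cluster.
            clusters.append([i])
    return clusters

# Hypothetical toy vectors; real ones would come from model.encode(documents).
vectors = [
    [1.0, 0.0, 0.1],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
]
clusters = greedy_clusters(vectors, threshold=0.8)
print(clusters)  # [[0, 2], [1, 3]]
```

The result depends on input order and on the threshold, so for real document sets a library clusterer such as scikit-learn's `KMeans` run on the encoded vectors is usually the more robust choice.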

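📚 Detailed documentation

Evaluation results

We evaluated the model on the similarity of unseen sentence pairs and on the dissimilarity of unseen shuffled sentence pairs. The evaluation suite contains sentence pairs from the following sources: Investopedia (to test proficiency in the finance domain), and Gooaq, MSMARCO, stackexchange_duplicate_questions_title_title, and yahoo_answers_title_answer (to evaluate how well the model avoids forgetting after fine-tuning).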
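
The idea behind this evaluation can be sketched as follows: matched pairs should score high, and the same sentences paired with the wrong partner should score low. This is an illustration under stated assumptions, not the authors' evaluation code; the card does not give the exact metric. The helper names and toy vectors are hypothetical, and a fixed rotation stands in for random shuffling so that the result is deterministic.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_pair_similarity(left, right):
    """Average cosine similarity over position-aligned pairs."""
    return sum(cosine(l, r) for l, r in zip(left, right)) / len(left)

def rotate(items, shift=1):
    """Mis-align pairs deterministically: item i is paired with item i + shift."""
    return items[shift:] + items[:shift]

# Hypothetical toy embeddings for three question/answer pairs.
questions = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
answers = [[0.9, 0.2, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

matched = mean_pair_similarity(questions, answers)             # true pairs
mismatched = mean_pair_similarity(questions, rotate(answers))  # wrong partners
print(matched, mismatched)
```

A large gap between the two averages, on both the finance pairs and the general-domain pairs, would be consistent with a model that has gained domain proficiency without forgetting what the base model knew.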