finance-embeddings-investopedia: an open-source financial embedding model, free to deploy for financial semantic search

Finance Embeddings Investopedia

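Developed by FinLang

This is the Investopedia embedding model developed by the FinLang team for financial applications. Fine-tuned from BAAI/bge-base-en-v1.5, it maps sentences and paragraphs to a 768-dimensional dense vector space and is suited to tasks such as semantic search in the financial domain.

Text embedding

Safetensors

#financial-semantic-embeddings #RAG-optimization #Investopedia-corpus

Downloads: 21.25k

Released: 4/22/2024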

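Model Overview

An embedding model trained on the Investopedia finance dataset and designed for financial applications. It is suited to clustering and semantic search tasks in RAG applications.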

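Model Features

Optimized for the financial domain

Fine-tuned specifically on financial-domain data, so it handles financial terms and concepts better than a general-purpose base model.

High-dimensional vector space

Maps text to a 768-dimensional dense vector space that captures rich semantic information.

RAG application support

Particularly well suited to semantic search and clustering in retrieval-augmented generation (RAG) applications.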

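Model Capabilities

Text embedding

Semantic similarity computation

Feature extraction from financial text

Financial document retrieval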

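Use Cases

Financial information retrieval

Financial knowledge base search

Implements semantic search over a financial knowledge base to improve retrieval accuracy.

Matches financial terms and concepts more accurately.

Financial question answering

Used to build question-answering systems for the financial domain, improving how well questions are matched to answers.

An example test shows a similarity score of 0.862.

Financial document processing

Financial document clustering

Performs semantic clustering analysis on financial documents.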

🚀 FinLang/finance-embeddings-investopedia

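This is the Investopedia embedding model built by the FinLang team for financial applications. It is trained on the finance dataset the team open-sourced at https://huggingface.co/datasets/FinLang/investopedia-embedding-dataset.

The model is an embedding model fine-tuned from BAAI/bge-base-en-v1.5. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search in RAG applications.

This project is for research purposes only. Third-party datasets may be subject to additional terms and conditions under their associated licenses.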

🚀 Quick Start

The model can be used in several ways, described below.

LlamaIndex

In the indexing step of a financial RAG application, simply specify the FinLang embedding.

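# llama-index >= 0.10 (requires the llama-index-embeddings-huggingface package);
# older releases used `from llama_index.embeddings import HuggingFaceEmbedding`
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="FinLang/finance-embeddings-investopedia")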

Sentence-Transformers

Using the model is straightforward once sentence-transformers is installed (see https://huggingface.co/sentence-transformers):

pip install -U sentence-transformers

You can then use the model as follows:

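from sentence_transformers import SentenceTransformer

# Sentences to embed
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model from the Hugging Face Hub (repo id as given in this card's heading)
model = SentenceTransformer('FinLang/finance-embeddings-investopedia')
# encode() returns one 768-dimensional vector per sentence
embeddings = model.encode(sentences)
print(embeddings)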
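
Once sentences are encoded, semantic search reduces to ranking stored vectors by their similarity to a query vector. The following is a minimal sketch, not part of the original model card: the helper names `cosine` and `top_k` are hypothetical, and tiny 3-dimensional vectors stand in for the 768-dimensional output of `model.encode` so the snippet runs without downloading the model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k documents most similar to the query."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy stand-ins for embeddings (assumption: real vectors would come from model.encode).
docs = ["bond yields", "crypto custody", "private key storage"]
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.2], [0.1, 0.9, 0.4]]
query_vec = [0.0, 1.0, 0.3]  # pretend embedding of a question about crypto keys

results = top_k(query_vec, doc_vecs, k=2)
for idx, score in results:
    print(f"{docs[idx]}: {score:.3f}")
```

In practice `doc_vecs` would be `model.encode(docs)` and `query_vec` would be `model.encode(query)`; for large collections a vector index would usually replace the linear scan.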

Example: testing the model

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("FinLang/finance-embeddings-investopedia")

# A question and a candidate answer (query_2 holds the answer text)
query_1 = "What is a potential concern with allowing someone else to store your cryptocurrency keys, and is it possible to decrypt a private key?"
query_2 = "A potential concern is that the entity holding your keys has control over your cryptocurrency in a custodial relationship. While it is theoretically possible to decrypt a private key, with current technology, it would take centuries or millennia for the 115 quattuorvigintillion possibilities. Most hacks and thefts occur in wallets, where private keys are stored."

embedding_1 = model.encode(query_1)
embedding_2 = model.encode(query_2)
# Dot product of the two embeddings; equals cosine similarity when both vectors have unit length
scores = (embedding_1*embedding_2).sum()
print(scores) # 0.862
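
The score above is a plain dot product. It coincides with cosine similarity only when both embeddings have unit length, which is an assumption here (bge-style sentence-transformers models typically include a normalization layer, but this is worth verifying for your installed version). A small sketch with made-up 2-dimensional vectors, not model output, illustrates the relationship:

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]

raw_dot = dot(a, b)                          # depends on vector length
cos = cosine(a, b)                           # length-independent
unit_dot = dot(normalize(a), normalize(b))   # dot product after normalization

print(raw_dot, round(cos, 6), round(unit_dot, 6))
```

After normalization the dot product and the cosine agree, while the raw dot product does not. With sentence-transformers, `util.cos_sim(embedding_1, embedding_2)` computes the cosine directly and does not rely on the unit-length assumption.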

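✨ Key Features

Trained on an open-source dataset: trained on the finance dataset open-sourced at https://huggingface.co/datasets/FinLang/investopedia-embedding-dataset, so the training data is accessible and transparent.
Fine-tuned: fine-tuned from BAAI/bge-base-en-v1.5 to better fit the needs of financial-domain applications.
Multi-task applicability: maps sentences and paragraphs to a 768-dimensional dense vector space, suitable for clustering, semantic search, and similar tasks.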
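
As an illustration of the clustering use mentioned above, here is a minimal sketch under stated assumptions. The model card does not prescribe a clustering algorithm; greedy threshold clustering is a stand-in chosen for brevity, the function name `greedy_cluster` and the 0.8 threshold are hypothetical, and 2-dimensional toy vectors replace real 768-dimensional embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def greedy_cluster(vectors, threshold=0.8):
    """Assign each vector to the first cluster whose seed is similar enough,
    otherwise start a new cluster with this vector as its seed."""
    seeds = []   # one representative vector per cluster
    labels = []  # cluster id for each input vector, in input order
    for v in vectors:
        for cluster_id, seed in enumerate(seeds):
            if cosine(v, seed) >= threshold:
                labels.append(cluster_id)
                break
        else:
            # No existing seed was close enough: open a new cluster.
            seeds.append(v)
            labels.append(len(seeds) - 1)
    return labels

# Toy stand-ins for document embeddings (assumption: real ones come from model.encode).
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = greedy_cluster(vectors)
print(labels)
```

The result depends on input order and on the threshold; for real workloads an established algorithm such as k-means or agglomerative clustering over the embeddings is the more common choice.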

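📚 Detailed Documentation

Evaluation Results

We evaluated the model on the similarity of unseen sentence pairs and on the dissimilarity of unseen shuffled sentence pairs. The evaluation suite contains sentence pairs from Investopedia (to test proficiency in the financial domain) and from Gooaq, MSMARCO, stackexchange_duplicate_questions_title_title, and yahoo_answers_title_answer (to assess whether the model avoids forgetting after fine-tuning).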
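
The idea behind this kind of evaluation can be sketched as follows. This is an illustrative reconstruction, not the authors' evaluation code: the function name `mean_pair_similarity` is hypothetical, the vectors are toy values rather than model output, and a rotation is used in place of a random shuffle so that every pair is guaranteed to be mismatched.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_pair_similarity(left, right):
    """Average cosine similarity over position-aligned pairs."""
    return sum(cosine(a, b) for a, b in zip(left, right)) / len(left)

# Toy embeddings: answers[i] is the true match for questions[i].
questions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
answers = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

# Rotate the answers by one position so every question gets a wrong answer.
shuffled = answers[1:] + answers[:1]

matched = mean_pair_similarity(questions, answers)      # should be high
mismatched = mean_pair_similarity(questions, shuffled)  # should be low

print(f"matched: {matched:.3f}  mismatched: {mismatched:.3f}")
```

A model that has learned the domain should score true pairs high and shuffled pairs low; running the same comparison on general-purpose sets gives a rough check for forgetting after fine-tuning.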