GIST-small-Embedding-v0: an open-source text embedding model optimized for encoding retrieval queries, free to use

GIST Small Embedding V0

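Developed by avsolatorio

A text embedding model fine-tuned from BAAI/bge-small-en-v1.5 on the MEDI dataset combined with MTEB classification task datasets, with improved query encoding for retrieval tasks.

Text Embedding

Safetensors

Language: English | License: MIT | #instruction-free-embedding #cross-task-fine-tuning #semantic-similarity

Downloads: 945.68k

Released: 2/3/2024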

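Model Overview

The model generates embeddings without any instruction input, so queries can be encoded directly. It is suited to text retrieval and similarity computation tasks.

Model Features

No instruction input required

No prompt needs to be constructed when generating embeddings; encode the query directly.

Trained on a combination of datasets

Fine-tuned on the MEDI dataset combined with MTEB classification task datasets to improve performance.

Optimized for retrieval

Optimized for retrieval tasks, with significant gains on some of them.

Model Capabilities

Text embedding generation

Text similarity computation

Retrieval task optimization

Use Cases

Information retrieval

Document retrieval

Quickly retrieve relevant documents or passages.

Significant improvement on some MTEB tasks

Similarity computation

Text similarity analysis

Compute the semantic similarity between two pieces of text.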

🚀 GIST small Embedding v0

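GIST small Embedding v0 is a fine-tuned text embedding model built on the Sentence Transformers library that generates embeddings without any additional instructions. It was fine-tuned on a compiled dataset, improves significantly on some tasks, and can be used for natural language processing tasks such as text retrieval and similarity computation.

🚀 Quick Start

The model can be loaded easily with the Sentence Transformers library.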

import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

revision = None  # Replace with a specific revision to ensure reproducibility if the model is updated.

model = SentenceTransformer("avsolatorio/GIST-small-Embedding-v0", revision=revision)

texts = [
    "Illustration of the REaLTabFormer model. The left block shows the non-relational tabular data model using GPT-2 with a causal LM head. In contrast, the right block shows how a relational dataset's child table is modeled using a sequence-to-sequence (Seq2Seq) model. The Seq2Seq model uses the observations in the parent table to condition the generation of the observations in the child table. The trained GPT-2 model on the parent table, with weights frozen, is also used as the encoder in the Seq2Seq model.",
    "Predicting human mobility holds significant practical value, with applications ranging from enhancing disaster risk planning to simulating epidemic spread. In this paper, we present the GeoFormer, a decoder-only transformer model adapted from the GPT architecture to forecast human mobility.",
    "As the economies of Southeast Asia continue adopting digital technologies, policy makers increasingly ask how to prepare the workforce for emerging labor demands. However, little is known about the skills that workers need to adapt to these changes"
]

# Compute embeddings
embeddings = model.encode(texts, convert_to_tensor=True)

# Compute the cosine similarity for each pair of sentences
scores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)

print(scores.cpu().numpy())

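✨ Key Features

No instructions needed: The model does not require any instruction when generating embeddings, so queries for retrieval tasks can be encoded directly without crafting one.
Fine-tuning gains: Fine-tuning on the compiled dataset yields significant improvements over the base model on some tasks, though performance may degrade on others.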
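
To illustrate what instruction-free retrieval looks like in practice, here is a minimal sketch that ranks documents against a query by cosine similarity. The helper names `cosine` and `rank_documents` and the toy 3-dimensional vectors are assumptions made for illustration. In real use the vectors would come from `model.encode(...)` applied to the raw query and document strings, with no instruction prefix.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, most similar first."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for embeddings (hypothetical values). With the real model:
#   query_vec = model.encode("how is human mobility forecast?")
#   doc_vecs = model.encode(texts)
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]

ranking = rank_documents(query_vec, doc_vecs)
print([i for i, _ in ranking])
```

The query string is embedded exactly like a document, which is the practical meaning of "no instructions needed": there is no separate query prompt to maintain.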

📦 Installation

Install the Sentence Transformers library with the following command:

pip install sentence-transformers

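📚 Documentation

Data

The dataset used is a compilation of the MEDI and MTEB Classification training datasets. Third-party datasets may be subject to additional terms and conditions under their associated licenses. A HuggingFace Dataset version of the compiled dataset, along with the specific revision used to train the model, is available:

Dataset: avsolatorio/medi-data-mteb_avs_triplets
Revision: 238a0499b6e6b690cc64ea56fde8461daa8341bb

The dataset contains a task_type key, which can be used to select only the MTEB classification tasks (prefixed with mteb_).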
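
As a sketch of the task_type filter described above, the snippet below keeps only records whose task_type starts with mteb_. The sample records, the task names, and the "query" field are hypothetical; only the task_type key and the mteb_ prefix come from the dataset description.

```python
from typing import Dict, Iterable, List

MTEB_PREFIX = "mteb_"


def select_mteb_tasks(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only records whose task_type marks an MTEB classification task."""
    return [r for r in records if r.get("task_type", "").startswith(MTEB_PREFIX)]


# Hypothetical records: only the task_type key is documented; the "query"
# field and the task names are invented for this illustration.
sample = [
    {"task_type": "mteb_example_classification", "query": "great product"},
    {"task_type": "some_medi_task", "query": "what is an embedding?"},
    {"task_type": "mteb_another_classification", "query": "card was declined"},
]

mteb_only = select_mteb_tasks(sample)
print(len(mteb_only))

# With the `datasets` library, the same predicate could be passed to
# Dataset.filter after loading the repository with load_dataset(...,
# revision=...), pinned to the revision hash listed above.
```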

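The MEDI dataset was published in the following paper: One Embedder, Any Task: Instruction-Finetuned Text Embeddings.

MTEB benchmark results for the GIST embedding model compared with the base model indicate that the fine-tuning dataset changed the model considerably, producing significant improvements on some tasks and degraded performance on others.

The retrieval performance on the TRECCOVID task is particularly notable. The fine-tuning dataset does not contain significant knowledge about COVID-19, which may have caused the observed performance degradation. The paper details some evidence that the thematic coverage of the fine-tuning data affects downstream performance.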

Training Parameters

The following training parameters were used to fine-tune the model:

Epochs = 40
Warmup ratio = 0.1
Learning rate = 5e-6
Batch size = 16
Checkpoint step = 102000
Contrastive loss temperature = 0.01
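
To show what the contrastive loss temperature controls, the sketch below implements a generic contrastive (InfoNCE-style) loss for a single query, where similarities are divided by the temperature before the softmax. This is an illustrative assumption, not the GISTEmbed training code: it omits the guided selection of negatives described in the paper, and the function name and similarity values are hypothetical.

```python
import math
from typing import Sequence


def contrastive_loss(
    sims: Sequence[float], positive_index: int, temperature: float = 0.01
) -> float:
    """Cross-entropy over temperature-scaled similarities for one query.

    sims holds the cosine similarity of the query to every candidate;
    positive_index marks the matching candidate.
    """
    logits = [s / temperature for s in sims]
    # Log-sum-exp with the max subtracted for numerical stability.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[positive_index]


sims = [0.80, 0.75, 0.30]  # hypothetical: positive first, then two negatives

sharp = contrastive_loss(sims, positive_index=0, temperature=0.01)
soft = contrastive_loss(sims, positive_index=0, temperature=1.0)
print(f"temperature=0.01: {sharp:.4f}, temperature=1.0: {soft:.4f}")
```

At a temperature of 0.01, a similarity gap of 0.05 becomes a logit gap of 5, so the loss reacts strongly to small differences between the positive and the hardest negative.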

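Evaluation

The model was evaluated using the MTEB Evaluation suite.

🔧 Technical Details

The model is based on BAAI/bge-small-en-v1.5 and fine-tuned on the MEDI and MTEB Classification training datasets. No additional instructions were used during fine-tuning; text is encoded directly to produce embeddings. For the technical paper, see: GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning.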

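📄 License

This project is released under the MIT License.

📖 Citation

If you use GISTEmbed or the datasets we have published in your project or research, please cite our work. 🤗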

@article{solatorio2024gistembed,
    title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
    author={Aivin V. Solatorio},
    journal={arXiv preprint arXiv:2402.16829},
    year={2024},
    URL={https://arxiv.org/abs/2402.16829},
    eprint={2402.16829},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

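🙏 Acknowledgements

This work is supported by the "KCP IV - Exploring Data Use in the Development Economics Literature using Large Language Models (AI and LLMs)" project funded by the Knowledge for Change Program (KCP) of the World Bank - RA-P503405-RESE-TF0C3444.

The findings, interpretations, and conclusions expressed in this material are entirely those of the authors and do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.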