🚀 GIST Large Embedding v0
GIST Large Embedding v0 is a fine-tuned text embedding model. It fine-tunes a base model on a curated dataset to produce high-quality text embeddings, performs well on tasks such as text retrieval and classification, and requires no additional instruction prompt to generate embeddings.
✨ Key Features
Generates embeddings directly, with no instruction prefix required for queries.
Fine-tuned from BAAI/bge-large-en-v1.5 on a compilation of the MEDI and MTEB Classification training datasets.
Performs well on text retrieval and classification tasks, as measured with the MTEB Evaluation suite.
📦 Installation
The model can be loaded easily with the Sentence Transformers library (installable from PyPI as sentence-transformers). Example:
from sentence_transformers import SentenceTransformer

# Pin a specific model revision if desired; None uses the latest available revision.
revision = None
model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0", revision=revision)
💻 Usage Examples
Basic Usage
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Load the model; set revision to pin a specific model version.
revision = None
model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0", revision=revision)
texts = [
"Illustration of the REaLTabFormer model. The left block shows the non-relational tabular data model using GPT-2 with a causal LM head. In contrast, the right block shows how a relational dataset's child table is modeled using a sequence-to-sequence (Seq2Seq) model. The Seq2Seq model uses the observations in the parent table to condition the generation of the observations in the child table. The trained GPT-2 model on the parent table, with weights frozen, is also used as the encoder in the Seq2Seq model.",
"Predicting human mobility holds significant practical value, with applications ranging from enhancing disaster risk planning to simulating epidemic spread. In this paper, we present the GeoFormer, a decoder-only transformer model adapted from the GPT architecture to forecast human mobility.",
"As the economies of Southeast Asia continue adopting digital technologies, policy makers increasingly ask how to prepare the workforce for emerging labor demands. However, little is known about the skills that workers need to adapt to these changes"
]
# Encode all texts into a single tensor of embeddings.
embeddings = model.encode(texts, convert_to_tensor=True)

# Pairwise cosine similarity between every pair of embeddings.
scores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
print(scores.cpu().numpy())
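Because the model needs no instruction prefix, the same encode call works for both queries and documents. Below is a minimal retrieval-style sketch that continues from the snippet above; the query string is illustrative, and util.cos_sim from Sentence Transformers is used here as an equivalent way to compute cosine similarity.

from sentence_transformers import util

# Hypothetical query; no instruction prefix is needed.
query = "transformer models for synthetic tabular data"

query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(texts, convert_to_tensor=True)

# Rank the documents by cosine similarity to the query.
scores = util.cos_sim(query_emb, doc_embs)[0]
best = int(scores.argmax())
print(f"Best match (score={scores[best].item():.3f}): {texts[best][:80]}...")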
🔧 Technical Details
The model is fine-tuned from BAAI/bge-large-en-v1.5 on a compilation of the MEDI dataset and the MTEB Classification training datasets. Fine-tuning brings significant improvements on some tasks but degrades performance on others, such as TRECCOVID. The results suggest that the topical coverage of the fine-tuning data affects downstream performance.
Training Parameters
Epochs = 40
Warmup ratio = 0.1
Learning rate = 5e-6
Batch size = 16
Checkpoint step = 171000
Contrastive loss temperature = 0.01
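For intuition, the sketch below shows a generic in-batch contrastive (InfoNCE-style) loss using the temperature of 0.01 listed above. It is a simplified illustration, not the exact GISTEmbed training objective, which additionally uses guided in-sample selection of negatives as described in the paper.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, pos_emb, temperature=0.01):
    # Each query's matching row is its positive; all other rows act as in-batch negatives.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature  # (batch, batch) cosine-similarity matrix scaled by temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Illustrative call with random vectors of the bge-large embedding size (1024) and batch size 16.
loss = in_batch_contrastive_loss(torch.randn(16, 1024), torch.randn(16, 1024))
print(loss.item())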
📚 Documentation
Data
The training data is a compilation of the MEDI dataset and the MTEB Classification training datasets. Third-party datasets may be subject to additional terms and conditions under their associated licenses. The compiled dataset is published as a HuggingFace dataset, together with the specific revision used to train the model.
The dataset contains a task_type key that can be used to select only the MTEB classification tasks (prefixed with mteb_).
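As a sketch of how that key can be used, assuming the compiled dataset is loaded from the HuggingFace Hub (DATASET_ID below is a placeholder, since the exact repository id is not reproduced here):

from datasets import load_dataset

DATASET_ID = "<compiled MEDI + MTEB classification dataset>"  # placeholder repository id

ds = load_dataset(DATASET_ID, split="train")

# Keep only rows whose task_type starts with "mteb_" (the MTEB classification tasks).
mteb_only = ds.filter(lambda row: row["task_type"].startswith("mteb_"))
print(mteb_only)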
The MEDI dataset was published in the paper: One Embedder, Any Task: Instruction-Finetuned Text Embeddings.
Evaluation
The model was evaluated using the MTEB Evaluation suite.
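A minimal sketch of running a single MTEB task with the mteb package follows; the chosen task and output folder are illustrative, and the exact API may differ between mteb versions.

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0")

# Run one illustrative classification task; the full suite covers many more tasks.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/GIST-large-Embedding-v0")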
📄 License
This project is released under the MIT License.
📖 Citation
If you use GISTEmbed or the datasets we published in your projects or research, please cite our work:
@article{solatorio2024gistembed,
    title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
    author={Aivin V. Solatorio},
    journal={arXiv preprint arXiv:2402.16829},
    year={2024},
    URL={https://arxiv.org/abs/2402.16829},
    eprint={2402.16829},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
🙏 Acknowledgements
This work was supported by the "KCP IV - Exploring Data Use in the Development Economics Literature using Large Language Models (AI and LLMs)" project funded by the World Bank's Knowledge for Change Program (KCP).
The findings, interpretations, and conclusions expressed in this material are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, the Executive Directors of the World Bank, or the governments they represent.