GIST-Embedding-v0: an open-source sentence embedding model, free to use for sentence similarity and feature extraction

GIST Embedding V0

Developed by avsolatorio

GIST-Embedding-v0 is a sentence embedding model built on sentence-transformers. It is intended mainly for sentence similarity and feature extraction.

Text embedding

Safetensors

Language: English | License: MIT | #sentence-similarity #multi-task-evaluation #high-precision-feature-extraction

Downloads: 252.21k

Published: 4/25/2025

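Model overview

The model converts sentences into high-dimensional vector representations that can be used in a range of natural language processing tasks, such as sentence similarity, text classification, and information retrieval.

Model features

High-performance sentence embeddings

Performs well on multiple benchmarks and captures sentence semantics accurately.

Versatile

Supports a range of natural language processing tasks, including classification, clustering, and retrieval.

Efficient feature extraction

Quickly converts sentences into high-dimensional vectors for downstream processing and analysis.

Model capabilities

Sentence similarity

Text classification

Information retrieval

Text clustering

Feature extraction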

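Use cases

E-commerce

Product review classification

Sentiment analysis (positive/negative) of Amazon product reviews.

Accuracy: 93.51%

Counterfactual review detection

Identifying counterfactual reviews on Amazon.

Accuracy: 75.96%

Academic research

Paper clustering

Topic clustering of academic papers from arXiv and bioRxiv.

v_measure: 42.74-48.29

Question answering

Duplicate question detection

Identifying duplicate technical questions in the AskUbuntu community.

mrr: 75.46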
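
The mrr figure above is mean reciprocal rank: for each query, take 1/rank of the first relevant candidate, then average over queries. The sketch below shows that computation in plain Python. The three example queries are made up for illustration, and it assumes the score above is reported on a 0-100 scale (so 75.46 would correspond to 0.7546 here).

```python
from typing import Sequence


def mean_reciprocal_rank(ranked_relevance: Sequence[Sequence[bool]]) -> float:
    """Average of 1/rank of the first relevant item per query (0 if none is relevant)."""
    total = 0.0
    for flags in ranked_relevance:
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance) if ranked_relevance else 0.0


# Three hypothetical queries; each list marks which ranked candidates are true duplicates.
example = [
    [True, False, False],   # first hit at rank 1 -> 1.0
    [False, True, False],   # first hit at rank 2 -> 0.5
    [False, False, False],  # no hit -> 0.0
]
print(mean_reciprocal_rank(example))  # 0.5
```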

🚀 GIST Embedding v0

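GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning

The model is fine-tuned from BAAI/bge-base-en-v1.5 using the MEDI dataset, augmented with triplets mined from the MTEB Classification training dataset (excluding data from the Amazon Polarity Classification task).

The model does not require any instruction when generating embeddings. Queries for retrieval tasks can therefore be encoded directly, without crafting an instruction.

Technical paper: GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning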

🚀 Quick start

Loading the model

The model can be loaded easily with the Sentence Transformers library.

import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

revision = None  # Replace with a specific revision to ensure reproducibility if the model is updated.

model = SentenceTransformer("avsolatorio/GIST-Embedding-v0", revision=revision)

texts = [
    "Illustration of the REaLTabFormer model. The left block shows the non-relational tabular data model using GPT-2 with a causal LM head. In contrast, the right block shows how a relational dataset's child table is modeled using a sequence-to-sequence (Seq2Seq) model. The Seq2Seq model uses the observations in the parent table to condition the generation of the observations in the child table. The trained GPT-2 model on the parent table, with weights frozen, is also used as the encoder in the Seq2Seq model.",
    "Predicting human mobility holds significant practical value, with applications ranging from enhancing disaster risk planning to simulating epidemic spread. In this paper, we present the GeoFormer, a decoder-only transformer model adapted from the GPT architecture to forecast human mobility.",
    "As the economies of Southeast Asia continue adopting digital technologies, policy makers increasingly ask how to prepare the workforce for emerging labor demands. However, little is known about the skills that workers need to adapt to these changes"
]

# Compute embeddings
embeddings = model.encode(texts, convert_to_tensor=True)

# Compute the cosine similarity for each pair of sentences
scores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)

print(scores.cpu().numpy())
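
Because the model needs no instruction prefix, a retrieval query can be encoded exactly like the documents and ranked by cosine similarity. The following is a minimal sketch of that idea, not code from the model card: `rank_by_cosine` and `search` are hypothetical helper names, and `search` assumes only that the object passed in has an `encode` method like the one on `SentenceTransformer`. Toy vectors stand in for real embeddings so the ranking logic can be checked offline.

```python
import numpy as np


def rank_by_cosine(query_vec: np.ndarray, doc_vecs: np.ndarray) -> np.ndarray:
    """Return document indices sorted from most to least similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(-(d @ q))


def search(model, query: str, docs: list) -> list:
    """Encode the raw query and documents (no instruction prefix) and rank the documents."""
    doc_vecs = np.asarray(model.encode(docs))    # shape: (n_docs, dim)
    query_vec = np.asarray(model.encode(query))  # shape: (dim,)
    return [docs[i] for i in rank_by_cosine(query_vec, doc_vecs)]


# Toy 2-D vectors in place of real embeddings.
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(rank_by_cosine(np.array([1.0, 0.2]), toy_docs))  # [0 2 1]
```

With the real model, something like `search(SentenceTransformer("avsolatorio/GIST-Embedding-v0"), "forecasting human mobility with transformers", texts)` should return the entries of `texts` ordered by relevance to the query; the query string here is only an illustration.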

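✨ Key features

No instruction needed: the model does not require any instruction when generating embeddings, so queries for retrieval tasks can be encoded directly.
Fine-tuning with augmentation: fine-tuned from BAAI/bge-base-en-v1.5 on the MEDI dataset, augmented with triplets mined from the MTEB Classification training dataset.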

📦 Installation

Install the Sentence Transformers library with the following command:

pip install sentence-transformers

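📚 Documentation

Data

The dataset used is a compilation of the MEDI and MTEB Classification training datasets. Third-party datasets may be subject to additional terms and conditions under their associated licenses. A HuggingFace Dataset version of the compiled dataset, and the specific revision used to train the model, is available:

Dataset: avsolatorio/medi-data-mteb_avs_triplets
Revision: 238a0499b6e6b690cc64ea56fde8461daa8341bb

The dataset contains a task_type key, which can be used to select only the MTEB classification tasks (prefixed with mteb_).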
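
As a rough sketch of how the task_type key could be used, the snippet below keeps only rows whose task_type starts with mteb_. It uses the `load_dataset` and `Dataset.filter` calls from the HuggingFace `datasets` library. The helper names are hypothetical, the split name "train" is an assumption (check the dataset card), and the sample rows are made up so the filter can be tried without downloading anything.

```python
def is_mteb_task(example: dict) -> bool:
    """True when the row's task_type carries the mteb_ prefix."""
    return str(example["task_type"]).startswith("mteb_")


def load_mteb_subset():
    """Download the pinned dataset revision and keep only the MTEB classification rows.

    Needs `pip install datasets` and network access.
    """
    from datasets import load_dataset

    ds = load_dataset(
        "avsolatorio/medi-data-mteb_avs_triplets",
        revision="238a0499b6e6b690cc64ea56fde8461daa8341bb",
        split="train",  # assumption: the actual split names may differ
    )
    return ds.filter(is_mteb_task)


# Offline check with made-up rows (these task_type values are illustrative only).
rows = [
    {"task_type": "mteb_example_classification"},
    {"task_type": "some_medi_task"},
]
print([is_mteb_task(r) for r in rows])  # [True, False]
```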

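The MEDI dataset was published in the following paper: One Embedder, Any Task: Instruction-Finetuned Text Embeddings.

The MTEB benchmark results of the GIST embedding model, compared with the base model, suggest that the fine-tuning dataset perturbed the model considerably, producing significant improvements on some tasks and degrading performance on others.

The retrieval performance on the TRECCOVID task is of note. The fine-tuning dataset contains little knowledge about COVID-19, which could have caused the observed degradation. The paper details some evidence that the thematic coverage of the fine-tuning data can affect downstream performance.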

Training parameters

The following training parameters were used to fine-tune the model:

Epochs = 80
Warmup ratio = 0.1
Learning rate = 5e-6
Batch size = 32
Checkpoint step = 103500
Contrastive loss temperature = 0.01
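
To illustrate what the contrastive loss temperature controls, here is a minimal NumPy sketch of a generic InfoNCE-style loss with in-batch negatives. It is not the GISTEmbed training loss: the guided selection of negatives described in the paper is not reproduced, and `info_nce_loss` is a hypothetical name. Cosine similarities are divided by the temperature, so a small value such as 0.01 sharpens the softmax and penalizes near-miss negatives strongly.

```python
import numpy as np


def info_nce_loss(queries: np.ndarray, positives: np.ndarray, temperature: float = 0.01) -> float:
    """In-batch contrastive loss: row i of `positives` is the match for row i of `queries`."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature                      # temperature-scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilize the softmax numerically
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # cross-entropy against the diagonal


rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
aligned = info_nce_loss(q, q + 0.01 * rng.normal(size=q.shape))  # positives match their queries
shuffled = info_nce_loss(q, q[::-1])                             # positives paired with the wrong queries
print(aligned < shuffled)  # True
```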

Evaluation

The model was evaluated using the MTEB evaluation suite.

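🔧 Technical details

The model is fine-tuned from BAAI/bge-base-en-v1.5 using the MEDI dataset combined with triplets mined from the MTEB Classification training dataset. Fine-tuning optimizes the model across multiple tasks to improve its performance on text embedding tasks.

Training used the parameters listed above, including 80 epochs, a warmup ratio of 0.1, a learning rate of 5e-6, and a batch size of 32. These values were tuned so that the model performs well across different tasks.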
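
A warmup ratio of 0.1 means the learning rate ramps up over the first 10% of training steps. The sketch below shows one common reading of that setting: linear warmup to the base learning rate of 5e-6. What happens after warmup is not stated here, so the constant rate afterwards is an assumption, as are the function name and the total step count.

```python
def warmup_lr(step: int, total_steps: int, base_lr: float = 5e-6, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to base_lr, then hold constant (the post-warmup shape is an assumption)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


TOTAL_STEPS = 1_000  # hypothetical; the real total number of steps is not stated here
for step in (0, 49, 99, 500):
    print(step, warmup_lr(step, TOTAL_STEPS))
```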

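The model was evaluated with the MTEB evaluation suite, which covers multiple text embedding tasks and their metrics, giving a broad view of model performance.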

📄 License

This project is licensed under the MIT License.

📖 Citation

Please cite our work if you use GISTEmbed or the datasets we published in your projects or research. 🤗

@article{solatorio2024gistembed,
    title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
    author={Aivin V. Solatorio},
    journal={arXiv preprint arXiv:2402.16829},
    year={2024},
    URL={https://arxiv.org/abs/2402.16829},
    eprint={2402.16829},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}