🚀 GIST Large Embedding v0
GIST Large Embedding v0 is a fine-tuned text embedding model. It fine-tunes a base model on a curated dataset to produce high-quality text embeddings, performs well on tasks such as text retrieval and classification, and requires no additional instruction prompt to generate embeddings.
✨ Key Features
Generates embeddings directly, with no instruction prefix required for queries.
Fine-tuned from BAAI/bge-large-en-v1.5 on a compilation of the MEDI and MTEB Classification training datasets.
Performs well on text retrieval and classification tasks, as measured with the MTEB Evaluation suite.
📦 Installation
The model can be loaded easily with the Sentence Transformers library (installable from PyPI as sentence-transformers). Example:
from sentence_transformers import SentenceTransformer

# Pin a specific model revision if desired; None uses the latest available revision.
revision = None
model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0", revision=revision)
💻 Usage Examples
Basic Usage
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Load the model; set revision to pin a specific model version.
revision = None
model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0", revision=revision)
texts = [
"Illustration of the REaLTabFormer model. The left block shows the non-relational tabular data model using GPT-2 with a causal LM head. In contrast, the right block shows how a relational dataset's child table is modeled using a sequence-to-sequence (Seq2Seq) model. The Seq2Seq model uses the observations in the parent table to condition the generation of the observations in the child table. The trained GPT-2 model on the parent table, with weights frozen, is also used as the encoder in the Seq2Seq model.",
"Predicting human mobility holds significant practical value, with applications ranging from enhancing disaster risk planning to simulating epidemic spread. In this paper, we present the GeoFormer, a decoder-only transformer model adapted from the GPT architecture to forecast human mobility.",
"As the economies of Southeast Asia continue adopting digital technologies, policy makers increasingly ask how to prepare the workforce for emerging labor demands. However, little is known about the skills that workers need to adapt to these changes"
]
# Encode all texts into a single tensor of embeddings.
embeddings = model.encode(texts, convert_to_tensor=True)

# Pairwise cosine similarity between every pair of embeddings.
scores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
print(scores.cpu().numpy())
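Because the model needs no instruction prefix, the same encode call works for both queries and documents. Below is a minimal retrieval-style sketch that continues from the snippet above; the query string is illustrative, and util.cos_sim from Sentence Transformers is used here as an equivalent way to compute cosine similarity.

from sentence_transformers import util

# Hypothetical query; no instruction prefix is needed.
query = "transformer models for synthetic tabular data"

query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(texts, convert_to_tensor=True)

# Rank the documents by cosine similarity to the query.
scores = util.cos_sim(query_emb, doc_embs)[0]
best = int(scores.argmax())
print(f"Best match (score={scores[best].item():.3f}): {texts[best][:80]}...")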
🔧 Technical Details
The model is fine-tuned from BAAI/bge-large-en-v1.5 on a compilation of the MEDI dataset and the MTEB Classification training datasets. Fine-tuning brings significant improvements on some tasks but degrades performance on others, such as TRECCOVID. The results suggest that the topical coverage of the fine-tuning data affects downstream performance.
Training Parameters
Epochs = 40
Warmup ratio = 0.1
Learning rate = 5e-6
Batch size = 16
Checkpoint step = 171000
Contrastive loss temperature = 0.01
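For intuition, the sketch below shows a generic in-batch contrastive (InfoNCE-style) loss using the temperature of 0.01 listed above. It is a simplified illustration, not the exact GISTEmbed training objective, which additionally uses guided in-sample selection of negatives as described in the paper.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, pos_emb, temperature=0.01):
    # Each query's matching row is its positive; all other rows act as in-batch negatives.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature  # (batch, batch) cosine-similarity matrix scaled by temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Illustrative call with random vectors of the bge-large embedding size (1024) and batch size 16.
loss = in_batch_contrastive_loss(torch.randn(16, 1024), torch.randn(16, 1024))
print(loss.item())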
📚 Documentation
Data
The training data is a compilation of the MEDI dataset and the MTEB Classification training datasets. Third-party datasets may be subject to additional terms and conditions under their associated licenses. The compiled dataset is published as a HuggingFace dataset, together with the specific revision used to train the model.
The dataset contains a task_type key that can be used to select only the MTEB classification tasks (prefixed with mteb_).
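As a sketch of how that key can be used, assuming the compiled dataset is loaded from the HuggingFace Hub (DATASET_ID below is a placeholder, since the exact repository id is not reproduced here):

from datasets import load_dataset

DATASET_ID = "<compiled MEDI + MTEB classification dataset>"  # placeholder repository id

ds = load_dataset(DATASET_ID, split="train")

# Keep only rows whose task_type starts with "mteb_" (the MTEB classification tasks).
mteb_only = ds.filter(lambda row: row["task_type"].startswith("mteb_"))
print(mteb_only)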
The MEDI dataset was published in the paper: One Embedder, Any Task: Instruction-Finetuned Text Embeddings.
Evaluation
The model was evaluated using the MTEB Evaluation suite.
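A minimal sketch of running a single MTEB task with the mteb package follows; the chosen task and output folder are illustrative, and the exact API may differ between mteb versions.

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0")

# Run one illustrative classification task; the full suite covers many more tasks.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/GIST-large-Embedding-v0")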
📄 License
This project is released under the MIT License.
📖 Citation
If you use GISTEmbed or the datasets we published in your projects or research, please cite our work:
@article{solatorio2024gistembed,
    title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
    author={Aivin V. Solatorio},
    journal={arXiv preprint arXiv:2402.16829},
    year={2024},
    URL={https://arxiv.org/abs/2402.16829},
    eprint={2402.16829},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
🙏 Acknowledgements
This work was supported by the "KCP IV - Exploring Data Use in the Development Economics Literature using Large Language Models (AI and LLMs)" project funded by the World Bank's Knowledge for Change Program (KCP).
The findings, interpretations, and conclusions expressed in this material are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, the Executive Directors of the World Bank, or the governments they represent.