🚀 GIST Embedding v0
GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning
The model is fine-tuned on top of BAAI/bge-base-en-v1.5 using the MEDI dataset, augmented with mined triplets from the MTEB Classification training datasets (excluding data from the Amazon Polarity Classification task).
The model does not require any instruction for generating embeddings, so queries for retrieval tasks can be encoded directly without crafting instructions.
Technical paper: GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning
🚀 Quick Start
Model Loading
The model can be loaded easily with the Sentence Transformers library.
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
revision = None  # Replace with the specific revision to ensure reproducibility in case the model is updated.
model = SentenceTransformer("avsolatorio/GIST-Embedding-v0", revision=revision)
texts = [
"Illustration of the REaLTabFormer model. The left block shows the non-relational tabular data model using GPT-2 with a causal LM head. In contrast, the right block shows how a relational dataset's child table is modeled using a sequence-to-sequence (Seq2Seq) model. The Seq2Seq model uses the observations in the parent table to condition the generation of the observations in the child table. The trained GPT-2 model on the parent table, with weights frozen, is also used as the encoder in the Seq2Seq model.",
"Predicting human mobility holds significant practical value, with applications ranging from enhancing disaster risk planning to simulating epidemic spread. In this paper, we present the GeoFormer, a decoder-only transformer model adapted from the GPT architecture to forecast human mobility.",
"As the economies of Southeast Asia continue adopting digital technologies, policy makers increasingly ask how to prepare the workforce for emerging labor demands. However, little is known about the skills that workers need to adapt to these changes"
]
# Compute the embeddings
embeddings = model.encode(texts, convert_to_tensor=True)
# Compute the cosine similarity for each pair of sentences
scores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
print(scores.cpu().numpy())
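As a minimal sketch of the no-instruction retrieval workflow described above (the corpus, query, and top_k value below are purely illustrative), a query can be encoded directly and ranked against a small set of documents:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("avsolatorio/GIST-Embedding-v0")
# Illustrative corpus and query; note that no instruction prefix is added to either.
corpus = [
    "REaLTabFormer models relational tabular data with a sequence-to-sequence architecture.",
    "GeoFormer is a decoder-only transformer adapted from GPT to forecast human mobility.",
]
query = "Which model forecasts human mobility?"
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)
# Rank the corpus documents by cosine similarity to the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(round(hit["score"], 4), corpus[hit["corpus_id"]])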
✨ Key Features
- No instructions required: the model generates embeddings without any instruction, so queries for retrieval tasks can be encoded directly.
- Fine-tuning with mined triplets: built on BAAI/bge-base-en-v1.5 and fine-tuned on the MEDI dataset, augmented with triplets mined from the MTEB Classification training datasets.
📦 Installation
Install the Sentence Transformers library with:
pip install sentence-transformers
📚 Documentation
Data
The dataset used is a compilation of the MEDI dataset and the MTEB Classification training datasets. Third-party datasets may be subject to additional terms and conditions under their associated licenses. A HuggingFace Dataset version of the compiled dataset, together with the specific revision used to train the model, is available:
- Dataset: avsolatorio/medi-data-mteb_avs_triplets
- Revision: 238a0499b6e6b690cc64ea56fde8461daa8341bb
The dataset contains a task_type key, which can be used to select only the MTEB Classification tasks (prefixed with mteb_).
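As a minimal sketch of how that key can be used (assuming the compiled dataset exposes a standard "train" split), the triplets can be filtered with the datasets library:
from datasets import load_dataset
# Load the compiled training data at the revision noted above (split name assumed to be "train").
dataset = load_dataset(
    "avsolatorio/medi-data-mteb_avs_triplets",
    revision="238a0499b6e6b690cc64ea56fde8461daa8341bb",
    split="train",
)
# Keep only the MTEB Classification triplets, whose task_type values start with "mteb_".
mteb_only = dataset.filter(lambda example: example["task_type"].startswith("mteb_"))
print(len(dataset), len(mteb_only))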
The MEDI dataset was published in the paper: One Embedder, Any Task: Instruction-Finetuned Text Embeddings.
Comparing the MTEB benchmark results of the GIST embedding model against the base model shows that the fine-tuning dataset has a considerable impact, yielding significant improvements on some tasks while degrading performance on others.
The retrieval performance on the TRECCOVID task is of note. The fine-tuning dataset contains limited knowledge about COVID-19, which may explain the observed performance drop. In the paper we present evidence that the topical coverage of the fine-tuning data affects downstream performance.
Training Parameters
The following training parameters were used to fine-tune the model:
Epochs = 80
Warmup ratio = 0.1
Learning rate = 5e-6
Batch size = 32
Checkpoint step = 103500
Contrastive loss temperature = 0.01
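These hyperparameters are reported for reference only; the training code itself is not part of this card. As a rough, hypothetical sketch of how they could be plugged into a recent Sentence Transformers release (which ships a losses.GISTEmbedLoss that uses a guide model to filter in-batch negatives), one might write:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# Hypothetical fine-tuning sketch; not the authors' actual training script.
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
guide = SentenceTransformer("BAAI/bge-base-en-v1.5")  # guide model used to select in-sample negatives
train_examples = [
    InputExample(texts=["an illustrative query", "a relevant passage", "a hard negative passage"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
# The contrastive loss temperature of 0.01 matches the value reported above.
train_loss = losses.GISTEmbedLoss(model=model, guide=guide, temperature=0.01)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=80,
    warmup_steps=int(0.1 * 80 * len(train_dataloader)),  # warmup ratio of 0.1
    optimizer_params={"lr": 5e-6},
)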
Evaluation
The model was evaluated using the MTEB Evaluation suite.
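A minimal sketch of reproducing an evaluation with the mteb package (the task name and output folder below are just examples; the full suite covers many more tasks):
from mteb import MTEB
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("avsolatorio/GIST-Embedding-v0")
# Evaluate on a single illustrative task; pass additional task names to cover more of the benchmark.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/GIST-Embedding-v0")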
🔧 Technical Details
The model is fine-tuned from BAAI/bge-base-en-v1.5 using the MEDI dataset combined with triplets mined from the MTEB Classification training datasets. During fine-tuning, the model is optimized across multiple tasks to improve its performance on text embedding tasks.
Training used the specific parameters listed above, including 80 epochs, a warmup ratio of 0.1, a learning rate of 5e-6, and a batch size of 32. These settings were tuned so that the model performs well across a range of tasks.
The model was evaluated with the MTEB Evaluation suite, which covers a broad set of text embedding tasks and provides a comprehensive assessment of model performance.
📄 License
This project is licensed under the MIT License.
📖 Citation
Please cite our work if you use GISTEmbed or the datasets we published in your projects or research. 🤗
@article{solatorio2024gistembed,
    title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
    author={Aivin V. Solatorio},
    journal={arXiv preprint arXiv:2402.16829},
    year={2024},
    url={https://arxiv.org/abs/2402.16829},
    eprint={2402.16829},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
🙏 Acknowledgements
This work is supported by the "KCP IV - Exploring Data Use in the Development Economics Literature using Large Language Models (AI and LLMs)" project funded by the Knowledge for Change Program (KCP) of the World Bank.
The findings, interpretations, and conclusions expressed in this material are entirely those of the author. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.







