🚀 GIST Large Embedding v0
GIST Large Embedding v0 is a fine-tuned text embedding model. It is obtained by fine-tuning a base embedding model on a curated dataset, performs well on tasks such as text retrieval and classification, and requires no instruction prefix to generate embeddings.
✨ Key Features
The model is fine-tuned from BAAI/bge-large-en-v1.5 on a combination of the MEDI and MTEB Classification training datasets, and it produces embeddings directly from text without any additional instruction.
📦 Installation
The model can be loaded with the Sentence Transformers library (installable via pip install sentence-transformers). Example:
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
# Pin a specific model revision for reproducibility; None loads the latest weights.
revision = None
model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0", revision=revision)
💻 Usage Examples
Basic Usage
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
# Pin a specific model revision for reproducibility; None loads the latest weights.
revision = None
model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0", revision=revision)
texts = [
"Illustration of the REaLTabFormer model. The left block shows the non-relational tabular data model using GPT-2 with a causal LM head. In contrast, the right block shows how a relational dataset's child table is modeled using a sequence-to-sequence (Seq2Seq) model. The Seq2Seq model uses the observations in the parent table to condition the generation of the observations in the child table. The trained GPT-2 model on the parent table, with weights frozen, is also used as the encoder in the Seq2Seq model.",
"Predicting human mobility holds significant practical value, with applications ranging from enhancing disaster risk planning to simulating epidemic spread. In this paper, we present the GeoFormer, a decoder-only transformer model adapted from the GPT architecture to forecast human mobility.",
"As the economies of Southeast Asia continue adopting digital technologies, policy makers increasingly ask how to prepare the workforce for emerging labor demands. However, little is known about the skills that workers need to adapt to these changes"
]
# Encode the texts into a tensor of shape (num_texts, embedding_dim).
embeddings = model.encode(texts, convert_to_tensor=True)
# Pairwise cosine similarity matrix between all encoded texts.
scores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
print(scores.cpu().numpy())
🔧 Technical Details
The model is fine-tuned from BAAI/bge-large-en-v1.5 on a combination of the MEDI and MTEB Classification training datasets. Fine-tuning brings significant improvements on some tasks but degrades performance on others, such as TRECCOVID. Our findings suggest that the topical coverage of the fine-tuning data affects downstream performance.
Training parameters
Epochs = 40
Warmup ratio = 0.1
Learning rate = 5e-6
Batch size = 16
Checkpoint step = 171000
Contrastive loss temperature = 0.01
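For intuition, below is a minimal sketch of a temperature-scaled, in-batch contrastive (InfoNCE-style) loss consistent with the temperature listed above. It is an illustration only: the actual training uses GISTEmbed's guided in-sample selection of negatives, which this sketch does not reproduce, and the function name is ours.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, pos_emb, temperature=0.01):
    # Normalize so that dot products equal cosine similarities.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    # Similarity matrix between every query and every positive, scaled by the temperature.
    logits = q @ p.T / temperature
    # Row i should match column i; the other in-batch items act as negatives.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)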
📚 Documentation
Data
The dataset used is a compilation of the MEDI and MTEB Classification training datasets. Third-party datasets may be subject to additional terms and conditions under their associated licenses. A HuggingFace Dataset version of the compiled dataset, along with the specific revision used to train the model, is available on the Hugging Face Hub.
The dataset contains a task_type key, which can be used to select only the mteb classification tasks (prefixed with mteb_).
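As a sketch, filtering the compiled dataset by the task_type key could look like the following; the dataset identifier below is a placeholder, so substitute the actual compiled dataset.
from datasets import load_dataset

# Placeholder repository id; replace it with the compiled dataset referenced above.
dataset = load_dataset("your-namespace/compiled-medi-mteb-dataset", split="train")

# Keep only the MTEB classification tasks, which are prefixed with "mteb_".
mteb_only = dataset.filter(lambda example: example["task_type"].startswith("mteb_"))
print(len(mteb_only))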
The MEDI dataset was published in the paper One Embedder, Any Task: Instruction-Finetuned Text Embeddings.
Evaluation
The model was evaluated using the MTEB Evaluation suite.
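A minimal sketch of running a single MTEB task with the mteb package is shown below; the task selection and output folder are illustrative, and the full benchmark covers many more tasks.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0")

# Evaluate on one classification task as an example.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/GIST-large-Embedding-v0")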
📄 License
This project is released under the MIT License.
📖 Citation
If you use GISTEmbed or the datasets we published in your projects or research, please cite our work:
@article{solatorio2024gistembed,
title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
author={Aivin V. Solatorio},
journal={arXiv preprint arXiv:2402.16829},
year={2024},
URL={https://arxiv.org/abs/2402.16829},
eprint={2402.16829},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
🙏 Acknowledgements
This work is supported by the "KCP IV - Exploring Data Use in the Development Economics Literature using Large Language Models (AI and LLMs)" project, funded by the Knowledge for Change Program (KCP) of the World Bank.
The findings, interpretations, and conclusions expressed in this material are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.