🚀 GIST Large Embedding v0
GIST Large Embedding v0 is a fine-tuned text embedding model. It is obtained by fine-tuning a base embedding model on a curated dataset, performs well on tasks such as text retrieval and classification, and requires no instruction prefix to generate embeddings.
✨ Key Features
The model is fine-tuned from BAAI/bge-large-en-v1.5 on a combination of the MEDI and MTEB Classification training datasets, and it produces embeddings directly from text without any additional instruction.
📦 Installation
The model can be loaded with the Sentence Transformers library (installable via pip install sentence-transformers). Example:
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
# Pin a specific model revision for reproducibility; None loads the latest weights.
revision = None
model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0", revision=revision)
💻 Usage Examples
Basic Usage
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
# Pin a specific model revision for reproducibility; None loads the latest weights.
revision = None
model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0", revision=revision)
texts = [
"Illustration of the REaLTabFormer model. The left block shows the non-relational tabular data model using GPT-2 with a causal LM head. In contrast, the right block shows how a relational dataset's child table is modeled using a sequence-to-sequence (Seq2Seq) model. The Seq2Seq model uses the observations in the parent table to condition the generation of the observations in the child table. The trained GPT-2 model on the parent table, with weights frozen, is also used as the encoder in the Seq2Seq model.",
"Predicting human mobility holds significant practical value, with applications ranging from enhancing disaster risk planning to simulating epidemic spread. In this paper, we present the GeoFormer, a decoder-only transformer model adapted from the GPT architecture to forecast human mobility.",
"As the economies of Southeast Asia continue adopting digital technologies, policy makers increasingly ask how to prepare the workforce for emerging labor demands. However, little is known about the skills that workers need to adapt to these changes"
]
# Encode the texts into a tensor of shape (num_texts, embedding_dim).
embeddings = model.encode(texts, convert_to_tensor=True)
# Pairwise cosine similarity matrix between all encoded texts.
scores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
print(scores.cpu().numpy())
🔧 Technical Details
The model is fine-tuned from BAAI/bge-large-en-v1.5 on a combination of the MEDI and MTEB Classification training datasets. Fine-tuning brings significant improvements on some tasks but degrades performance on others, such as TRECCOVID. Our findings suggest that the topical coverage of the fine-tuning data affects downstream performance.
Training parameters
Epochs = 40
Warmup ratio = 0.1
Learning rate = 5e-6
Batch size = 16
Checkpoint step = 171000
Contrastive loss temperature = 0.01
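For intuition, below is a minimal sketch of a temperature-scaled, in-batch contrastive (InfoNCE-style) loss consistent with the temperature listed above. It is an illustration only: the actual training uses GISTEmbed's guided in-sample selection of negatives, which this sketch does not reproduce, and the function name is ours.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, pos_emb, temperature=0.01):
    # Normalize so that dot products equal cosine similarities.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    # Similarity matrix between every query and every positive, scaled by the temperature.
    logits = q @ p.T / temperature
    # Row i should match column i; the other in-batch items act as negatives.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)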
📚 Documentation
Data
The dataset used is a compilation of the MEDI and MTEB Classification training datasets. Third-party datasets may be subject to additional terms and conditions under their associated licenses. A HuggingFace Dataset version of the compiled dataset, along with the specific revision used to train the model, is available on the Hugging Face Hub.
The dataset contains a task_type key, which can be used to select only the mteb classification tasks (prefixed with mteb_).
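As a sketch, filtering the compiled dataset by the task_type key could look like the following; the dataset identifier below is a placeholder, so substitute the actual compiled dataset.
from datasets import load_dataset

# Placeholder repository id; replace it with the compiled dataset referenced above.
dataset = load_dataset("your-namespace/compiled-medi-mteb-dataset", split="train")

# Keep only the MTEB classification tasks, which are prefixed with "mteb_".
mteb_only = dataset.filter(lambda example: example["task_type"].startswith("mteb_"))
print(len(mteb_only))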
The MEDI dataset was published in the paper One Embedder, Any Task: Instruction-Finetuned Text Embeddings.
Evaluation
The model was evaluated using the MTEB Evaluation suite.
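A minimal sketch of running a single MTEB task with the mteb package is shown below; the task selection and output folder are illustrative, and the full benchmark covers many more tasks.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0")

# Evaluate on one classification task as an example.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/GIST-large-Embedding-v0")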
📄 License
This project is released under the MIT License.
📖 Citation
If you use GISTEmbed or the datasets we published in your projects or research, please cite our work:
@article{solatorio2024gistembed,
title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
author={Aivin V. Solatorio},
journal={arXiv preprint arXiv:2402.16829},
year={2024},
URL={https://arxiv.org/abs/2402.16829},
eprint={2402.16829},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
🙏 Acknowledgements
This work is supported by the "KCP IV - Exploring Data Use in the Development Economics Literature using Large Language Models (AI and LLMs)" project, funded by the Knowledge for Change Program (KCP) of the World Bank.
The findings, interpretations, and conclusions expressed in this material are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.