🚀 GIST Embedding v0
GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning
The model is fine-tuned on top of BAAI/bge-base-en-v1.5 using the MEDI dataset, augmented with mined triplets from the MTEB Classification training datasets (excluding data from the Amazon Polarity Classification task).
The model does not require any instruction for generating embeddings, so queries for retrieval tasks can be encoded directly without crafting instructions.
Technical paper: GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning
🚀 Quick Start
Model Loading
The model can be loaded easily with the Sentence Transformers library.
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
revision = None  # Replace with the specific revision to ensure reproducibility in case the model is updated.
model = SentenceTransformer("avsolatorio/GIST-Embedding-v0", revision=revision)
texts = [
"Illustration of the REaLTabFormer model. The left block shows the non-relational tabular data model using GPT-2 with a causal LM head. In contrast, the right block shows how a relational dataset's child table is modeled using a sequence-to-sequence (Seq2Seq) model. The Seq2Seq model uses the observations in the parent table to condition the generation of the observations in the child table. The trained GPT-2 model on the parent table, with weights frozen, is also used as the encoder in the Seq2Seq model.",
"Predicting human mobility holds significant practical value, with applications ranging from enhancing disaster risk planning to simulating epidemic spread. In this paper, we present the GeoFormer, a decoder-only transformer model adapted from the GPT architecture to forecast human mobility.",
"As the economies of Southeast Asia continue adopting digital technologies, policy makers increasingly ask how to prepare the workforce for emerging labor demands. However, little is known about the skills that workers need to adapt to these changes"
]
# Compute the embeddings
embeddings = model.encode(texts, convert_to_tensor=True)
# Compute the cosine similarity for each pair of sentences
scores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
print(scores.cpu().numpy())
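As a minimal sketch of the no-instruction retrieval workflow described above (the corpus, query, and top_k value below are purely illustrative), a query can be encoded directly and ranked against a small set of documents:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("avsolatorio/GIST-Embedding-v0")
# Illustrative corpus and query; note that no instruction prefix is added to either.
corpus = [
    "REaLTabFormer models relational tabular data with a sequence-to-sequence architecture.",
    "GeoFormer is a decoder-only transformer adapted from GPT to forecast human mobility.",
]
query = "Which model forecasts human mobility?"
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)
# Rank the corpus documents by cosine similarity to the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(round(hit["score"], 4), corpus[hit["corpus_id"]])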
✨ Key Features
- No instructions required: the model generates embeddings without any instruction, so queries for retrieval tasks can be encoded directly.
- Fine-tuning with mined triplets: built on BAAI/bge-base-en-v1.5 and fine-tuned on the MEDI dataset, augmented with triplets mined from the MTEB Classification training datasets.
📦 Installation
Install the Sentence Transformers library with:
pip install sentence-transformers
📚 Documentation
Data
The dataset used is a compilation of the MEDI dataset and the MTEB Classification training datasets. Third-party datasets may be subject to additional terms and conditions under their associated licenses. A HuggingFace Dataset version of the compiled dataset, together with the specific revision used to train the model, is available:
- Dataset: avsolatorio/medi-data-mteb_avs_triplets
- Revision: 238a0499b6e6b690cc64ea56fde8461daa8341bb
The dataset contains a task_type key, which can be used to select only the MTEB Classification tasks (prefixed with mteb_).
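As a minimal sketch of how that key can be used (assuming the compiled dataset exposes a standard "train" split), the triplets can be filtered with the datasets library:
from datasets import load_dataset
# Load the compiled training data at the revision noted above (split name assumed to be "train").
dataset = load_dataset(
    "avsolatorio/medi-data-mteb_avs_triplets",
    revision="238a0499b6e6b690cc64ea56fde8461daa8341bb",
    split="train",
)
# Keep only the MTEB Classification triplets, whose task_type values start with "mteb_".
mteb_only = dataset.filter(lambda example: example["task_type"].startswith("mteb_"))
print(len(dataset), len(mteb_only))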
The MEDI dataset was published in the paper: One Embedder, Any Task: Instruction-Finetuned Text Embeddings.
Comparing the MTEB benchmark results of the GIST embedding model against the base model shows that the fine-tuning dataset has a considerable impact, yielding significant improvements on some tasks while degrading performance on others.
The retrieval performance on the TRECCOVID task is of note. The fine-tuning dataset contains limited knowledge about COVID-19, which may explain the observed performance drop. In the paper we present evidence that the topical coverage of the fine-tuning data affects downstream performance.
Training Parameters
The following training parameters were used to fine-tune the model:
Epochs = 80
Warmup ratio = 0.1
Learning rate = 5e-6
Batch size = 32
Checkpoint step = 103500
Contrastive loss temperature = 0.01
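These hyperparameters are reported for reference only; the training code itself is not part of this card. As a rough, hypothetical sketch of how they could be plugged into a recent Sentence Transformers release (which ships a losses.GISTEmbedLoss that uses a guide model to filter in-batch negatives), one might write:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# Hypothetical fine-tuning sketch; not the authors' actual training script.
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
guide = SentenceTransformer("BAAI/bge-base-en-v1.5")  # guide model used to select in-sample negatives
train_examples = [
    InputExample(texts=["an illustrative query", "a relevant passage", "a hard negative passage"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
# The contrastive loss temperature of 0.01 matches the value reported above.
train_loss = losses.GISTEmbedLoss(model=model, guide=guide, temperature=0.01)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=80,
    warmup_steps=int(0.1 * 80 * len(train_dataloader)),  # warmup ratio of 0.1
    optimizer_params={"lr": 5e-6},
)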
Evaluation
The model was evaluated using the MTEB Evaluation suite.
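A minimal sketch of reproducing an evaluation with the mteb package (the task name and output folder below are just examples; the full suite covers many more tasks):
from mteb import MTEB
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("avsolatorio/GIST-Embedding-v0")
# Evaluate on a single illustrative task; pass additional task names to cover more of the benchmark.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/GIST-Embedding-v0")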
🔧 Technical Details
The model is fine-tuned from BAAI/bge-base-en-v1.5 using the MEDI dataset combined with triplets mined from the MTEB Classification training datasets. During fine-tuning, the model is optimized across multiple tasks to improve its performance on text embedding tasks.
Training used the specific parameters listed above, including 80 epochs, a warmup ratio of 0.1, a learning rate of 5e-6, and a batch size of 32. These settings were tuned so that the model performs well across a range of tasks.
The model was evaluated with the MTEB Evaluation suite, which covers a broad set of text embedding tasks and provides a comprehensive assessment of model performance.
📄 License
This project is licensed under the MIT License.
📖 Citation
Please cite our work if you use GISTEmbed or the datasets we published in your projects or research. 🤗
@article{solatorio2024gistembed,
    title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
    author={Aivin V. Solatorio},
    journal={arXiv preprint arXiv:2402.16829},
    year={2024},
    url={https://arxiv.org/abs/2402.16829},
    eprint={2402.16829},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
🙏 Acknowledgements
This work is supported by the "KCP IV - Exploring Data Use in the Development Economics Literature using Large Language Models (AI and LLMs)" project funded by the Knowledge for Change Program (KCP) of the World Bank.
The findings, interpretations, and conclusions expressed in this material are entirely those of the author. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.







