langcache-embed-v2: an open-source sentence transformer model for generating 768-dimensional sentence embeddings, free to use


Langcache Embed V2

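Developed by Redis

A sentence transformer model fine-tuned from Redis Langcache Embed v1 that generates 768-dimensional sentence embeddings

Text embeddings #Semantic similarity #Long-text embeddings #Triplet fine-tuning

Downloads: 126

Released: 5/21/2025

Model overview

The model is built on the sentence-transformers framework and fine-tuned on a triplet dataset. It maps text into a 768-dimensional vector space and supports tasks such as semantic similarity, search, and classification.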

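Model features

High-dimensional vector mapping

Maps sentences and paragraphs to a 768-dimensional dense vector space

Long-text support

Supports sequences of up to 8192 tokens

Multi-task use

Suited to a range of NLP tasks, including similarity computation, semantic search, and text classification

Efficient training

Trained with MatryoshkaLoss on triplet data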

Model capabilities

Semantic textual similarity

Semantic search

Paraphrase mining

Text classification

Text clustering

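Use cases

Information retrieval

Semantic search systems

Build search systems that match on meaning rather than keywords

Identifies semantically similar queries and documents

Content analysis

Text similarity analysis

Compare the semantic similarity of different texts

Identifies semantically close text pairs

Text clustering

Automatically group semantically similar documents

Enables unsupervised document organization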
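
As a rough illustration of the grouping use case, the sketch below assigns each embedding to the first cluster whose seed vector is similar enough. It is not part of the model or of any library: the function name `greedy_cluster`, the threshold values, and the toy vectors are all hypothetical, and the vectors are assumed to be unit-length so that a dot product equals cosine similarity.

```python
def greedy_cluster(vectors, threshold):
    """Group unit vectors: join the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of indices; the first index is the seed
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            seed = vectors[cluster[0]]
            # Dot product of unit vectors equals their cosine similarity.
            if sum(a * b for a, b in zip(seed, vec)) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([i])
    return clusters


# Hypothetical 2-dimensional unit vectors standing in for real embeddings.
toy_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.8, 0.6]]
print(greedy_cluster(toy_vectors, threshold=0.9))  # [[0], [1, 3], [2]]
```

A lower threshold merges more documents into each group; a suitable value would have to be chosen empirically for real embeddings.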

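🚀 Sentence transformer based on redis/langcache-embed-v1

This project fine-tunes the redis/langcache-embed-v1 model on a triplet dataset using the sentence-transformers framework. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.

🚀 Quick start

Install the Sentence Transformers library

First, install the Sentence Transformers library:

pip install -U sentence-transformers

Load the model and run inference

Once it is installed, you can load the model and run inference: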

from sentence_transformers import SentenceTransformer

# Download the model from the Hugging Face Hub
model = SentenceTransformer("redis/langcache-embed-v2")
# Run inference
sentences = [
    'What are some examples of crimes understood as a moral turpitude?',
    'What are some examples of crimes of moral turpitude?',
    'What are some examples of crimes understood as a legal aptitude?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Compute pairwise similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
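
Once a pairwise similarity matrix is available, a common next step is to find each sentence's closest match. The sketch below shows one way to do that on a plain nested list (for example, the result of calling `.tolist()` on the tensor returned by `model.similarity`). The helper name `nearest_neighbors` and the scores are hypothetical and are not real model output.

```python
def nearest_neighbors(similarities):
    """For each row, return the index of the most similar *other* item."""
    neighbors = []
    for i, row in enumerate(similarities):
        # Skip the diagonal: every sentence is trivially most similar to itself.
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        _, best_j = max(candidates)
        neighbors.append(best_j)
    return neighbors


# Hypothetical scores for three sentences; not produced by the model.
toy_similarities = [
    [1.00, 0.92, 0.55],
    [0.92, 1.00, 0.50],
    [0.55, 0.50, 1.00],
]
print(nearest_neighbors(toy_similarities))  # [1, 0, 0]
```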

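✨ Key features

High-dimensional vector mapping: maps sentences and paragraphs to a 768-dimensional dense vector space for semantic analysis.
Multi-task support: can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other natural language processing tasks.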


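📚 Detailed documentation

Model details

Model description

Property	Details
Model type	Sentence Transformer
Base model	redis/langcache-embed-v1
Maximum sequence length	8192 tokens
Output dimensionality	768 dimensions
Similarity function	Cosine similarity
Training dataset	Triplet dataset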
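
The table lists cosine similarity as the similarity function. For reference, a minimal pure-Python version of that formula is sketched below; the function name `cosine_similarity` is chosen here for illustration, and in practice the library computes this for you.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal vectors)
```

Scores range from -1 (opposite directions) to 1 (same direction), independent of vector length.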

Model sources

Documentation: Sentence Transformers documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full model architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
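
The Pooling module above has `pooling_mode_cls_token` set to True, meaning the sentence embedding is taken from the first ([CLS]) token rather than averaged over all tokens. The sketch below contrasts the two on toy data. It is a simplification: the function names and 2-dimensional vectors are hypothetical, and the mean-pooling version assumes there are no padding tokens to mask out.

```python
def cls_pooling(token_embeddings):
    """Use the first token's vector as the sentence embedding."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    return list(token_embeddings[0])


def mean_pooling(token_embeddings):
    """Average all token vectors (shown for contrast; assumes no padding)."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    count = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / count for d in range(dim)]


# Hypothetical per-token outputs; real ones would have 768 dimensions.
toy_tokens = [
    [0.5, 0.1],  # position 0: the [CLS] token
    [0.3, 0.7],
    [0.1, 0.4],
]
print(cls_pooling(toy_tokens))  # [0.5, 0.1]
```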

Training details

Dataset: triplet dataset
Size: 36,864 training samples
Columns: anchor, positive, negative_1, negative_2, and negative_3

Sample rows

anchor	positive	negative_1	negative_2	negative_3
`Is life really what I make of it?`	`Life is what you make it?`	`Is life hardly what I take of it?`	`Life is not entirely what I make of it.`	`Is life not what I make of it?`
`When you visit a website, can a person running the website see your IP address?`	`Does every website I visit knows my public ip address?`	`When you avoid a website, can a person hiding the website see your MAC address?`	`When you send an email, can the recipient see your physical location?`	`When you visit a website, a person running the website cannot see your IP address.`
`What are some cool features about iOS 10?`	`What are the best new features of iOS 10?`	`iOS 10 received criticism for its initial bugs and performance issues, and some users found the redesigned apps less intuitive compared to previous versions.`	`What are the drawbacks of using Android 14?`	`iOS 10 was widely criticized for its bugs, removal of beloved features, and generally being a downgrade from previous versions.`
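
To make the row layout concrete, the sketch below models one row as a small data class and expands it into (anchor, positive, negative) triplets. The class `TripletRow` and its methods are hypothetical names used only for illustration; how the training code actually consumes the columns is not described in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TripletRow:
    """One training row: an anchor, one positive, and three hard negatives."""
    anchor: str
    positive: str
    negative_1: str
    negative_2: str
    negative_3: str

    def negatives(self):
        return [self.negative_1, self.negative_2, self.negative_3]

    def as_triplets(self):
        # Pair the same anchor and positive with each negative in turn.
        return [(self.anchor, self.positive, neg) for neg in self.negatives()]


# First sample row from the table above.
row = TripletRow(
    anchor="Is life really what I make of it?",
    positive="Life is what you make it?",
    negative_1="Is life hardly what I take of it?",
    negative_2="Life is not entirely what I make of it.",
    negative_3="Is life not what I make of it?",
)
for triplet in row.as_triplets():
    print(triplet)
```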

Loss function

MatryoshkaLoss with the following parameters:

{
    "loss": "CachedMultipleNegativesRankingLoss",
    "matryoshka_dims": [768,512,256,128,64],
    "matryoshka_weights": [1,1,1,1,1],
    "n_dims_per_step": -1
}
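
The configuration above can be read as: apply the inner ranking loss to the first 768, 512, 256, 128, and 64 dimensions of each embedding and sum the results with equal weights. The pure-Python sketch below illustrates that idea on toy data. It is not the library implementation: the function names are hypothetical, the inner loss is a plain in-batch-negatives cross-entropy over scaled cosine similarities (the scale of 20 is an assumption), and the gradient-caching memory optimization of the cached variant is omitted.

```python
import math


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def ranking_loss(anchors, positives, scale=20.0):
    """In-batch-negatives cross-entropy: anchor i should rank positive i first."""
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [scale * sum(x * y for x, y in zip(anchor, pos)) for pos in p]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(a)


def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the ranking loss over truncated embedding prefixes."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        truncated_a = [v[:dim] for v in anchors]
        truncated_p = [v[:dim] for v in positives]
        total += weight * ranking_loss(truncated_a, truncated_p)
    return total


# Hypothetical 4-dimensional embeddings; the real dims are 768 down to 64.
toy_anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
toy_positives = [[0.9, 0.1, 0.3, 0.0], [0.1, 0.8, 0.0, 0.4]]
print(matryoshka_loss(toy_anchors, toy_positives, dims=[4, 2], weights=[1, 1]))
```

Training against every prefix at once is what makes it reasonable to truncate the final embeddings to a smaller dimension at inference time.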

Evaluation

Evaluation sets: medical, redis, quora, negation

🔧 Technical details

Training log

Epoch	Step	Training loss	Triplet loss
0.0556	1	6.4636	-
0.1111	2	6.1076	-
0.1667	3	5.8323	-
0.2222	4	5.6861	-
0.2778	5	5.5694	-
0.3333	6	5.2121	-
0.3889	7	5.0695	-
0.4444	8	4.81	-
0.5	9	4.6698	-
0.5556	10	4.3546	1.2224
0.6111	11	4.1922	-
0.6667	12	4.1434	-
0.7222	13	3.9918	-
0.7778	14	3.702	-
0.8333	15	3.6501	-
0.8889	16	3.6641	-
0.9444	17	3.3196	-
1.0	18	2.7108	-
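
The log and the dataset size together imply a batch size, although the source never states one. The short calculation below works it out under the assumption that the 18 logged steps cover exactly one pass over all 36,864 samples; whether that figure is a per-device batch or a combined one (across devices or accumulation steps) is not known.

```python
train_samples = 36_864  # from the training details
final_step = 18         # last step in the log
final_epoch = 1.0       # epoch value logged at that step

steps_per_epoch = round(final_step / final_epoch)
implied_batch_size = train_samples / steps_per_epoch
print(steps_per_epoch, implied_batch_size)  # 18 2048.0

# Cross-check against the first log row: one step should be 1/18 of an epoch.
first_epoch_fraction = round(1 / steps_per_epoch, 4)
print(first_epoch_fraction)  # 0.0556
```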

Framework versions

Python: 3.11.11
Sentence Transformers: 4.1.0
Transformers: 4.51.3
PyTorch: 2.6.0+cu124
Accelerate: 1.6.0
Datasets: 3.5.1
Tokenizers: 0.21.1

📄 License

The source documentation does not state a license.

📚 Citation

Redis Langcache-embed models

If you use our models or build on our work, we encourage you to cite it:

@inproceedings{langcache-embed-v1,
    title = "Advancing Semantic Caching for LLMs with Domain-Specific Embeddings and Synthetic Data",
    author = "Gill, Cechmanek, Hutcherson, Rajamohan, Agarwal, Gulzar, Singh, Dion",
    month = "04",
    year = "2025",
    url = "https://arxiv.org/abs/2504.02268",
}

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}