langcache-embed-v2: open-source sentence transformer model for generating 768-dimensional sentence embeddings for free

Langcache Embed V2

Developed by redis

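A sentence transformer model fine-tuned from Redis langcache-embed-v1 that generates 768-dimensional sentence embeddings

#Text embedding #Semantic similarity #Long-text embedding #Triplet fine-tuning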

Downloads 126

Release Time : 5/21/2025

Model Overview

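Built on the sentence-transformers framework and fine-tuned on a triplet dataset, the model maps text into a 768-dimensional vector space and supports semantic similarity, search, classification, and related tasks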

Model Features

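High-dimensional vector mapping

Maps sentences and paragraphs to a 768-dimensional dense vector space

Long-text support

Supports sequences of up to 8192 tokens

Multi-task adaptability

Suited to similarity computation, semantic search, text classification, and other NLP tasks

Efficient training

Trained with MatryoshkaLoss on triplet data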

Model Capabilities

Semantic textual similarity

Semantic search

Paraphrase mining

Text classification

Text clustering

Use Cases

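Information retrieval

Semantic search systems

Build search systems that match on meaning rather than keywords

Identifies semantically similar queries and documents

Content analysis

Text similarity analysis

Compare the semantic similarity of different texts

Identifies semantically close text pairs

Text clustering

Automatically group semantically similar documents

Enables unsupervised document organization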
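
The semantic search use case above can be sketched as ranking documents by cosine similarity to a query embedding. This is a minimal illustration, not the model's own retrieval code: the 3-dimensional vectors are toy stand-ins for real 768-dimensional embeddings, and the helper names `cosine` and `search` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, doc_vecs, top_k=2):
    """Return (doc_index, score) pairs for the top_k most similar documents."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for document embeddings (assumption, not model output).
docs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.3, 0.1],
]
query = [1.0, 0.0, 0.0]

results = search(query, docs)
for index, score in results:
    print(index, round(score, 4))
```

In a real system the vectors would come from `model.encode`, and a vector index would typically replace the linear scan once the document set grows.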

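🚀 Sentence transformer based on redis/langcache-embed-v1

This project fine-tunes the redis/langcache-embed-v1 model on a triplet dataset using the sentence-transformers framework. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and related tasks.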

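🚀 Quick Start

Install the Sentence Transformers library

First, install the Sentence Transformers library:

pip install -U sentence-transformers

Load the model and run inference

Once it is installed, you can load the model and run inference: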

from sentence_transformers import SentenceTransformer

# Download the model from the Hugging Face Hub
model = SentenceTransformer("redis/langcache-embed-v2")
# Run inference
sentences = [
    'What are some examples of crimes understood as a moral turpitude?',
    'What are some examples of crimes of moral turpitude?',
    'What are some examples of crimes understood as a legal aptitude?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the pairwise similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
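
To make the 3 x 3 result above concrete, here is a small sketch of what a pairwise cosine similarity matrix contains. It assumes cosine similarity, which is the similarity function this page lists for the model. The vectors are toy values rather than real embeddings, and `normalize` and `pairwise_cosine` are illustrative names, not part of the Sentence Transformers API.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def pairwise_cosine(vectors):
    """Return the matrix whose entry [i][j] is cos(vectors[i], vectors[j])."""
    unit = [normalize(v) for v in vectors]
    # For unit vectors, the dot product equals the cosine similarity.
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]

# Toy vectors: the first two point the same way, the third is orthogonal.
toy = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [0.0, 0.0, 3.0],
]
matrix = pairwise_cosine(toy)
for row in matrix:
    print([round(value, 4) for value in row])
```

The diagonal is always 1 because every vector is compared with itself, and the matrix is symmetric, so only the entries above the diagonal carry new information.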

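✨ Key Features

High-dimensional vector mapping: maps sentences and paragraphs to a 768-dimensional dense vector space for semantic analysis.
Multi-task support: usable for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other natural language processing tasks.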

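📚 Documentation

Model Details

Model Description

Property	Details
Model type	Sentence Transformer
Base model	redis/langcache-embed-v1
Maximum sequence length	8192 tokens
Output dimensionality	768 dimensions
Similarity function	Cosine similarity
Training dataset	Triplet dataset

Model Sources

Documentation: Sentence Transformers documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture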

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
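
The Pooling configuration above enables only `pooling_mode_cls_token`, meaning the sentence vector is taken from the first ([CLS]) token rather than averaged over all tokens. The sketch below contrasts the two modes on toy 2-dimensional token embeddings. It is a simplification under stated assumptions: the function names are hypothetical, and padding and attention masks, which the real Pooling module handles, are ignored.

```python
def cls_pooling(token_embeddings):
    """Use the first token's embedding as the sentence vector."""
    return list(token_embeddings[0])

def mean_pooling(token_embeddings):
    """Average all token embeddings (shown only for contrast)."""
    count = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[j] for tok in token_embeddings) / count for j in range(dim)]

# Toy per-token outputs for one sentence: 3 tokens, 2 dimensions each.
tokens = [
    [0.5, -1.0],  # position 0, the [CLS] token
    [2.0, 3.0],
    [1.5, 1.0],
]

cls_vector = cls_pooling(tokens)
mean_vector = mean_pooling(tokens)
print(cls_vector)
print(mean_vector)
```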

Training Details

Dataset: triplet dataset
Size: 36,864 training samples
Columns: anchor, positive, negative_1, negative_2, and negative_3

Samples

anchor	positive	negative_1	negative_2	negative_3
`Is life really what I make of it?`	`Life is what you make it?`	`Is life hardly what I take of it?`	`Life is not entirely what I make of it.`	`Is life not what I make of it?`
`When you visit a website, can a person running the website see your IP address?`	`Does every website I visit knows my public ip address?`	`When you avoid a website, can a person hiding the website see your MAC address?`	`When you send an email, can the recipient see your physical location?`	`When you visit a website, a person running the website cannot see your IP address.`
`What are some cool features about iOS 10?`	`What are the best new features of iOS 10?`	`iOS 10 received criticism for its initial bugs and performance issues, and some users found the redesigned apps less intuitive compared to previous versions.`	`What are the drawbacks of using Android 14?`	`iOS 10 was widely criticized for its bugs, removal of beloved features, and generally being a downgrade from previous versions.`

Loss Function

MatryoshkaLoss with the following parameters:

{
    "loss": "CachedMultipleNegativesRankingLoss",
    "matryoshka_dims": [768,512,256,128,64],
    "matryoshka_weights": [1,1,1,1,1],
    "n_dims_per_step": -1
}
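
As a rough illustration of how these parameters combine, the sketch below applies a ranking loss to progressively truncated embeddings and sums the results with the given weights. Several things here are assumptions rather than the training code: the toy vectors are 8-dimensional, so the dimensions `[8, 4, 2]` stand in for `[768, 512, 256, 128, 64]`. The `scale=20.0` value is assumed. Only the explicit negatives of one row are used, with no in-batch negatives. The gradient caching that gives CachedMultipleNegativesRankingLoss its name affects memory use and is not modeled.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def ranking_loss(anchor, positive, negatives, scale=20.0):
    """Cross-entropy over scaled similarities, with the positive as the target."""
    logits = [scale * cosine(anchor, c) for c in [positive] + negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[0]

def matryoshka_loss(anchor, positive, negatives, dims, weights):
    """Weighted sum of the ranking loss over embeddings cut to each dimension."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        cut_negatives = [n[:dim] for n in negatives]
        total += weight * ranking_loss(anchor[:dim], positive[:dim], cut_negatives)
    return total

# Toy embeddings (assumption): the positive is close to the anchor, negatives are not.
anchor = [1.0, 0.5, 0.2, 0.1, 0.0, 0.3, 0.1, 0.2]
positive = [0.9, 0.6, 0.1, 0.2, 0.1, 0.2, 0.1, 0.1]
negatives = [
    [-1.0, 0.5, -0.2, 0.1, 0.0, -0.3, 0.1, 0.2],
    [0.0, -1.0, 0.5, -0.5, 0.2, 0.0, -0.1, 0.3],
]
dims, weights = [8, 4, 2], [1, 1, 1]

good = matryoshka_loss(anchor, positive, negatives, dims, weights)
# Swapping the positive with a negative should give a much larger loss.
bad = matryoshka_loss(anchor, negatives[0], [positive, negatives[1]], dims, weights)
print(good < bad)
```

Because every truncated prefix contributes to the loss, the leading dimensions are pushed to be useful on their own, which is what allows shorter embeddings to be used at inference time.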

Evaluation

medical redis quora negation

🔧 Technical Details

Training Logs

Epoch	Step	Training Loss	Triplet Loss
0.0556	1	6.4636	-
0.1111	2	6.1076	-
0.1667	3	5.8323	-
0.2222	4	5.6861	-
0.2778	5	5.5694	-
0.3333	6	5.2121	-
0.3889	7	5.0695	-
0.4444	8	4.81	-
0.5	9	4.6698	-
0.5556	10	4.3546	1.2224
0.6111	11	4.1922	-
0.6667	12	4.1434	-
0.7222	13	3.9918	-
0.7778	14	3.702	-
0.8333	15	3.6501	-
0.8889	16	3.6641	-
0.9444	17	3.3196	-
1.0	18	2.7108	-
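
The log can be cross-checked against the dataset size with a little arithmetic. The sketch assumes that the single epoch passes over all 36,864 samples exactly once and that no partial batch is dropped. The page does not state the batch size, so the per-step figure is an inference, not a documented setting.

```python
train_samples = 36_864   # from the training details
steps_per_epoch = 18     # last row of the log: step 18 at epoch 1.0

# Samples consumed per optimizer step under the assumptions above.
samples_per_step = train_samples / steps_per_epoch

# Epoch fraction after the first step, as it appears in the log's first row.
epoch_after_first_step = round(1 / steps_per_epoch, 4)

# Drop in training loss between the first and last logged steps.
loss_drop = round(6.4636 - 2.7108, 4)

print(samples_per_step)         # 2048.0
print(epoch_after_first_step)   # 0.0556
print(loss_drop)                # 3.7528
```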

Framework Versions

Python: 3.11.11
Sentence Transformers: 4.1.0
Transformers: 4.51.3
PyTorch: 2.6.0+cu124
Accelerate: 1.6.0
Datasets: 3.5.1
Tokenizers: 0.21.1
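
To compare a local environment against the versions listed above, a small check with the standard library's `importlib.metadata` might look like the following. The distribution names (for example `torch` for PyTorch and `sentence-transformers`) are assumptions about how the packages are published. `compare_versions` is a hypothetical helper, and a mismatch does not necessarily mean the model will fail to load.

```python
from importlib import metadata

# Versions listed on this page, keyed by assumed distribution name.
EXPECTED = {
    "sentence-transformers": "4.1.0",
    "transformers": "4.51.3",
    "torch": "2.6.0+cu124",
    "accelerate": "1.6.0",
    "datasets": "3.5.1",
    "tokenizers": "0.21.1",
}

def compare_versions(expected):
    """Map each package to (installed_version_or_None, expected_version, matches)."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[package] = (installed, wanted, installed == wanted)
    return report

report = compare_versions(EXPECTED)
for package, (installed, wanted, matches) in report.items():
    status = "ok" if matches else "differs"
    print(f"{package}: installed={installed} expected={wanted} ({status})")
```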

📄 License

The documentation does not specify a license.

📚 Citation

Redis Langcache-embed Models

If you use our models or build on our work, we encourage you to cite it:

@inproceedings{langcache-embed-v1,
    title = "Advancing Semantic Caching for LLMs with Domain-Specific Embeddings and Synthetic Data",
    author = "Gill, Cechmanek, Hutcherson, Rajamohan, Agarwal, Gulzar, Singh, Dion",
    month = "04",
    year = "2025",
    url = "https://arxiv.org/abs/2504.02268",
}

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}