MiniCPM-Embedding Open-Source Embedding Model - Free Support for Chinese-English Bilingual Retrieval Tasks

Developed by openbmb

MiniCPM-Embedding is an embedding model built on the MiniCPM-2B-sft-bf16 base model. It focuses on retrieval tasks and supports both Chinese and English.

Text Embedding

Transformers

Supports multiple languages #Multilingual retrieval #High-precision embeddings #Optimized for Chinese

Downloads: 315

Released: 9/4/2024

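Model Overview

The model is designed for text retrieval. It generates text embeddings suitable for a range of information retrieval scenarios.

Model Features

Bilingual support

Supports text retrieval in both Chinese and English.

Effective retrieval

Performs well across multiple retrieval tasks, and is particularly strong on Chinese retrieval tasks.

Lightweight

Built on MiniCPM-2B-sft-bf16, with a relatively small parameter count, making it suitable for resource-constrained environments.

Model Capabilities

Text embedding generation

Information retrieval

Bilingual retrieval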

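Use Cases

Information retrieval

Academic literature retrieval

Retrieving academic literature, such as the scientific documents in the SCIDOCS dataset.

NDCG@10: 22.38

Medical Q&A retrieval

Retrieving medical question-answer data, such as the CmedqaRetrieval dataset.

NDCG@10: 46.05

E-commerce product retrieval

Retrieving product information on e-commerce platforms, such as the EcomRetrieval dataset.

NDCG@10: 70.21

Question answering systems

Factual question answering

Retrieving evidence for answering factual questions, such as the task in the FEVER dataset.

NDCG@10: 90.76

Open-domain question answering

Retrieving passages for open-domain question answering, such as the NQ dataset.

NDCG@10: 69.29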

🚀 MiniCPM-Embedding

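MiniCPM-Embedding is a Chinese-English bilingual text embedding model developed jointly by ModelBest Inc., the Tsinghua University Natural Language Processing Lab (THUNLP), and the Information Retrieval Group at Northeastern University (NEUIR). It offers strong Chinese and English retrieval as well as strong Chinese-English cross-lingual retrieval.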

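🚀 Quick Start

Input Format

The model supports query-side instructions in the following format:

Instruction: {{ instruction }} Query: {{ query }}

For example:

Instruction: 为这个医学问题检索相关回答。Query: 咽喉癌的成因是什么？

Instruction: Given a claim about climate change, retrieve documents that support or refute the claim. Query: However the warming trend is slower than most climate models have forecast.

Instructions are optional. Without one, use the following format:

Query: {{ query }}

The instructions we used when evaluating on BEIR and C-MTEB/Retrieval are listed in instructions.json; other evaluations do not use instructions. On the document side, pass the raw document text directly.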
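
The two query formats above can be produced by a small helper. This is a minimal sketch: the function name `format_query` is hypothetical, and the single space between the instruction and `Query:` is an assumption taken from the template line rather than from the model's code.

```python
from typing import Optional


def format_query(query: str, instruction: Optional[str] = None) -> str:
    """Build the query-side input string (hypothetical helper).

    Documents are not formatted at all; they are passed to the model as raw text.
    """
    if instruction:
        # Assumption: one space separates the instruction from "Query:".
        return f"Instruction: {instruction} Query: {query}"
    return f"Query: {query}"


print(format_query("However the warming trend is slower than most climate models have forecast."))
print(format_query(
    "However the warming trend is slower than most climate models have forecast.",
    "Given a claim about climate change, retrieve documents that support or refute the claim.",
))
```

Keeping the formatting in one place makes it harder to apply the query prefix to documents by mistake.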

Requirements

transformers==4.37.2

Example Scripts

Hugging Face Transformers

from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model_name = "openbmb/MiniCPM-Embedding"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to("cuda")
# You can also use the following line to enable the Flash Attention 2 implementation
# model = AutoModel.from_pretrained(model_name, trust_remote_code=True, attn_implementation="flash_attention_2", torch_dtype=torch.float16).to("cuda")
model.eval()

# As we scale hidden states in `model.forward`, mean pooling here actually works as weighted mean pooling
def mean_pooling(hidden, attention_mask):
    s = torch.sum(hidden * attention_mask.unsqueeze(-1).float(), dim=1)
    d = attention_mask.sum(dim=1, keepdim=True).float()
    reps = s / d
    return reps

@torch.no_grad()
def encode(input_texts):
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True).to("cuda")
    
    outputs = model(**batch_dict)
    attention_mask = batch_dict["attention_mask"]
    hidden = outputs.last_hidden_state

    reps = mean_pooling(hidden, attention_mask)   
    embeddings = F.normalize(reps, p=2, dim=1).detach().cpu().numpy()
    return embeddings

queries = ["中国的首都是哪里？"]
passages = ["beijing", "shanghai"]


INSTRUCTION = "Query: "
queries = [INSTRUCTION + query for query in queries]

embeddings_query = encode(queries)
embeddings_doc = encode(passages)

scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist())  # [[0.3535913825035095, 0.18596848845481873]]

Sentence Transformers

import torch
from sentence_transformers import SentenceTransformer

model_name = "openbmb/MiniCPM-Embedding"
model = SentenceTransformer(model_name, trust_remote_code=True, model_kwargs={ "torch_dtype": torch.float16})
# You can also use the following line to enable the Flash Attention 2 implementation
# (attn_implementation is passed to the underlying Transformers model through model_kwargs)
# model = SentenceTransformer(model_name, trust_remote_code=True, model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": torch.float16})

queries = ["中国的首都是哪里？"]
passages = ["beijing", "shanghai"]

INSTRUCTION = "Query: "

embeddings_query = model.encode(queries, prompt=INSTRUCTION)
embeddings_doc = model.encode(passages)

scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist())  # [[0.35365450382232666, 0.18592746555805206]]
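
Because the embeddings are L2-normalized, the dot products in `scores` are cosine similarities, and retrieval amounts to sorting passages by score. The sketch below shows that last step in plain Python; the helper name `rank_passages` is hypothetical, and the scores are copied from the example output above rather than recomputed.

```python
from typing import List, Tuple


def rank_passages(scores: List[float], passages: List[str], top_k: int = 10) -> List[Tuple[str, float]]:
    """Return up to top_k (passage, score) pairs, highest score first."""
    if len(scores) != len(passages):
        raise ValueError("scores and passages must have the same length")
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# One query, two passages; values copied from the example output above.
scores = [[0.35365450382232666, 0.18592746555805206]]
passages = ["beijing", "shanghai"]

for passage, score in rank_passages(scores[0], passages, top_k=2):
    print(f"{score:.4f}  {passage}")
```

For large collections a vector index would normally replace the full sort; this version only illustrates how the score matrix is read.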

✨ Key Features

Strong Chinese and English retrieval.
Strong Chinese-English cross-lingual retrieval.

📦 Installation

Make sure transformers==4.37.2 is installed in your environment. You can install it with:

pip install transformers==4.37.2
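
To confirm that the pinned version is the one actually installed, a quick check with the standard library can help. This is a sketch only; the helper name `check_pinned` and its return strings are hypothetical.

```python
from importlib import metadata


def check_pinned(package: str, expected: str) -> str:
    """Compare the installed version of a package with the expected pin (hypothetical helper)."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return "missing"
    return "ok" if installed == expected else f"mismatch: found {installed}"


print(check_pinned("transformers", "4.37.2"))
```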

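📚 Documentation

Model Training

MiniCPM-Embedding is trained from MiniCPM-2B-sft-bf16. Architecturally, it uses bidirectional attention and Weighted Mean Pooling [1]. Training is multi-stage and uses about 6 million training examples in total, drawn from open-source, machine-generated, and proprietary data.

RAG Toolkit Series

See also the other models in the RAG toolkit series:

Retrieval model: MiniCPM-Embedding
Reranking model: MiniCPM-Reranker
LoRA plugin for RAG scenarios: MiniCPM3-RAG-LoRA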

Model Information

Property	Details
Model type	Chinese-English bilingual text embedding model
Model size	2.4B
Embedding dimension	2304
Max input tokens	512
Base model	openbmb/MiniCPM-2B-sft-bf16
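
The embedding dimension determines how much raw storage a collection of vectors needs. The arithmetic below is a rough sketch: the helper name is hypothetical, the one-million-document collection is an illustrative assumption, and any overhead added by a vector index is ignored.

```python
def embedding_storage_bytes(num_vectors: int, dim: int = 2304, bytes_per_value: int = 2) -> int:
    """Raw size of a dense embedding matrix, excluding any index overhead."""
    return num_vectors * dim * bytes_per_value


one_million = 1_000_000  # illustrative collection size, not a figure from the model card
fp16 = embedding_storage_bytes(one_million)                     # float16: 2 bytes per value
fp32 = embedding_storage_bytes(one_million, bytes_per_value=4)  # float32: 4 bytes per value
print(f"float16: {fp16 / 2**30:.2f} GiB, float32: {fp32 / 2**30:.2f} GiB")
```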

🔧 Technical Details

The model uses bidirectional attention and Weighted Mean Pooling [1], and is trained in multiple stages. Because model.forward scales the final hidden states, the mean pooling in the example scripts effectively acts as weighted mean pooling.

[1] Muennighoff, N. (2022). SGPT: GPT Sentence Embeddings for Semantic Search. arXiv preprint arXiv:2202.08904.
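
The equivalence between plain mean pooling over pre-scaled hidden states and weighted mean pooling can be checked on toy numbers. The sketch below assumes SGPT-style position weights, where token i (1-based) gets weight i / (1 + 2 + ... + n). The exact scaling applied inside `model.forward` is not described here, so treat the weights as an assumption. Padding and the attention mask are left out for brevity.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def weighted_mean_pooling(hidden: List[Vector]) -> Vector:
    """Position-weighted pooling: token i (1-based) has weight i / (1 + ... + n)."""
    n = len(hidden)
    total = n * (n + 1) / 2
    dim = len(hidden[0])
    return [sum((i + 1) / total * h[d] for i, h in enumerate(hidden)) for d in range(dim)]


def mean_pooling(hidden: List[Vector]) -> Vector:
    n = len(hidden)
    dim = len(hidden[0])
    return [sum(h[d] for h in hidden) / n for d in range(dim)]


# Toy hidden states: 3 tokens, 2 dimensions (made-up values).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Assumption: the forward pass multiplies token i by its 1-based position.
scaled = [[(i + 1) * x for x in h] for i, h in enumerate(hidden)]

weighted = l2_normalize(weighted_mean_pooling(hidden))
plain_on_scaled = l2_normalize(mean_pooling(scaled))
print(all(math.isclose(a, b) for a, b in zip(weighted, plain_on_scaled)))
```

The two poolings differ only by a constant factor, which the L2 normalization removes, so the final embeddings point in the same direction.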

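📄 License

The code in this repository is released under the Apache-2.0 license.
Use of the MiniCPM-Embedding model weights must follow the MiniCPM Model License.
The MiniCPM-Embedding model weights are fully open for academic research. For commercial use, please fill out this questionnaire.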

📊 Experimental Results

Chinese and English retrieval results

Model	C-MTEB/Retrieval (NDCG@10)	BEIR (NDCG@10)
bge-large-zh-v1.5	70.46	-
gte-large-zh	72.49	-
Zhihui_LLM_Embedding	76.74	-
bge-large-en-v1.5	-	54.29
gte-en-large-v1.5	-	57.91
NV-Retriever-v1	-	60.9
bge-en-icl	-	62.16
NV-Embed-v2	-	62.65
me5-large	63.66	51.43
bge-m3(Dense)	65.43	48.82
gte-multilingual-base(Dense)	71.95	51.08
gte-Qwen2-1.5B-instruct	71.86	58.29
gte-Qwen2-7B-instruct	76.03	60.25
bge-multilingual-gemma2	73.73	59.24
MiniCPM-Embedding	76.76	58.56
MiniCPM-Embedding+MiniCPM-Reranker	77.08	61.61
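
For readers unfamiliar with the metric, NDCG@10 for a single query can be sketched as follows. This version assumes linear gain and a log2 rank discount; the benchmark's own evaluation tooling may differ in details, and the function names and toy relevance labels are illustrative only. The table reports the score averaged over queries and multiplied by 100.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], all_relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ranking.

    ranked_relevances: relevance label of each retrieved document, in ranked order.
    all_relevances: labels of every judged-relevant document for the query.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Toy query: two relevant documents exist, retrieved at ranks 2 and 4.
print(round(100 * ndcg_at_k([0, 1, 0, 1], [1, 1]), 2))
```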

Chinese-English cross-lingual retrieval results

Model	MKQA En-Zh_CN (Recall@20)	NeuCLIR22 (NDCG@10)	NeuCLIR23 (NDCG@10)
me5-large	44.3	9.01	25.33
bge-m3(Dense)	66.4	30.49	41.09
gte-multilingual-base(Dense)	68.2	39.46	45.86
gte-Qwen2-1.5B-instruct	68.52	49.11	45.05
gte-Qwen2-7B-instruct	68.27	49.14	49.6
MiniCPM-Embedding	72.95	52.65	49.95
MiniCPM-Embedding+MiniCPM-Reranker	74.33	53.21	54.12
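
Recall@20 in the first column can be read with the common set-based definition sketched below: the fraction of a query's relevant documents that appear in the top 20 results. This is an assumption for illustration. The benchmark's own evaluation script may define a hit differently, and the function name and document IDs are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 20) -> float:
    """Fraction of relevant documents found in the top-k results for one query."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


# Toy query: two relevant documents, only one of them retrieved in the top 3.
print(recall_at_k(["d7", "d2", "d5"], {"d2", "d9"}, k=3))
```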