🚀 MiniCPM-Embedding
MiniCPM-Embedding is a bilingual Chinese-English text embedding model jointly developed by ModelBest Inc., the Natural Language Processing Laboratory of Tsinghua University (THUNLP), and the Information Retrieval Group of Northeastern University (NEUIR). It offers the following features:
- Exceptional retrieval capabilities for both Chinese and English texts.
- Outstanding cross-lingual retrieval capabilities between Chinese and English.
MiniCPM-Embedding is trained on top of MiniCPM-2B-sft-bf16. Structurally, it adopts bidirectional attention and Weighted Mean Pooling [1]. It uses a multi-stage training approach with approximately 6 million training examples, including open-source, machine-generated, and closed-source data.
We invite you to explore the RAG toolkit series.
[1] Muennighoff, N. (2022). SGPT: GPT Sentence Embeddings for Semantic Search. arXiv preprint arXiv:2202.08904.
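For intuition, SGPT-style weighted mean pooling takes a position-weighted average of the token hidden states, so later tokens contribute more to the embedding. The sketch below is purely illustrative: weighted_mean_pooling is a hypothetical helper, not taken from the model's remote code, and the usage examples later in this card apply plain mean pooling, presumably because the model's own forward pass already accounts for the weighting.

import torch

def weighted_mean_pooling(hidden, attention_mask):
    # Position weights 1..S (assumes right-padded batches); padding positions are zeroed out
    weights = torch.arange(1, hidden.size(1) + 1, device=hidden.device).float()
    weights = weights.unsqueeze(0) * attention_mask.float()
    weights = weights / weights.sum(dim=1, keepdim=True)
    # Weighted average over the sequence dimension -> one vector per input
    return torch.sum(hidden * weights.unsqueeze(-1), dim=1)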
✨ Features
- Bilingual Retrieval: Exceptional retrieval capabilities for both Chinese and English texts.
- Cross-lingual Retrieval: Outstanding cross-lingual retrieval capabilities between Chinese and English.
📦 Installation
Requirements
transformers==4.37.2
💻 Usage Examples
Basic Usage
Huggingface Transformers
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F
model_name = "openbmb/MiniCPM-Embedding"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to("cuda")
model.eval()
def mean_pooling(hidden, attention_mask):
    # Average the hidden states over non-padding tokens only
    s = torch.sum(hidden * attention_mask.unsqueeze(-1).float(), dim=1)
    d = attention_mask.sum(dim=1, keepdim=True).float()
    reps = s / d
    return reps
@torch.no_grad()
def encode(input_texts):
    # Tokenize, run the model, pool the last hidden states, and L2-normalize the embeddings
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True).to("cuda")
    outputs = model(**batch_dict)
    attention_mask = batch_dict["attention_mask"]
    hidden = outputs.last_hidden_state
    reps = mean_pooling(hidden, attention_mask)
    embeddings = F.normalize(reps, p=2, dim=1).detach().cpu().numpy()
    return embeddings
queries = ["中国的首都是哪里?"]
passages = ["beijing", "shanghai"]
INSTRUCTION = "Query: "
queries = [INSTRUCTION + query for query in queries]
embeddings_query = encode(queries)
embeddings_doc = encode(passages)
scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist())
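To turn the score matrix into a retrieval result, you can rank passages per query. The helper below is a minimal sketch built on the scores and passages variables from the example above; rank_passages is a name chosen here, not part of the model's API.

import numpy as np

def rank_passages(scores, passages, top_k=2):
    # For each query row, sort passage indices by descending similarity and keep the top_k
    ranked = []
    for row in scores:
        order = np.argsort(-row)[:top_k]
        ranked.append([(passages[i], float(row[i])) for i in order])
    return ranked

print(rank_passages(scores, passages))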
Sentence Transformers
import torch
from sentence_transformers import SentenceTransformer
model_name = "openbmb/MiniCPM-Embedding"
model = SentenceTransformer(model_name, trust_remote_code=True, model_kwargs={ "torch_dtype": torch.float16})
queries = ["中国的首都是哪里?"]
passages = ["beijing", "shanghai"]
INSTRUCTION = "Query: "
embeddings_query = model.encode(queries, prompt=INSTRUCTION)
embeddings_doc = model.encode(passages)
scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist())
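If you prefer not to compute the score matrix yourself, sentence-transformers also provides a semantic_search utility that returns the top-k corpus entries per query. The snippet below is a minimal sketch using the embeddings from the example above; top_k=2 is an arbitrary choice here.

from sentence_transformers import util

# semantic_search accepts the numpy embeddings returned by model.encode and
# ranks corpus entries by cosine similarity for each query
hits = util.semantic_search(embeddings_query, embeddings_doc, top_k=2)
for query, query_hits in zip(queries, hits):
    for hit in query_hits:
        print(query, passages[hit["corpus_id"]], hit["score"])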
Input Format
This model supports query-side instructions in the following format:
Instruction: {{ instruction }} Query: {{ query }}
For example:
Instruction: 为这个医学问题检索相关回答。Query: 咽喉癌的成因是什么?
Instruction: Given a claim about climate change, retrieve documents that support or refute the claim. Query: However the warming trend is slower than most climate models have forecast.
It also works in an instruction-free mode with the following format:
Query: {{ query }}
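A small helper can assemble these strings. build_query below is purely illustrative (not part of the released code) and simply follows the two formats shown above:

def build_query(query, instruction=None):
    # With an instruction: "Instruction: {instruction} Query: {query}"; otherwise: "Query: {query}"
    if instruction:
        return f"Instruction: {instruction} Query: {query}"
    return f"Query: {query}"

print(build_query(
    "However the warming trend is slower than most climate models have forecast.",
    "Given a claim about climate change, retrieve documents that support or refute the claim.",
))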
When running evaluation on BEIR and C-MTEB/Retrieval, we use the instructions in instructions.json. For other evaluations, we do not use instructions. On the document side, we directly use the bare document as the input.
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Size | 2.4B |
| Embedding Dimension | 2304 |
| Max Input Tokens | 512 |
Evaluation Results
CN/EN Retrieval Results
| Model | C-MTEB/Retrieval (NDCG@10) | BEIR (NDCG@10) |
|-------|----------------------------|----------------|
| bge-large-zh-v1.5 | 70.46 | - |
| gte-large-zh | 72.49 | - |
| Zhihui_LLM_Embedding | 76.74 | - |
| bge-large-en-v1.5 | - | 54.29 |
| gte-en-large-v1.5 | - | 57.91 |
| NV-Retriever-v1 | - | 60.9 |
| bge-en-icl | - | 62.16 |
| NV-Embed-v2 | - | 62.65 |
| me5-large | 63.66 | 51.43 |
| bge-m3 (Dense) | 65.43 | 48.82 |
| gte-multilingual-base (Dense) | 71.95 | 51.08 |
| gte-Qwen2-1.5B-instruct | 71.86 | 58.29 |
| gte-Qwen2-7B-instruct | 76.03 | 60.25 |
| bge-multilingual-gemma2 | 73.73 | 59.24 |
| MiniCPM-Embedding | 76.76 | 58.56 |
| MiniCPM-Embedding+MiniCPM-Reranker | 77.08 | 61.61 |
CN-EN Cross-lingual Retrieval Results
| Model | MKQA En-Zh_CN (Recall@20) | NeuCLIR22 (NDCG@10) | NeuCLIR23 (NDCG@10) |
|-------|---------------------------|---------------------|---------------------|
| me5-large | 44.3 | 9.01 | 25.33 |
| bge-m3 (Dense) | ... | ... | ... |