🚀 MiniCPM-Embedding
MiniCPM-Embedding is a bilingual Chinese-English text embedding model jointly developed by ModelBest Inc., the Natural Language Processing Laboratory of Tsinghua University (THUNLP), and the Information Retrieval Group of Northeastern University (NEUIR). It offers the following features:
- Exceptional retrieval capabilities for both Chinese and English texts.
- Outstanding cross-lingual retrieval capabilities between Chinese and English.
MiniCPM-Embedding is trained on top of MiniCPM-2B-sft-bf16. Structurally, it adopts bidirectional attention and Weighted Mean Pooling [1]. It uses a multi-stage training approach with approximately 6 million training examples, including open-source, machine-generated, and closed-source data.
We invite you to explore the RAG toolkit series.
[1] Muennighoff, N. (2022). SGPT: GPT Sentence Embeddings for Semantic Search. arXiv preprint arXiv:2202.08904.
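For intuition, SGPT-style weighted mean pooling takes a position-weighted average of the token hidden states, so later tokens contribute more to the embedding. The sketch below is purely illustrative: weighted_mean_pooling is a hypothetical helper, not taken from the model's remote code, and the usage examples later in this card apply plain mean pooling, presumably because the model's own forward pass already accounts for the weighting.

import torch

def weighted_mean_pooling(hidden, attention_mask):
    # Position weights 1..S (assumes right-padded batches); padding positions are zeroed out
    weights = torch.arange(1, hidden.size(1) + 1, device=hidden.device).float()
    weights = weights.unsqueeze(0) * attention_mask.float()
    weights = weights / weights.sum(dim=1, keepdim=True)
    # Weighted average over the sequence dimension -> one vector per input
    return torch.sum(hidden * weights.unsqueeze(-1), dim=1)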
✨ Features
- Bilingual Retrieval: Exceptional retrieval capabilities for both Chinese and English texts.
- Cross-lingual Retrieval: Outstanding cross-lingual retrieval capabilities between Chinese and English.
📦 Installation
Requirements
transformers==4.37.2
💻 Usage Examples
Basic Usage
Huggingface Transformers
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F
model_name = "openbmb/MiniCPM-Embedding"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to("cuda")
model.eval()
def mean_pooling(hidden, attention_mask):
    # Average the hidden states over non-padding tokens only
    s = torch.sum(hidden * attention_mask.unsqueeze(-1).float(), dim=1)
    d = attention_mask.sum(dim=1, keepdim=True).float()
    reps = s / d
    return reps
@torch.no_grad()
def encode(input_texts):
    # Tokenize, run the model, pool the last hidden states, and L2-normalize the embeddings
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True).to("cuda")
    outputs = model(**batch_dict)
    attention_mask = batch_dict["attention_mask"]
    hidden = outputs.last_hidden_state
    reps = mean_pooling(hidden, attention_mask)
    embeddings = F.normalize(reps, p=2, dim=1).detach().cpu().numpy()
    return embeddings
queries = ["中国的首都是哪里?"]
passages = ["beijing", "shanghai"]
INSTRUCTION = "Query: "
queries = [INSTRUCTION + query for query in queries]
embeddings_query = encode(queries)
embeddings_doc = encode(passages)
scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist())
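To turn the score matrix into a retrieval result, you can rank passages per query. The helper below is a minimal sketch built on the scores and passages variables from the example above; rank_passages is a name chosen here, not part of the model's API.

import numpy as np

def rank_passages(scores, passages, top_k=2):
    # For each query row, sort passage indices by descending similarity and keep the top_k
    ranked = []
    for row in scores:
        order = np.argsort(-row)[:top_k]
        ranked.append([(passages[i], float(row[i])) for i in order])
    return ranked

print(rank_passages(scores, passages))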
Sentence Transformers
import torch
from sentence_transformers import SentenceTransformer
model_name = "openbmb/MiniCPM-Embedding"
model = SentenceTransformer(model_name, trust_remote_code=True, model_kwargs={ "torch_dtype": torch.float16})
queries = ["中国的首都是哪里?"]
passages = ["beijing", "shanghai"]
INSTRUCTION = "Query: "
embeddings_query = model.encode(queries, prompt=INSTRUCTION)
embeddings_doc = model.encode(passages)
scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist())
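If you prefer not to compute the score matrix yourself, sentence-transformers also provides a semantic_search utility that returns the top-k corpus entries per query. The snippet below is a minimal sketch using the embeddings from the example above; top_k=2 is an arbitrary choice here.

from sentence_transformers import util

# semantic_search accepts the numpy embeddings returned by model.encode and
# ranks corpus entries by cosine similarity for each query
hits = util.semantic_search(embeddings_query, embeddings_doc, top_k=2)
for query, query_hits in zip(queries, hits):
    for hit in query_hits:
        print(query, passages[hit["corpus_id"]], hit["score"])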
Input Format
This model supports query-side instructions in the following format:
Instruction: {{ instruction }} Query: {{ query }}
For example:
Instruction: 为这个医学问题检索相关回答。Query: 咽喉癌的成因是什么?
Instruction: Given a claim about climate change, retrieve documents that support or refute the claim. Query: However the warming trend is slower than most climate models have forecast.
It also works in an instruction-free mode with the following format:
Query: {{ query }}
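A small helper can assemble these strings. build_query below is purely illustrative (not part of the released code) and simply follows the two formats shown above:

def build_query(query, instruction=None):
    # With an instruction: "Instruction: {instruction} Query: {query}"; otherwise: "Query: {query}"
    if instruction:
        return f"Instruction: {instruction} Query: {query}"
    return f"Query: {query}"

print(build_query(
    "However the warming trend is slower than most climate models have forecast.",
    "Given a claim about climate change, retrieve documents that support or refute the claim.",
))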
When running evaluation on BEIR and C-MTEB/Retrieval, we use the instructions in instructions.json. For other evaluations, we do not use instructions. On the document side, we directly use the bare document as the input.
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Size | 2.4B |
| Embedding Dimension | 2304 |
| Max Input Tokens | 512 |
Evaluation Results
CN/EN Retrieval Results
| Model | C-MTEB/Retrieval (NDCG@10) | BEIR (NDCG@10) |
|-------|----------------------------|----------------|
| bge-large-zh-v1.5 | 70.46 | - |
| gte-large-zh | 72.49 | - |
| Zhihui_LLM_Embedding | 76.74 | - |
| bge-large-en-v1.5 | - | 54.29 |
| gte-en-large-v1.5 | - | 57.91 |
| NV-Retriever-v1 | - | 60.9 |
| bge-en-icl | - | 62.16 |
| NV-Embed-v2 | - | 62.65 |
| me5-large | 63.66 | 51.43 |
| bge-m3 (Dense) | 65.43 | 48.82 |
| gte-multilingual-base (Dense) | 71.95 | 51.08 |
| gte-Qwen2-1.5B-instruct | 71.86 | 58.29 |
| gte-Qwen2-7B-instruct | 76.03 | 60.25 |
| bge-multilingual-gemma2 | 73.73 | 59.24 |
| MiniCPM-Embedding | 76.76 | 58.56 |
| MiniCPM-Embedding+MiniCPM-Reranker | 77.08 | 61.61 |
CN-EN Cross-lingual Retrieval Results
| Model | MKQA En-Zh_CN (Recall@20) | NeuCLIR22 (NDCG@10) | NeuCLIR23 (NDCG@10) |
|-------|---------------------------|---------------------|---------------------|
| me5-large | 44.3 | 9.01 | 25.33 |
| bge-m3 (Dense) | ... | ... | ... |