# BCEmbedding: Bilingual and Crosslingual Embedding for RAG
`BCEmbedding` is a library of bilingual and cross-lingual semantic representation models developed by NetEase Youdao. It contains two types of base models, `EmbeddingModel` and `RerankerModel`, which play crucial roles in semantic search and question-answering scenarios.

The latest and most detailed information about bce-reranker-base_v1 can be found in the project repository: https://github.com/netease-youdao/BCEmbedding.
## ⨠Features
- Multilingual and Cross-lingual Capability: Supports four languages, English, Chinese, Japanese, and Korean, with cross-lingual capability among them.
- RAG Adaptation: Optimized for RAG, it adapts to many real-world business scenarios, including Education, Law, Finance, Medical, Literature, FAQ, Textbook, Wikipedia, etc.
- Long-Text Reranking: `BCEmbedding` is adapted to rerank long texts.
- Meaningful Similarity Scores: The `RerankerModel` provides "meaningful" similarity scores for filtering bad passages; a threshold of 0.35 or 0.4 is recommended.
- Best Practice: First, use [bce-embedding-base_v1](https://huggingface.co/maidalun1020/bce-embedding-base_v1) to recall the top 50-100 passages. Then, use [bce-reranker-base_v1](https://huggingface.co/maidalun1020/bce-reranker-base_v1) to rerank these passages and finally select the top 5-10. A sketch of this pipeline appears right after this list.
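
A minimal sketch of this recall-then-rerank pipeline, assuming a toy in-memory corpus; the query, corpus, recall size, and the 0.35 threshold are illustrative, not part of the library:

```python
import numpy as np
from BCEmbedding import EmbeddingModel, RerankerModel

query = 'What is semantic search?'                        # hypothetical query
corpus = ['Semantic search matches meaning, not keywords.',
          'Bananas are a good source of potassium.',
          'Dense retrieval encodes text as vectors.']     # toy corpus

# Stage 1: dual-encoder recall (use top 50-100 on a real corpus).
embedder = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")
query_emb = embedder.encode([query])
corpus_embs = embedder.encode(corpus)
# L2-normalize so the dot product equals cosine similarity.
query_emb = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
corpus_embs = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
sims = (query_emb @ corpus_embs.T)[0]
candidates = [corpus[i] for i in sims.argsort()[::-1][:2]]  # toy recall size

# Stage 2: cross-encoder rerank, then filter by the recommended threshold.
reranker = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")
scores = reranker.compute_score([[query, p] for p in candidates])
kept = sorted(((p, s) for p, s in zip(candidates, scores) if s >= 0.35),
              key=lambda x: x[1], reverse=True)
print(kept)  # final top passages for the generator
```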
## đ Bilingual and Crosslingual Superiority
Existing embedding models often face performance challenges in bilingual and cross-lingual scenarios, especially in Chinese, English, and their cross-lingual tasks. Leveraging the strength of Youdao's translation engine, `BCEmbedding` delivers superior performance in monolingual, bilingual, and cross-lingual settings. `EmbeddingModel` supports Chinese (ch) and English (en) (more languages will be supported soon), while `RerankerModel` supports Chinese (ch), English (en), Japanese (ja), and Korean (ko).
## đĄ Key Features
- Bilingual and Cross-lingual Proficiency: Powered by Youdao's translation engine, it excels in Chinese, English, and their cross-lingual retrieval tasks, and will support additional languages in the future.
- RAG-Optimized: Tailored for diverse RAG tasks such as translation, summarization, and question answering, with targeted optimization for query understanding. See the RAG evaluations in LlamaIndex.
- Efficient and Precise Retrieval: The `EmbeddingModel` uses a dual encoder for efficient retrieval in the first stage, and the `RerankerModel` uses a cross encoder for higher-precision semantic reranking in the second stage.
- Broad Domain Adaptability: Trained on diverse datasets to achieve better performance across various fields.
- User-Friendly Design: No special instruction prefix is required for semantic retrieval.
- Meaningful Reranking Scores: The `RerankerModel` provides relevance scores to improve result quality and optimize large language model performance.
- Proven in Production: It has been successfully implemented and validated in Youdao's products.
## đ Latest Updates
- 2024-01-03: Model Releases - [bce-embedding-base_v1](https://huggingface.co/maidalun1020/bce-embedding-base_v1) and [bce-reranker-base_v1](https://huggingface.co/maidalun1020/bce-reranker-base_v1) are available.
- 2024-01-03: Eval Datasets [CrosslingualMultiDomainsDataset] - Evaluate the performance of RAG using LlamaIndex.
- 2024-01-03: Eval Datasets [Details] - Evaluate the performance of cross-lingual semantic representation using [MTEB](https://github.com/embeddings-benchmark/mteb); a hedged sketch follows.
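
For the MTEB item above, a hedged sketch of running a single evaluation task with the open-source [mteb](https://github.com/embeddings-benchmark/mteb) package; loading the checkpoint through sentence_transformers and the STS17 task choice are assumptions, not the project's official evaluation scripts:

```python
# Hedged sketch: evaluate the embedding model on one MTEB task.
# Assumes the checkpoint loads via sentence_transformers; the task
# choice (STS17, a cross-lingual STS task) is illustrative.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("maidalun1020/bce-embedding-base_v1")
evaluation = MTEB(tasks=["STS17"])
results = evaluation.run(model, output_folder="results/bce-embedding-base_v1")
print(results)
```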
## đ Model List

| Model Name | Model Type | Languages | Parameters | Weights |
|---|---|---|---|---|
| bce-embedding-base_v1 | `EmbeddingModel` | ch, en | 279M | [download](https://huggingface.co/maidalun1020/bce-embedding-base_v1) |
| bce-reranker-base_v1 | `RerankerModel` | ch, en, ja, ko | 279M | [download](https://huggingface.co/maidalun1020/bce-reranker-base_v1) |
## đ Documentation

### đĻ Installation
First, create a conda environment and activate it:

```bash
conda create --name bce python=3.10 -y
conda activate bce
```

Then install `BCEmbedding` (minimal installation):

```bash
pip install BCEmbedding==0.1.1
```

Or install from source:

```bash
git clone git@github.com:netease-youdao/BCEmbedding.git
cd BCEmbedding
pip install -v -e .
```
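
To verify the installation, a quick smoke test; the expected 768-dimensional output reflects the model card, and the first run downloads the weights:

```python
# Smoke test: the package imports and the embedding model loads.
# The first run downloads the model weights from Hugging Face.
from BCEmbedding import EmbeddingModel

model = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")
print(model.encode(["hello world"]).shape)  # expected: (1, 768)
```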
### đģ Usage Examples

#### Basic Usage Based on `BCEmbedding`

```python
from BCEmbedding import EmbeddingModel

# Embed a batch of sentences with the dual-encoder EmbeddingModel.
sentences = ['sentence_0', 'sentence_1', ...]
model = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")
embeddings = model.encode(sentences)
```

```python
from BCEmbedding import RerankerModel

# Score and rerank passages against a query with the cross-encoder RerankerModel.
query = 'input_query'
passages = ['passage_0', 'passage_1', ...]
sentence_pairs = [[query, passage] for passage in passages]
model = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")
scores = model.compute_score(sentence_pairs)
rerank_results = model.rerank(query, passages)
```
â ī¸ Important Note: The `RerankerModel.rerank` method provides advanced preprocessing that constructs `sentence_pairs` automatically when the passages are very long.
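
A hedged sketch of handling `rerank` results with long passages; the result keys (`rerank_passages`, `rerank_scores`) follow the project README but may differ across versions, so verify them against your installed release:

```python
from BCEmbedding import RerankerModel

# Hypothetical query and an over-length passage; rerank() splits long
# passages internally before scoring (see the note above).
query = 'input_query'
long_passages = ['passage_0 ' * 1000, 'passage_1']

model = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")
results = model.rerank(query, long_passages)

# Assumed result keys; check them against your installed version.
for passage, score in zip(results['rerank_passages'], results['rerank_scores']):
    print(round(score, 3), passage[:60])
```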
#### Basic Usage Based on `transformers`

```python
from transformers import AutoModel, AutoTokenizer

sentences = ['sentence_0', 'sentence_1', ...]

# Load the tokenizer and the embedding model.
tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-embedding-base_v1')
model = AutoModel.from_pretrained('maidalun1020/bce-embedding-base_v1')

device = 'cuda'  # switch to 'cpu' if no GPU is available
model.to(device)

# Tokenize and move the inputs to the target device.
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")
inputs_on_device = {k: v.to(device) for k, v in inputs.items()}

# Forward pass; take the [CLS] embedding and L2-normalize it.
outputs = model(**inputs_on_device, return_dict=True)
embeddings = outputs.last_hidden_state[:, 0]
embeddings = embeddings / embeddings.norm(dim=1, keepdim=True)
```
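
By analogy, a hedged sketch of scoring query-passage pairs with the reranker weights directly through `transformers`; it assumes the checkpoint is a standard sequence-classification model whose single relevance logit maps to (0, 1) through a sigmoid:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical query and passages to score.
query = 'input_query'
passages = ['passage_0', 'passage_1']
sentence_pairs = [[query, passage] for passage in passages]

tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-reranker-base_v1')
model = AutoModelForSequenceClassification.from_pretrained('maidalun1020/bce-reranker-base_v1')
model.eval()

with torch.no_grad():
    inputs = tokenizer(sentence_pairs, padding=True, truncation=True,
                       max_length=512, return_tensors="pt")
    logits = model(**inputs, return_dict=True).logits.view(-1).float()
    scores = torch.sigmoid(logits)  # assumed mapping of logits to relevance scores
print(scores.tolist())
```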
## đ License
This project is licensed under the Apache-2.0 license. You can find the detailed license information at [LICENSE](https://github.com/netease-youdao/BCEmbedding/blob/master/LICENSE).