# BCEmbedding: Bilingual and Crosslingual Embedding for RAG
`BCEmbedding` is a library of bilingual and cross-lingual semantic representation models developed by NetEase Youdao. It contains two types of base models, `EmbeddingModel` and `RerankerModel`, which play crucial roles in semantic search and question-answering scenarios.

The latest and most detailed information about bce-reranker-base_v1 can be found in the project repository: https://github.com/netease-youdao/BCEmbedding.
## ⨠Features
- Multilingual and Cross-lingual Capability: Supports four languages, English, Chinese, Japanese, and Korean, with cross-lingual capability among them.
- RAG Adaptation: Optimized for RAG, it adapts to many real-world business scenarios, including Education, Law, Finance, Medical, Literature, FAQ, Textbook, Wikipedia, etc.
- Long-Text Reranking: `BCEmbedding` is adapted to rerank long texts.
- Meaningful Similarity Scores: The `RerankerModel` provides "meaningful" similarity scores for filtering bad passages; a threshold of 0.35 or 0.4 is recommended.
- Best Practice: First, use [bce-embedding-base_v1](https://huggingface.co/maidalun1020/bce-embedding-base_v1) to recall the top 50-100 passages. Then, use [bce-reranker-base_v1](https://huggingface.co/maidalun1020/bce-reranker-base_v1) to rerank these passages and finally select the top 5-10. A sketch of this pipeline appears right after this list.
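
A minimal sketch of this recall-then-rerank pipeline, assuming a toy in-memory corpus; the query, corpus, recall size, and the 0.35 threshold are illustrative, not part of the library:

```python
import numpy as np
from BCEmbedding import EmbeddingModel, RerankerModel

query = 'What is semantic search?'                        # hypothetical query
corpus = ['Semantic search matches meaning, not keywords.',
          'Bananas are a good source of potassium.',
          'Dense retrieval encodes text as vectors.']     # toy corpus

# Stage 1: dual-encoder recall (use top 50-100 on a real corpus).
embedder = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")
query_emb = embedder.encode([query])
corpus_embs = embedder.encode(corpus)
# L2-normalize so the dot product equals cosine similarity.
query_emb = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
corpus_embs = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
sims = (query_emb @ corpus_embs.T)[0]
candidates = [corpus[i] for i in sims.argsort()[::-1][:2]]  # toy recall size

# Stage 2: cross-encoder rerank, then filter by the recommended threshold.
reranker = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")
scores = reranker.compute_score([[query, p] for p in candidates])
kept = sorted(((p, s) for p, s in zip(candidates, scores) if s >= 0.35),
              key=lambda x: x[1], reverse=True)
print(kept)  # final top passages for the generator
```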
## đ Bilingual and Crosslingual Superiority
Existing embedding models often face performance challenges in bilingual and cross-lingual scenarios, especially in Chinese, English, and their cross-lingual tasks. Leveraging the strength of Youdao's translation engine, `BCEmbedding` delivers superior performance in monolingual, bilingual, and cross-lingual settings. `EmbeddingModel` supports Chinese (ch) and English (en) (more languages will be supported soon), while `RerankerModel` supports Chinese (ch), English (en), Japanese (ja), and Korean (ko).
## đĄ Key Features
- Bilingual and Cross-lingual Proficiency: Powered by Youdao's translation engine, it excels in Chinese, English, and their cross-lingual retrieval tasks, and will support additional languages in the future.
- RAG-Optimized: Tailored for diverse RAG tasks such as translation, summarization, and question answering, with targeted optimization for query understanding. See the RAG evaluations in LlamaIndex.
- Efficient and Precise Retrieval: The `EmbeddingModel` uses a dual encoder for efficient retrieval in the first stage, and the `RerankerModel` uses a cross encoder for higher-precision semantic reranking in the second stage.
- Broad Domain Adaptability: Trained on diverse datasets to achieve better performance across various fields.
- User-Friendly Design: No special instruction prefix is required for semantic retrieval.
- Meaningful Reranking Scores: The `RerankerModel` provides relevance scores to improve result quality and optimize large language model performance.
- Proven in Production: It has been successfully implemented and validated in Youdao's products.
## đ Latest Updates
- 2024-01-03: Model Releases - [bce-embedding-base_v1](https://huggingface.co/maidalun1020/bce-embedding-base_v1) and [bce-reranker-base_v1](https://huggingface.co/maidalun1020/bce-reranker-base_v1) are available.
- 2024-01-03: Eval Datasets [CrosslingualMultiDomainsDataset] - Evaluate the performance of RAG using LlamaIndex.
- 2024-01-03: Eval Datasets [Details] - Evaluate the performance of cross-lingual semantic representation using [MTEB](https://github.com/embeddings-benchmark/mteb); a hedged sketch follows.
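
For the MTEB item above, a hedged sketch of running a single evaluation task with the open-source [mteb](https://github.com/embeddings-benchmark/mteb) package; loading the checkpoint through sentence_transformers and the STS17 task choice are assumptions, not the project's official evaluation scripts:

```python
# Hedged sketch: evaluate the embedding model on one MTEB task.
# Assumes the checkpoint loads via sentence_transformers; the task
# choice (STS17, a cross-lingual STS task) is illustrative.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("maidalun1020/bce-embedding-base_v1")
evaluation = MTEB(tasks=["STS17"])
results = evaluation.run(model, output_folder="results/bce-embedding-base_v1")
print(results)
```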
## đ Model List

| Model Name | Model Type | Languages | Parameters | Weights |
|---|---|---|---|---|
| bce-embedding-base_v1 | `EmbeddingModel` | ch, en | 279M | [download](https://huggingface.co/maidalun1020/bce-embedding-base_v1) |
| bce-reranker-base_v1 | `RerankerModel` | ch, en, ja, ko | 279M | [download](https://huggingface.co/maidalun1020/bce-reranker-base_v1) |
## đ Documentation

### đĻ Installation
First, create a conda environment and activate it:

```bash
conda create --name bce python=3.10 -y
conda activate bce
```

Then install `BCEmbedding` (minimal installation):

```bash
pip install BCEmbedding==0.1.1
```

Or install from source:

```bash
git clone git@github.com:netease-youdao/BCEmbedding.git
cd BCEmbedding
pip install -v -e .
```
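
To verify the installation, a quick smoke test; the expected 768-dimensional output reflects the model card, and the first run downloads the weights:

```python
# Smoke test: the package imports and the embedding model loads.
# The first run downloads the model weights from Hugging Face.
from BCEmbedding import EmbeddingModel

model = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")
print(model.encode(["hello world"]).shape)  # expected: (1, 768)
```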
### đģ Usage Examples

#### Basic Usage Based on `BCEmbedding`

```python
from BCEmbedding import EmbeddingModel

# Embed a batch of sentences with the dual-encoder EmbeddingModel.
sentences = ['sentence_0', 'sentence_1', ...]
model = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")
embeddings = model.encode(sentences)
```

```python
from BCEmbedding import RerankerModel

# Score and rerank passages against a query with the cross-encoder RerankerModel.
query = 'input_query'
passages = ['passage_0', 'passage_1', ...]
sentence_pairs = [[query, passage] for passage in passages]
model = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")
scores = model.compute_score(sentence_pairs)
rerank_results = model.rerank(query, passages)
```
â ī¸ Important Note: The `RerankerModel.rerank` method provides advanced preprocessing that constructs `sentence_pairs` automatically when the passages are very long.
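
A hedged sketch of handling `rerank` results with long passages; the result keys (`rerank_passages`, `rerank_scores`) follow the project README but may differ across versions, so verify them against your installed release:

```python
from BCEmbedding import RerankerModel

# Hypothetical query and an over-length passage; rerank() splits long
# passages internally before scoring (see the note above).
query = 'input_query'
long_passages = ['passage_0 ' * 1000, 'passage_1']

model = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")
results = model.rerank(query, long_passages)

# Assumed result keys; check them against your installed version.
for passage, score in zip(results['rerank_passages'], results['rerank_scores']):
    print(round(score, 3), passage[:60])
```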
#### Basic Usage Based on `transformers`

```python
from transformers import AutoModel, AutoTokenizer

sentences = ['sentence_0', 'sentence_1', ...]

# Load the tokenizer and the embedding model.
tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-embedding-base_v1')
model = AutoModel.from_pretrained('maidalun1020/bce-embedding-base_v1')

device = 'cuda'  # switch to 'cpu' if no GPU is available
model.to(device)

# Tokenize and move the inputs to the target device.
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")
inputs_on_device = {k: v.to(device) for k, v in inputs.items()}

# Forward pass; take the [CLS] embedding and L2-normalize it.
outputs = model(**inputs_on_device, return_dict=True)
embeddings = outputs.last_hidden_state[:, 0]
embeddings = embeddings / embeddings.norm(dim=1, keepdim=True)
```
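
By analogy, a hedged sketch of scoring query-passage pairs with the reranker weights directly through `transformers`; it assumes the checkpoint is a standard sequence-classification model whose single relevance logit maps to (0, 1) through a sigmoid:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical query and passages to score.
query = 'input_query'
passages = ['passage_0', 'passage_1']
sentence_pairs = [[query, passage] for passage in passages]

tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-reranker-base_v1')
model = AutoModelForSequenceClassification.from_pretrained('maidalun1020/bce-reranker-base_v1')
model.eval()

with torch.no_grad():
    inputs = tokenizer(sentence_pairs, padding=True, truncation=True,
                       max_length=512, return_tensors="pt")
    logits = model(**inputs, return_dict=True).logits.view(-1).float()
    scores = torch.sigmoid(logits)  # assumed mapping of logits to relevance scores
print(scores.tolist())
```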
## đ License
This project is licensed under the Apache-2.0 license. You can find the detailed license information at [LICENSE](https://github.com/netease-youdao/BCEmbedding/blob/master/LICENSE).