Qwen3-Reranker-4B-seq: an open-source text reranking model with multilingual support and strong text retrieval performance

Qwen3 Reranker 4B Seq

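Developed by michaelfeil

Qwen3-Reranker-4B is a 4B-parameter text reranking model from the Qwen family. It supports more than 100 languages and performs strongly on text retrieval tasks.

Text Embedding

Transformers

License: Apache-2.0 #Multilingual reranking #Long-text understanding #Instruction customization

Downloads: 122

Release date: 6/6/2025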

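Model Overview

A text reranking model built on the Qwen3 series, designed to improve the ordering of retrieval results. It supports custom instructions and multilingual scenarios.

Model Features

Multilingual capability

Reranks text in more than 100 natural languages and programming languages

Instruction awareness

Supports custom instructions to improve performance on specific tasks and scenarios

Long-text processing

Supports a context window of up to 32k tokens

Flexible sizes

Available in 0.6B, 4B, and 8B parameter sizes

Model Capabilities

Text reranking

Cross-lingual retrieval

Code retrieval

Long-document processing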

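Use Cases

Information retrieval

Search engine result optimization

Rerank the top 100 results returned by a search engine

Scores 69.76 on MTEB-R (the English retrieval subset) and 72.74 on MMTEB-R (the multilingual retrieval subset)

Knowledge management

Enterprise knowledge base optimization

Improve the relevance ordering of internal knowledge base search results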

🚀 Qwen3-Reranker-4B

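The Qwen3 Embedding model series is the latest proprietary model family from Qwen, designed for text embedding and ranking tasks. Built on the dense foundation models of the Qwen3 series, it provides text embedding and reranking models in several sizes (0.6B, 4B, and 8B). The series inherits the multilingual capability, long-text understanding, and reasoning skills of its foundation models, and it makes significant advances on many text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining.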

🚀 Quick Start

With Transformers versions earlier than 4.51.0, you may encounter the following error:

KeyError: 'qwen3'
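
A quick way to catch this before loading the model is to compare the installed version against the 4.51.0 minimum. The sketch below is an illustration only: `parse_version` and `supports_qwen3` are hypothetical helper names, and the parser looks only at the leading numeric part of the version string, so pre-release suffixes such as `.dev0` are ignored.

```python
import re

def parse_version(text):
    """Return the leading numeric release components as a tuple of at least three ints."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    numbers = [int(part) for part in match.group(0).split(".")]
    # Pad short versions such as "4.51" so they compare equal to "4.51.0"
    numbers += [0] * (3 - len(numbers))
    return tuple(numbers)

def supports_qwen3(installed, minimum="4.51.0"):
    """True when the installed version is at least the minimum stated above."""
    return parse_version(installed) >= parse_version(minimum)
```

In practice the argument would be `transformers.__version__`; if the check fails, upgrading Transformers should resolve the `KeyError`.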

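✨ Key Features

Exceptional versatility: The embedding models achieve state-of-the-art results across a wide range of downstream evaluations. The 8B embedding model ranks No. 1 on the MTEB multilingual leaderboard (as of June 5, 2025, score 70.58), and the reranking models perform strongly across a variety of text retrieval scenarios.
Comprehensive flexibility: The Qwen3 Embedding series offers a full range of sizes (0.6B to 8B) for both embedding and reranking models, covering use cases that weigh efficiency and effectiveness differently. Developers can combine the two modules seamlessly. In addition, the embedding models allow flexible vector definitions across all dimensions, and both the embedding and reranking models support user-defined instructions to improve performance for specific tasks, languages, or scenarios.
Multilingual capability: Thanks to the multilingual capability of the Qwen3 models, the Qwen3 Embedding series supports more than 100 languages, including various programming languages, and provides strong multilingual, cross-lingual, and code retrieval capabilities.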

💻 Usage Examples

Basic usage (Transformers)

# Requires transformers>=4.51.0
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def format_instruction(instruction, query, doc):
    if instruction is None:
        instruction = 'Given a web search query, retrieve relevant passages that answer the query'
    output = "<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}".format(instruction=instruction,query=query, doc=doc)
    return output

def process_inputs(pairs):
    inputs = tokenizer(
        pairs, padding=False, truncation='longest_first',
        return_attention_mask=False, max_length=max_length - len(prefix_tokens) - len(suffix_tokens)
    )
    for i, ele in enumerate(inputs['input_ids']):
        inputs['input_ids'][i] = prefix_tokens + ele + suffix_tokens
    inputs = tokenizer.pad(inputs, padding=True, return_tensors="pt", max_length=max_length)
    for key in inputs:
        inputs[key] = inputs[key].to(model.device)
    return inputs

@torch.no_grad()
def compute_logits(inputs, **kwargs):
    batch_scores = model(**inputs).logits[:, -1, :]
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    scores = batch_scores[:, 1].exp().tolist()
    return scores

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Reranker-4B", padding_side='left')
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-4B").eval()

# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-4B", torch_dtype=torch.float16, attn_implementation="flash_attention_2").cuda().eval()

token_false_id = tokenizer.convert_tokens_to_ids("no")
token_true_id = tokenizer.convert_tokens_to_ids("yes")
max_length = 8192

prefix = "<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".<|im_end|>\n<|im_start|>user\n"
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
prefix_tokens = tokenizer.encode(prefix, add_special_tokens=False)
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)
        
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = ["What is the capital of China?",
    "Explain gravity",
]

documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

pairs = [format_instruction(task, query, doc) for query, doc in zip(queries, documents)]

# Tokenize the input texts
inputs = process_inputs(pairs)
scores = compute_logits(inputs)

print("scores: ", scores)
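
The example above scores each query against one paired document. In a reranking setting, one query is usually scored against many candidates and the candidates are then sorted by score. A minimal sketch of that step follows. `rerank` and `toy_score_fn` are hypothetical names, and the toy scorer (word overlap) only stands in for the model so that the snippet runs on its own. In real use, `score_fn` would wrap `format_instruction`, `process_inputs`, and `compute_logits` from the example above.

```python
def rerank(query, documents, score_fn, top_k=None):
    """Score every (query, document) pair and return (document, score) tuples, best first."""
    scores = score_fn([(query, doc) for doc in documents])
    ranked = sorted(zip(documents, scores), key=lambda item: item[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]

def toy_score_fn(pairs):
    """Hypothetical stand-in scorer: fraction of query words that appear in the document."""
    scores = []
    for query, doc in pairs:
        query_words = set(query.lower().split())
        doc_words = set(doc.lower().split())
        scores.append(len(query_words & doc_words) / max(len(query_words), 1))
    return scores

candidates = [
    "Gravity is a force that attracts two bodies towards each other.",
    "The capital of China is Beijing.",
]
ranked = rerank("capital of China", candidates, toy_score_fn)
print(ranked[0][0])
```

Passing the scorer in as a function keeps the sorting logic independent of how the scores are produced, so the same helper works with either the Transformers or the vLLM path.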

Advanced usage (vLLM)

# Requires vllm>=0.8.5
import math

import torch
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import destroy_model_parallel
from vllm.inputs.data import TokensPrompt

def format_instruction(instruction, query, doc):
    text = [
        {"role": "system", "content": "Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\"."},
        {"role": "user", "content": f"<Instruct>: {instruction}\n\n<Query>: {query}\n\n<Document>: {doc}"}
    ]
    return text

def process_inputs(pairs, instruction, max_length, suffix_tokens):
    messages = [format_instruction(instruction, query, doc) for query, doc in pairs]
    messages =  tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=False, enable_thinking=False
    )
    messages = [ele[:max_length] + suffix_tokens for ele in messages]
    messages = [TokensPrompt(prompt_token_ids=ele) for ele in messages]
    return messages

def compute_logits(model, messages, sampling_params, true_token, false_token):
    outputs = model.generate(messages, sampling_params, use_tqdm=False)
    scores = []
    for i in range(len(outputs)):
        # Log-probabilities of the single generated token, keyed by token id
        final_logits = outputs[i].outputs[0].logprobs[-1]
        if true_token not in final_logits:
            true_logit = -10
        else:
            true_logit = final_logits[true_token].logprob
        if false_token not in final_logits:
            false_logit = -10
        else:
            false_logit = final_logits[false_token].logprob
        true_score = math.exp(true_logit)
        false_score = math.exp(false_logit)
        score = true_score / (true_score + false_score)
        scores.append(score)
    return scores

number_of_gpu = torch.cuda.device_count()
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-Reranker-4B')
model = LLM(model='Qwen/Qwen3-Reranker-4B', tensor_parallel_size=number_of_gpu, max_model_len=10000, enable_prefix_caching=True, gpu_memory_utilization=0.8)
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
max_length=8192
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)
true_token = tokenizer("yes", add_special_tokens=False).input_ids[0]
false_token = tokenizer("no", add_special_tokens=False).input_ids[0]
sampling_params = SamplingParams(temperature=0, 
    max_tokens=1,
    logprobs=20, 
    allowed_token_ids=[true_token, false_token],
)

        
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = ["What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

pairs = list(zip(queries, documents))
inputs = process_inputs(pairs, task, max_length-len(suffix_tokens), suffix_tokens)
scores = compute_logits(model, inputs, sampling_params, true_token, false_token)
print('scores', scores)

destroy_model_parallel()
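
Both examples reduce to the same quantity: the probability of "yes" after restricting the next-token distribution to the two tokens "yes" and "no". The Transformers example applies a two-way softmax to the raw logits; the vLLM example exponentiates log-probabilities and renormalizes, which gives the same ratio because the shared normalizer cancels. The sketch below illustrates that this is equivalent to a sigmoid of the logit difference. The function names are hypothetical and the input values are made up for illustration.

```python
import math

def yes_probability(yes_logit, no_logit):
    """Two-way softmax over the yes/no logits; returns P(yes)."""
    peak = max(yes_logit, no_logit)  # subtract the max for numerical stability
    yes_exp = math.exp(yes_logit - peak)
    no_exp = math.exp(no_logit - peak)
    return yes_exp / (yes_exp + no_exp)

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative values only: the relevance score depends only on the difference
score = yes_probability(2.0, -1.0)
print(score, sigmoid(2.0 - (-1.0)))
```

A score above 0.5 therefore means the model assigns a higher logit to "yes" than to "no" for that query-document pair.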

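📚 Detailed Documentation

Qwen3-Reranker-4B model information

Property	Details
Model type	Text reranking
Supported languages	100+ languages
Number of parameters	4B
Context length	32k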

Qwen3 Embedding series model list

Model Type	Model	Size	Layers	Sequence Length	Embedding Dimension	MRL Support	Instruction Aware
Text embedding	Qwen3-Embedding-0.6B	0.6B	28	32K	1024	Yes	Yes
Text embedding	Qwen3-Embedding-4B	4B	36	32K	2560	Yes	Yes
Text embedding	Qwen3-Embedding-8B	8B	36	32K	4096	Yes	Yes
Text reranking	Qwen3-Reranker-0.6B	0.6B	28	32K	-	-	Yes
Text reranking	Qwen3-Reranker-4B	4B	36	32K	-	-	Yes
Text reranking	Qwen3-Reranker-8B	8B	36	32K	-	-	Yes

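⚠️ Important Notes

MRL Support indicates whether the embedding model supports custom dimensions for the final embedding.

Instruction Aware indicates whether the embedding or reranking model supports customizing the input instruction for different tasks.

Evaluation shows that, for most downstream tasks, using an instruction (instruct) typically improves performance by 1% to 5% compared with not using one. Developers are therefore encouraged to write instructions tailored to their tasks and scenarios. In multilingual settings, writing the instruction in English is also recommended, because most instructions used during model training were originally written in English.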
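
To make the MRL Support note concrete: a common way to consume a smaller embedding from an MRL-trained model is to keep the leading components and rescale the result to unit length. The source only states that custom dimensions are supported, so the truncate-and-renormalize procedure below is an assumption, and `truncate_embedding` is a hypothetical helper. It applies to the embedding models in the table, not to the rerankers.

```python
import math

def truncate_embedding(vector, dim):
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and the embedding size")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to rescale for an all-zero vector
    return [x / norm for x in head]

# Illustrative 3-dimensional vector reduced to 2 dimensions
short = truncate_embedding([3.0, 4.0, 12.0], 2)
print(short)
```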

Evaluation Results

Model	Params	MTEB-R	CMTEB-R	MMTEB-R	MLDR	MTEB-Code	FollowIR
Qwen3-Embedding-0.6B	0.6B	61.82	71.02	64.64	50.26	75.41	5.09
Jina-multilingual-reranker-v2-base	0.3B	58.22	63.37	63.73	39.66	58.98	-0.68
gte-multilingual-reranker-base	0.3B	59.51	74.08	59.44	66.33	54.18	-1.64
BGE-reranker-v2-m3	0.6B	57.03	72.16	58.36	59.51	41.38	-0.01
Qwen3-Reranker-0.6B	0.6B	65.80	71.31	66.36	67.28	73.42	5.41
Qwen3-Reranker-4B	4B	69.76	75.94	72.74	69.97	81.20	14.84
Qwen3-Reranker-8B	8B	69.02	77.45	72.94	70.19	81.22	8.05

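⚠️ Important Notes

The results above are for reranking models. Evaluation used the retrieval subsets of MTEB (English, v2), MTEB (Chinese, v1), MMTEB, and MTEB (Code), denoted MTEB-R, CMTEB-R, MMTEB-R, and MTEB-Code respectively.

All scores are computed on the top-100 candidates retrieved by the dense embedding model Qwen3-Embedding-0.6B.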
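
The evaluation setup described above is a two-stage pipeline: an embedding model retrieves a candidate set, and the reranker re-orders only those candidates. The following sketch shows the shape of such a pipeline with toy two-dimensional vectors. All names (`retrieve_top_k`, `retrieve_then_rerank`, `rerank_score`) are hypothetical, and the vectors and scores are made up for illustration rather than produced by any model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_top_k(query_vec, doc_vecs, k=100):
    """First stage: indices of the k documents most similar to the query."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    return order[:k]

def retrieve_then_rerank(query_vec, doc_vecs, rerank_score, k=100):
    """Second stage: re-order the retrieved candidates by reranker score, best first."""
    candidates = retrieve_top_k(query_vec, doc_vecs, k)
    return sorted(candidates, key=rerank_score, reverse=True)

doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_scores = {0: 0.2, 1: 0.9, 2: 0.99}  # made-up reranker scores per document index
result = retrieve_then_rerank([1.0, 0.0], doc_vecs, toy_scores.get, k=2)
print(result)
```

Document 2 has the highest toy reranker score but never reaches the second stage, which shows that the reranker can only improve the order of what the first stage retrieves.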

📄 License

This project is released under the Apache-2.0 license.

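Common Notes

⚠️ Important Notes

With Transformers versions earlier than 4.51.0, you may encounter the error KeyError: 'qwen3'.

💡 Usage Tips

Developers are encouraged to customize the instruct for their specific scenario, task, and language. Tests indicate that in most retrieval scenarios, omitting the instruct on the query side lowers retrieval performance by roughly 1% to 5%.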
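
As an illustration of a task-specific instruct, the sketch below formats one query-document pair in the same `<Instruct>/<Query>/<Document>` layout used by `format_instruction` in the basic usage example. `build_rerank_input` is a hypothetical helper, and the legal-search instruction is an invented example, written in English as recommended above.

```python
DEFAULT_TASK = "Given a web search query, retrieve relevant passages that answer the query"

def build_rerank_input(query, doc, instruction=None):
    """Format one query/document pair, falling back to the default web-search task."""
    task = instruction or DEFAULT_TASK
    return f"<Instruct>: {task}\n<Query>: {query}\n<Document>: {doc}"

# Hypothetical domain-specific instruction, for illustration only
legal_task = "Given a legal question, retrieve statute passages that answer the question"
text = build_rerank_input(
    "What is the notice period for ending a lease?",
    "A tenant must give written notice before ending a lease.",
    instruction=legal_task,
)
print(text)
```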

Citation

If you find our work helpful, please cite the following:

@misc{qwen3-embedding,
    title  = {Qwen3-Embedding},
    url    = {https://qwenlm.github.io/blog/qwen3/},
    author = {Qwen Team},
    month  = {May},
    year   = {2025}
}

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog and GitHub.