Qwen3-Reranker-4B-seq: an open-source multilingual text reranking model for text retrieval

Qwen3 Reranker 4B Seq

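Developed by michaelfeil

Qwen3-Reranker-4B is the latest 4B-parameter text reranking model in the Qwen family. It supports more than 100 languages and performs strongly on text retrieval tasks.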

Text Embedding

Transformers

License: Apache-2.0 #multilingual-reranking #long-text-understanding #instruction-customization

Downloads: 122

Released: June 6, 2025

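Model Overview

A text reranking model built on the Qwen3 series. It is designed to improve the ordering of retrieval results and supports custom instructions and multilingual scenarios.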

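Model Features

Multilingual capability

Reranks text in more than 100 natural languages and programming languages

Instruction awareness

Supports custom instructions to improve performance in specific task scenarios

Long-text processing

Supports a context window of up to 32k tokens

Flexible sizes

Available in 0.6B, 4B, and 8B parameter sizes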

Model Capabilities

Text reranking

Cross-lingual retrieval

Code retrieval

Long-document processing

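Use Cases

Information Retrieval

Search engine result optimization

Reranks the top 100 results returned by a search engine

Scores 69.76 on MTEB-R, the retrieval subset of MTEB (English, v2)

Knowledge Management

Enterprise knowledge base optimization

Improves the relevance ranking of internal knowledge base search results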
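
The search-result use case above boils down to scoring each (query, candidate) pair and sorting by score. The following is a minimal sketch of that step, not the model's own API: `rerank` and `toy_overlap_score` are hypothetical names, and the word-overlap scorer is only a stand-in for a call to the reranker so that the example runs without downloading a model.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and return the top_k best, highest first."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_overlap_score(query: str, doc: str) -> float:
    """Hypothetical stand-in scorer: fraction of query words found in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / max(len(query_words), 1)


if __name__ == "__main__":
    hits = [
        "Beijing weather today",
        "The capital of China is Beijing",
        "Gravity explained",
    ]
    for doc, score in rerank("capital of China", hits, toy_overlap_score, top_k=2):
        print(f"{score:.2f}  {doc}")
```

In a real pipeline, `score_fn` would wrap the reranker scoring code shown in the usage examples, and `candidates` would be the first-stage retrieval results.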

🚀 Qwen3-Reranker-4B

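The Qwen3 Embedding model series is the latest in-house model series of the Qwen family, designed for text embedding and ranking tasks. Built on the dense foundation models of the Qwen3 series, it provides text embedding and reranking models in three sizes (0.6B, 4B, and 8B). The series inherits the foundation models' multilingual capabilities, long-text understanding, and reasoning skills, and it achieves significant advances across text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining.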

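🚀 Quick Start

With Transformers versions earlier than 4.51.0, you may encounter the following error:

KeyError: 'qwen3'

Upgrading to transformers>=4.51.0, the version the usage examples require, avoids it.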
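
One way to catch this before loading the model is to compare the installed version against 4.51.0. The following is a minimal sketch using only the standard library; `parse_version`, `supports_qwen3`, and `MIN_TRANSFORMERS` are illustrative names, and the parser assumes dotted numeric versions and treats pre-releases such as `4.51.0rc1` as satisfying the requirement.

```python
from importlib import metadata

# Minimum Transformers version named in the note above.
MIN_TRANSFORMERS = (4, 51, 0)


def parse_version(version: str) -> tuple:
    """Turn '4.51.0' or '4.52.0.dev0' into a 3-tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad short versions such as '4.51' so tuple comparison behaves as expected.
    parts = (parts + [0, 0, 0])[:3]
    return tuple(parts)


def supports_qwen3(version: str) -> bool:
    """Return True if the given Transformers version is at least 4.51.0."""
    return parse_version(version) >= MIN_TRANSFORMERS


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if supports_qwen3(installed) else "upgrade to >=4.51.0"
        print(f"transformers {installed}: {status}")
```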

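✨ Key Features

Exceptional versatility: The embedding models achieve state-of-the-art results across a wide range of downstream application evaluations. The 8B embedding model ranks No. 1 on the MTEB multilingual leaderboard (as of June 5, 2025, score 70.58), while the reranking models perform strongly across a variety of text retrieval scenarios.
Comprehensive flexibility: The Qwen3 Embedding series offers a full range of sizes (0.6B to 8B) for both embedding and reranking models, covering use cases with different efficiency and effectiveness priorities. Developers can seamlessly combine the two modules. In addition, the embedding models allow flexible vector definitions across all dimensions, and both embedding and reranking models support user-defined instructions to improve performance on specific tasks, languages, or scenarios.
Multilingual capability: Thanks to the multilingual capabilities of the Qwen3 models, the Qwen3 Embedding series supports more than 100 languages, including various programming languages, and provides strong multilingual, cross-lingual, and code retrieval capabilities.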

💻 Usage Examples

Basic usage (Transformers)

# Requires transformers>=4.51.0
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def format_instruction(instruction, query, doc):
    """Build the user-turn text for one (query, document) pair."""
    if instruction is None:
        instruction = 'Given a web search query, retrieve relevant passages that answer the query'
    output = "<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}".format(instruction=instruction, query=query, doc=doc)
    return output

def process_inputs(pairs):
    """Tokenize the pairs, wrap them in the chat prefix/suffix, and pad the batch."""
    # Reserve room for the prefix and suffix tokens so the final length stays within max_length.
    inputs = tokenizer(
        pairs, padding=False, truncation='longest_first',
        return_attention_mask=False, max_length=max_length - len(prefix_tokens) - len(suffix_tokens)
    )
    for i, ele in enumerate(inputs['input_ids']):
        inputs['input_ids'][i] = prefix_tokens + ele + suffix_tokens
    inputs = tokenizer.pad(inputs, padding=True, return_tensors="pt", max_length=max_length)
    for key in inputs:
        inputs[key] = inputs[key].to(model.device)
    return inputs

@torch.no_grad()
def compute_logits(inputs, **kwargs):
    """Return, for each input, the probability of "yes" relative to "no"."""
    # Logits at the last position; left padding keeps that position aligned across the batch.
    batch_scores = model(**inputs).logits[:, -1, :]
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    # Softmax over just the two candidate tokens, then take the "yes" column.
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    scores = batch_scores[:, 1].exp().tolist()
    return scores

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Reranker-4B", padding_side='left')
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-4B").eval()

# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-4B", torch_dtype=torch.float16, attn_implementation="flash_attention_2").cuda().eval()

token_false_id = tokenizer.convert_tokens_to_ids("no")
token_true_id = tokenizer.convert_tokens_to_ids("yes")
max_length = 8192

prefix = "<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".<|im_end|>\n<|im_start|>user\n"
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
prefix_tokens = tokenizer.encode(prefix, add_special_tokens=False)
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)
        
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = ["What is the capital of China?",
    "Explain gravity",
]

documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

pairs = [format_instruction(task, query, doc) for query, doc in zip(queries, documents)]

# Tokenize the input texts
inputs = process_inputs(pairs)
scores = compute_logits(inputs)

print("scores: ", scores)
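
The score returned by `compute_logits` above is a two-way softmax over the "yes" and "no" logits, which is the same as a sigmoid of their difference. The following is a small worked sketch in plain Python; the logit values are made up for illustration, and `yes_probability` and `sigmoid` are hypothetical helper names, not part of the example above.

```python
import math


def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Softmax over the two logits, returning the probability mass on "yes"."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logit_yes, logit_no)
    exp_yes = math.exp(logit_yes - m)
    exp_no = math.exp(logit_no - m)
    return exp_yes / (exp_yes + exp_no)


def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


if __name__ == "__main__":
    # Illustrative logits only; real values come from the model's last position.
    logit_yes, logit_no = 2.0, -1.0
    print(yes_probability(logit_yes, logit_no))
    print(sigmoid(logit_yes - logit_no))  # same value up to rounding
```

Because only the difference between the two logits matters, a score of 0.5 means the model rates "yes" and "no" as equally likely.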

Advanced usage (vLLM)

# Requires vllm>=0.8.5
import math

import torch
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import destroy_model_parallel
from vllm.inputs.data import TokensPrompt

def format_instruction(instruction, query, doc):
    text = [
        {"role": "system", "content": "Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\"."},
        {"role": "user", "content": f"<Instruct>: {instruction}\n\n<Query>: {query}\n\n<Document>: {doc}"}
    ]
    return text

def process_inputs(pairs, instruction, max_length, suffix_tokens):
    messages = [format_instruction(instruction, query, doc) for query, doc in pairs]
    messages =  tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=False, enable_thinking=False
    )
    messages = [ele[:max_length] + suffix_tokens for ele in messages]
    messages = [TokensPrompt(prompt_token_ids=ele) for ele in messages]
    return messages

def compute_logits(model, messages, sampling_params, true_token, false_token):
    """Convert the "yes"/"no" log-probabilities of the generated token into a score in [0, 1]."""
    outputs = model.generate(messages, sampling_params, use_tqdm=False)
    scores = []
    for i in range(len(outputs)):
        # Log-probabilities reported for the single generated token.
        final_logits = outputs[i].outputs[0].logprobs[-1]
        # A token missing from the reported log-probabilities gets a low default of -10.
        if true_token not in final_logits:
            true_logit = -10
        else:
            true_logit = final_logits[true_token].logprob
        if false_token not in final_logits:
            false_logit = -10
        else:
            false_logit = final_logits[false_token].logprob
        # Normalize over the two tokens so the score is P(yes) / (P(yes) + P(no)).
        true_score = math.exp(true_logit)
        false_score = math.exp(false_logit)
        score = true_score / (true_score + false_score)
        scores.append(score)
    return scores

number_of_gpu = torch.cuda.device_count()
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-Reranker-4B')
model = LLM(model='Qwen/Qwen3-Reranker-4B', tensor_parallel_size=number_of_gpu, max_model_len=10000, enable_prefix_caching=True, gpu_memory_utilization=0.8)
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
max_length=8192
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)
true_token = tokenizer("yes", add_special_tokens=False).input_ids[0]
false_token = tokenizer("no", add_special_tokens=False).input_ids[0]
sampling_params = SamplingParams(temperature=0, 
    max_tokens=1,
    logprobs=20, 
    allowed_token_ids=[true_token, false_token],
)

        
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = ["What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

pairs = list(zip(queries, documents))
inputs = process_inputs(pairs, task, max_length-len(suffix_tokens), suffix_tokens)
scores = compute_logits(model, inputs, sampling_params, true_token, false_token)
print('scores', scores)

destroy_model_parallel()
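
In the vLLM example above, `process_inputs` receives `max_length - len(suffix_tokens)` and then appends the suffix, so the prompt plus suffix never exceeds `max_length` tokens. The following sketch isolates that budgeting step with plain lists of token ids; `truncate_and_append` is a hypothetical helper name and the ids are arbitrary placeholders.

```python
from typing import List


def truncate_and_append(prompt_ids: List[int], suffix_ids: List[int], max_length: int) -> List[int]:
    """Trim the prompt so that prompt + suffix never exceeds max_length tokens."""
    budget = max_length - len(suffix_ids)
    if budget < 0:
        raise ValueError("suffix alone is longer than max_length")
    # Slicing drops tokens from the end of the prompt, i.e. the tail of the document.
    return prompt_ids[:budget] + suffix_ids


if __name__ == "__main__":
    prompt = list(range(20))    # placeholder token ids for a long prompt
    suffix = [100, 101, 102]    # placeholder token ids for the assistant suffix
    result = truncate_and_append(prompt, suffix, max_length=10)
    print(len(result), result)
```

Note that truncation removes the end of the document text, so very long documents are judged on their leading portion only.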

📚 Documentation

Qwen3-Reranker-4B model information

Property	Details
Model type	Text reranking
Supported languages	100+ languages
Number of parameters	4B
Context length	32k

Qwen3 Embedding series model list

Model type	Model	Size	Layers	Sequence length	Embedding dimension	MRL Support	Instruction Aware
Text embedding	Qwen3-Embedding-0.6B	0.6B	28	32K	1024	Yes	Yes
Text embedding	Qwen3-Embedding-4B	4B	36	32K	2560	Yes	Yes
Text embedding	Qwen3-Embedding-8B	8B	36	32K	4096	Yes	Yes
Text reranking	Qwen3-Reranker-0.6B	0.6B	28	32K	-	-	Yes
Text reranking	Qwen3-Reranker-4B	4B	36	32K	-	-	Yes
Text reranking	Qwen3-Reranker-8B	8B	36	32K	-	-	Yes

⚠️ Important note

MRL Support indicates whether the embedding model supports custom dimensions for the final embedding.
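
As an illustration of what a custom output dimension can mean in practice, MRL-style (Matryoshka) embeddings are commonly shortened by keeping the leading components and rescaling to unit length. The source does not describe the exact procedure for the Qwen3 embedding models, so treat the sketch below as an assumption; `truncate_embedding` is a hypothetical helper and the vector is a toy value.

```python
import math
from typing import List


def truncate_embedding(vector: List[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return head
    return [x / norm for x in head]


if __name__ == "__main__":
    full = [3.0, 4.0, 12.0]  # toy 3-dimensional embedding
    print(truncate_embedding(full, 2))
```

This applies to the embedding models only; the rerankers output a relevance score rather than a vector, which is why their MRL column in the table is empty.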

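Instruction Aware indicates whether the embedding or reranking model supports customizing the input instruction for different tasks.

Evaluation shows that, for most downstream tasks, using an instruction (instruct) typically improves performance by 1% to 5% compared with not using one. Developers are therefore encouraged to write instructions tailored to their tasks and scenarios. In multilingual settings, instructions should still be written in English, because most instructions used during model training were originally written in English.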
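
A task-specific instruction slots into the same `<Instruct>`/`<Query>`/`<Document>` template used in the basic usage example. The sketch below is illustrative only: `build_rerank_input` is a hypothetical helper that mirrors `format_instruction`, and the legal-search instruction is an invented example of an English instruction paired with a non-English query.

```python
from typing import Optional

# Default instruction taken from the basic usage example.
DEFAULT_INSTRUCTION = "Given a web search query, retrieve relevant passages that answer the query"


def build_rerank_input(query: str, doc: str, instruction: Optional[str] = None) -> str:
    """Format one (query, document) pair with the basic usage template."""
    if instruction is None:
        instruction = DEFAULT_INSTRUCTION
    return f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}"


if __name__ == "__main__":
    # Hypothetical task-specific instruction, written in English as recommended above.
    legal_task = "Given a legal question, retrieve statute passages that answer the question"
    text = build_rerank_input(
        "什麼是不可抗力？",
        "不可抗力是指不能預見、不能避免且不能克服的客觀情況。",
        legal_task,
    )
    print(text)
```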

Evaluation results

Model	Params	MTEB-R	CMTEB-R	MMTEB-R	MLDR	MTEB-Code	FollowIR
Qwen3-Embedding-0.6B	0.6B	61.82	71.02	64.64	50.26	75.41	5.09
Jina-multilingual-reranker-v2-base	0.3B	58.22	63.37	63.73	39.66	58.98	-0.68
gte-multilingual-reranker-base	0.3B	59.51	74.08	59.44	66.33	54.18	-1.64
BGE-reranker-v2-m3	0.6B	57.03	72.16	58.36	59.51	41.38	-0.01
Qwen3-Reranker-0.6B	0.6B	65.80	71.31	66.36	67.28	73.42	5.41
Qwen3-Reranker-4B	4B	69.76	75.94	72.74	69.97	81.20	14.84
Qwen3-Reranker-8B	8B	69.02	77.45	72.94	70.19	81.22	8.05

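⚠️ Important note

The results above evaluate reranking models on the retrieval subsets of MTEB (English, v2), MTEB (Chinese, v1), MMTEB, and MTEB (Code), denoted MTEB-R, CMTEB-R, MMTEB-R, and MTEB-Code.

All scores are based on the top-100 candidates retrieved by the dense embedding model Qwen3-Embedding-0.6B.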

📄 License

This project is released under the Apache-2.0 license.

Common notes

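💡 Usage tips

Developers are encouraged to customize the instruct for their specific scenario, task, and language. In tests, omitting the instruct on the query side lowered retrieval performance by roughly 1% to 5% in most retrieval scenarios.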

Citation

If you find our work helpful, please cite:

@misc{qwen3-embedding,
    title  = {Qwen3-Embedding},
    url    = {https://qwenlm.github.io/blog/qwen3/},
    author = {Qwen Team},
    month  = {May},
    year   = {2025}
}

For more details, including benchmark evaluations, hardware requirements, and inference performance, please refer to our blog and GitHub.