Qwen3-Embedding-8B-GGUF Open-Source Model: A Text Embedding and Ranking Tool with Multilingual Long-Text Understanding

Qwen3 Embedding 8B GGUF

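Developed by Mungert

Qwen3-Embedding-8B belongs to the latest model series in the Qwen family built specifically for text embedding and ranking tasks. It is based on the dense foundation models of the Qwen3 series and offers strong multilingual and long-text understanding capabilities.

Text embedding · License: Apache-2.0 · #multilingual-embedding #long-text-understanding #instruction-optimized-retrieval

Downloads: 612

Released: 6/10/2025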

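Model Overview

Qwen3-Embedding-8B is a high-performance text embedding model suited to text retrieval, code retrieval, text classification, text clustering, and bitext mining.

Model Highlights

Exceptional versatility

The embedding model achieves state-of-the-art performance across a wide range of downstream evaluations; the 8B embedding model ranks first on the MTEB multilingual leaderboard (as of June 5, 2025, with a score of 70.58).

Comprehensive flexibility

The Qwen3 Embedding series offers embedding and reranking models in a full range of sizes (0.6B to 8B), covering use cases that prioritize efficiency as well as those that prioritize effectiveness.

Multilingual capability

Supports more than 100 languages, including a variety of programming languages, and provides strong multilingual, cross-lingual, and code retrieval capabilities.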

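Model Capabilities

Text retrieval

Code retrieval

Text classification

Text clustering

Bitext mining

Use Cases

Information retrieval

Web search

Given a web search query, retrieve relevant passages that answer the query.

Performs strongly across multiple text retrieval tasks

Natural language processing

Text classification

Classify text, for example by sentiment or topic.

Shows significant improvements on multiple text classification tasks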

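🚀 Qwen/Qwen3-Embedding-8B GGUF Models

The Qwen3 Embedding model series is the latest model series in the Qwen family designed specifically for text embedding and ranking tasks. Built on the dense foundation models of the Qwen3 series, it provides text embedding and reranking models in several sizes (0.6B, 4B, and 8B). The series inherits the base models' strong multilingual capabilities, long-text understanding, and reasoning skills, and it delivers significant gains on multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining.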

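✨ Key Features

Exceptional versatility: The embedding model achieves state-of-the-art performance across a wide range of downstream evaluations. The 8B embedding model ranks first on the MTEB multilingual leaderboard (as of June 5, 2025, with a score of 70.58), and the reranking models perform strongly in a variety of text retrieval scenarios.
Comprehensive flexibility: The Qwen3 Embedding series offers embedding and reranking models in a full range of sizes (0.6B to 8B), covering use cases that prioritize efficiency as well as those that prioritize effectiveness. Developers can combine the two modules seamlessly. In addition, the embedding models let you choose the dimension of the output vector, and both the embedding and reranking models accept user-defined instructions to improve performance for a specific task, language, or scenario.
Multilingual capability: Thanks to the multilingual capabilities of the Qwen3 models, the Qwen3 Embedding series supports more than 100 languages, including a variety of programming languages, and provides strong multilingual, cross-lingual, and code retrieval capabilities.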
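The flexible output dimension can be illustrated with a short sketch. It assumes the common Matryoshka-style (MRL) approach, in which a smaller embedding is obtained by keeping the leading components of the full vector and L2-normalizing the result again. The function name `truncate_embedding` and the toy vector are illustrative only and are not part of the model's API.

```python
import math


def truncate_embedding(vector, dim):
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    if not 1 <= dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Toy 4-dimensional vector standing in for a real embedding (assumption).
full = [3.0, 4.0, 12.0, 84.0]
print(truncate_embedding(full, 2))  # [0.6, 0.8]
```

With Sentence Transformers, the same effect is normally requested when loading the model rather than done by hand, so treat this purely as an explanation of the idea.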

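📦 Installation

With Transformers versions earlier than 4.51.0, you may encounter the following error:

KeyError: 'qwen3'

Make sure the installed libraries meet these minimum versions: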

transformers>=4.51.0
sentence-transformers>=2.7.0
vllm>=0.8.5
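A quick way to check these minimums before running the examples is sketched below. It reads installed versions with the standard-library `importlib.metadata` and compares only the leading numeric release segments. The helper names (`parse_release`, `meets_minimum`, `check_environment`) are made up for this sketch, and the parser is deliberately simple. A full PEP 440 comparison, for example via `packaging.version.Version`, would be more robust.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions listed above.
MINIMUMS = {
    "transformers": "4.51.0",
    "sentence-transformers": "2.7.0",
    "vllm": "0.8.5",
}


def parse_release(text):
    """Turn '4.51.0' into (4, 51, 0); stops at the first non-numeric segment."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare numeric release tuples, padding the shorter one with zeros."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums=MINIMUMS):
    """Return a {package: status} report for the current environment."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = meets_minimum(installed, minimum)
        report[name] = "ok" if ok else f"too old ({installed} < {minimum})"
    return report


for package, status in check_environment().items():
    print(f"{package}: {status}")
```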

💻 Usage Examples

Basic usage

Sentence Transformers example

# Requires transformers>=4.51.0
# Requires sentence-transformers>=2.7.0

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("Qwen/Qwen3-Embedding-8B")

# We recommend enabling flash_attention_2 for better acceleration and memory saving,
# together with setting `padding_side` to "left":
# model = SentenceTransformer(
#     "Qwen/Qwen3-Embedding-8B",
#     model_kwargs={"attn_implementation": "flash_attention_2", "device_map": "auto"},
#     tokenizer_kwargs={"padding_side": "left"},
# )

# The queries and documents to embed
queries = [
    "What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

# Encode the queries and documents. Note that queries benefit from using a prompt
# Here we use the prompt called "query" stored under `model.prompts`, but you can
# also pass your own prompt via the `prompt` argument
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Compute the (cosine) similarity between the query and document embeddings
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
# tensor([[0.7493, 0.0751],
#         [0.0880, 0.6318]])
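The similarity matrix has one row per query and one column per document, so retrieval reduces to taking the highest-scoring column in each row. The sketch below hard-codes the scores printed above as a plain nested list so that it runs without the model; in real code, `similarity.tolist()` would give the same structure. The helper `best_match` is a name introduced here for illustration.

```python
# Scores copied from the example output above (rows: queries, columns: documents).
similarity = [
    [0.7493, 0.0751],
    [0.0880, 0.6318],
]
queries = ["What is the capital of China?", "Explain gravity"]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other.",
]


def best_match(row):
    """Return the index of the highest score in one row."""
    return max(range(len(row)), key=lambda j: row[j])


for query, row in zip(queries, similarity):
    j = best_match(row)
    print(f"{query!r} -> {documents[j]!r} (score {row[j]:.4f})")
```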

Transformers example

# Requires transformers>=4.51.0

import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-Embedding-8B', padding_side='left')
model = AutoModel.from_pretrained('Qwen/Qwen3-Embedding-8B')

# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = AutoModel.from_pretrained('Qwen/Qwen3-Embedding-8B', attn_implementation="flash_attention_2", torch_dtype=torch.float16).cuda()

max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
batch_dict.to(model.device)
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7493016123771667, 0.0750647559762001], [0.08795969933271408, 0.6318399906158447]]
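To make the pooling step concrete without loading the model, here is a small pure-Python sketch of last-token pooling on nested lists. It is an illustration rather than a replacement for `last_token_pool` above: instead of the tensor indexing used there, it scans each attention-mask row for the last `1`, which selects the same position whenever the mask is contiguous. The names `last_token_index` and `pool_last_token` are introduced only for this sketch.

```python
def last_token_index(mask_row):
    """Index of the last real (non-padding) token in one attention-mask row."""
    for i in range(len(mask_row) - 1, -1, -1):
        if mask_row[i] == 1:
            return i
    raise ValueError("row contains only padding")


def pool_last_token(hidden_states, attention_mask):
    """hidden_states: [batch][seq][dim] nested lists; attention_mask: [batch][seq]."""
    return [
        states[last_token_index(mask)]
        for states, mask in zip(hidden_states, attention_mask)
    ]


# Toy batch: each "hidden state" is a 1-element vector holding its own position.
positions = [[float(i)] for i in range(4)]
hidden = [positions, positions]

left_padded = [[0, 0, 1, 1], [1, 1, 1, 1]]   # padding first, as with padding_side='left'
right_padded = [[1, 1, 0, 0], [1, 1, 1, 1]]  # padding last

print(pool_last_token(hidden, left_padded))   # [[3.0], [3.0]]
print(pool_last_token(hidden, right_padded))  # [[1.0], [3.0]]
```

With left padding the last real token always sits in the final position, which is why the tensor version can simply take `last_hidden_states[:, -1]` in that case.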

vLLM example

# Requires vllm>=0.8.5
import torch
import vllm
from vllm import LLM


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'


# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

model = LLM(model="Qwen/Qwen3-Embedding-8B", task="embed")

outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7482624650001526, 0.07556197047233582], [0.08875375241041183, 0.6300010681152344]]
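The vLLM scores are close to, but not identical with, the Transformers scores shown earlier. The sketch below copies both printed matrices and checks two things: the largest element-wise difference, and whether each query still ranks the same document first. Small deviations between backends are plausibly caused by different kernels and numeric precision, though that explanation is an assumption; the helper names are illustrative.

```python
# Score matrices copied verbatim from the two example outputs.
transformers_scores = [
    [0.7493016123771667, 0.0750647559762001],
    [0.08795969933271408, 0.6318399906158447],
]
vllm_scores = [
    [0.7482624650001526, 0.07556197047233582],
    [0.08875375241041183, 0.6300010681152344],
]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(
        abs(x - y)
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


def top_indices(matrix):
    """Index of the best-scoring document for each query row."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


diff = max_abs_diff(transformers_scores, vllm_scores)
print(f"max abs diff: {diff:.4f}")
print("same ranking:", top_indices(transformers_scores) == top_indices(vllm_scores))
```

When validating a new backend or a quantized file, comparing rankings as well as raw scores is usually more informative than expecting bit-identical numbers.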

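Advanced usage

⚠️ Important

We recommend that developers customize the instruct for their specific scenario, task, and language. Our tests show that in most retrieval scenarios, omitting the instruct on the query side lowers retrieval performance by roughly 1% to 5%.

💡 Usage tip

For most downstream tasks, using an instruction (instruct) typically improves performance by 1% to 5% over not using one, so developers are encouraged to write instructions tailored to their own tasks and scenarios. In multilingual settings, write the instructions in English, because most of the instructions used during training were originally written in English.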
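As a sketch of this advice, the snippet below keeps a small table of English task descriptions and applies one of them to queries written in other languages, using the same `Instruct: ...\nQuery:...` layout as the basic examples. Only the `web_search` description comes from those examples; the other two, the `TASKS` table, and `build_queries` are hypothetical and should be adapted to your own data.

```python
def get_detailed_instruct(task_description: str, query: str) -> str:
    # Same prompt layout as in the basic usage examples.
    return f"Instruct: {task_description}\nQuery:{query}"


# English task descriptions. Only "web_search" is taken from the examples above;
# the other two are hypothetical wordings for illustration.
TASKS = {
    "web_search": "Given a web search query, retrieve relevant passages that answer the query",
    "code_search": "Given a natural-language question, retrieve code snippets that answer it",
    "faq": "Given a customer question, retrieve the FAQ entry that resolves it",
}


def build_queries(task_name, raw_queries):
    """Prefix every raw query with the instruction for the chosen task."""
    task = TASKS[task_name]
    return [get_detailed_instruct(task, q) for q in raw_queries]


# The instruction stays in English even though the queries are not.
prompts = build_queries("faq", ["如何重置密码？", "Wie ändere ich meine E-Mail-Adresse?"])
for prompt in prompts:
    print(prompt)
    print("---")
```

Documents are embedded without any instruction, so only the query side goes through `build_queries`.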

📚 Detailed Documentation

Model generation details

This model was generated with llama.cpp at commit 1f63e75f.

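Choosing the right model format

The right model format depends on your hardware capabilities and memory constraints.

BF16 (Brain Float 16): use it if BF16 acceleration is available

A 16-bit floating-point format designed for faster computation while retaining good accuracy.
Offers a dynamic range similar to FP32 with lower memory usage.
Recommended if your hardware supports BF16 acceleration (check the device specifications).
Suited to high-performance inference with a smaller memory footprint than FP32.

Use BF16 if:

Your hardware has native BF16 support (for example, newer GPUs or TPUs).
You want high precision while still saving memory relative to FP32.
You plan to requantize the model into another format.

Avoid BF16 if:

Your hardware does not support BF16 (it may fall back to FP32 and run slower).
You need compatibility with older devices that lack BF16 optimizations.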

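F16 (Float 16): more widely supported than BF16

A 16-bit floating-point format with high precision but a narrower range of values than BF16.
Works on most devices with FP16 acceleration (including many GPUs and some CPUs).
Its dynamic range is narrower than BF16's (5 exponent bits versus 8), although its 10-bit mantissa is finer than BF16's 7 bits; it is generally sufficient for inference.

Use F16 if:

Your hardware supports FP16 but not BF16.
You need a balance of speed, memory usage, and accuracy.
You are running on a GPU or another device optimized for FP16 computation.

Avoid F16 if:

Your device lacks native FP16 support (it may run slower than expected).
You are memory-constrained.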
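The range and precision trade-off between the two 16-bit formats follows directly from how their bits are split. The sketch below derives the largest finite value and the machine epsilon from the exponent and mantissa widths of IEEE FP16 (5 and 10 bits), BF16 (8 and 7 bits), and FP32 (8 and 23 bits). The function `float_format_stats` is a helper written for this illustration.

```python
def float_format_stats(exponent_bits, mantissa_bits):
    """Return (largest finite value, machine epsilon) for a binary float format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent is reserved for inf/NaN, so the largest usable
    # unbiased exponent equals the bias.
    max_finite = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    epsilon = 2.0 ** -mantissa_bits  # gap between 1.0 and the next representable value
    return max_finite, epsilon


FORMATS = {"FP16": (5, 10), "BF16": (8, 7), "FP32": (8, 23)}

for name, (e_bits, m_bits) in FORMATS.items():
    max_finite, eps = float_format_stats(e_bits, m_bits)
    print(f"{name}: max finite ~ {max_finite:.4g}, epsilon = {eps:.3g}")
```

FP16 tops out at 65504 but resolves finer steps, while BF16 reaches roughly the same magnitude as FP32 with coarser steps.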

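Hybrid precision models (for example, `bf16_q8_0`, `f16_q4_K`): the best of both worlds

These formats selectively quantize non-critical layers while keeping critical layers (for example, the attention and output layers) at full precision.

They are named like bf16_q8_0, meaning full-precision BF16 core layers plus Q8_0-quantized remaining layers.
They balance memory efficiency and accuracy, improving on fully quantized models without needing the full memory of BF16/F16.

Use hybrid models if:

You need better accuracy than quantization-only models but cannot afford full BF16/F16 everywhere.
Your device supports mixed-precision inference.
You want to optimize the trade-off for production-grade models on constrained hardware.

Avoid hybrid models if:

Your target device does not support mixed- or full-precision acceleration.
You are operating under very strict memory limits (in that case, use a fully quantized format).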

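Quantized models (Q4_K, Q6_K, Q8, etc.): for CPU and low-VRAM inference

Quantization reduces model size and memory usage while preserving as much accuracy as possible.

Lower-bit models (Q4_K): lowest memory usage, possibly at the cost of precision.
Higher-bit models (Q6_K, Q8_0): better accuracy, but more memory required.

Use quantized models if:

You are running inference on a CPU and need an optimized model.
Your device has too little VRAM to load a full-precision model.
You want to reduce the memory footprint while keeping reasonable accuracy.

Avoid quantized models if:

You need maximum accuracy (full-precision models are a better fit).
Your hardware has enough VRAM for a higher-precision format (BF16/F16).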
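A rough feel for the memory savings can be had from the nominal bits per weight of each format. The sketch below multiplies a parameter count by those figures, which follow from the ggml block layouts (for example, a Q8_0 block stores 32 int8 weights plus one 16-bit scale, giving 8.5 bits per weight). Two assumptions are baked in: the parameter count is taken as the nominal 8 billion, and every tensor is assumed to use the same type. Real GGUF files mix tensor types and need extra memory at run time, so treat the results as ballpark numbers only.

```python
# Nominal storage cost per weight, derived from the ggml block layouts.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "F16": 16.0,
    "Q8_0": 8.5,     # 34 bytes per 32-weight block
    "Q6_K": 6.5625,  # 210 bytes per 256-weight super-block
    "Q4_K": 4.5,     # 144 bytes per 256-weight super-block
    "Q4_0": 4.5,     # 18 bytes per 32-weight block
}


def estimate_weight_gb(n_params, fmt):
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


N_PARAMS = 8e9  # assumption: the nominal "8B"; the exact count differs

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:5s} ~ {estimate_weight_gb(N_PARAMS, fmt):.1f} GB")
```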

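Very low-bit quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)

These models are optimized for very high memory efficiency, which makes them well suited to low-power devices or large-scale deployments where memory is the critical constraint.

IQ3_XS: ultra-low-bit quantization (3-bit) with very high memory efficiency.
- Use case: best for ultra-low-memory devices where even Q4_K is too large.
- Trade-off: lower accuracy than higher-bit quantizations.
IQ3_S: small block size for maximum memory efficiency.
- Use case: best for low-memory devices where IQ3_XS is too aggressive.
IQ3_M: medium block size, with better accuracy than IQ3_S.
- Use case: suited to low-memory devices where IQ3_S is too limiting.
Q4_K: 4-bit quantization with block-wise optimization for better accuracy.
- Use case: best for low-memory devices where Q6_K is too large.
Q4_0: pure 4-bit quantization, optimized for ARM devices.
- Use case: best for ARM-based devices or low-memory environments.

Ultra-low-bit quantization (IQ1_S, IQ1_M, IQ2_S, IQ2_M, IQ2_XS, IQ2_XXS)

Ultra-low-bit quantization (1 to 2 bits) with extreme memory efficiency.
- Use case: best when the model must fit into very constrained memory.
- Trade-off: very low accuracy. The model may not behave as expected; test thoroughly before use.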

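Attribute	Details
Model type	Text embedding
Supported languages	100+ languages
Parameters	8B
Context length	32k
Embedding dimension	Up to 4096; user-defined output dimensions from 32 to 4096 are supported

Summary table: model format selection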

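Model format	Precision	Memory usage	Device requirements	Best use case
BF16	Very high	High	GPU/CPU with BF16 support	High-speed inference with reduced memory
F16	High	High	GPU/CPU with FP16 support	Inference when BF16 is unavailable
Q4_K	Medium-low	Low	CPU or low-VRAM devices	Memory-constrained inference
Q6_K	Medium	Moderate	CPU with more memory	Better accuracy under quantization
Q8_0	High	Moderate	GPU/CPU with moderate VRAM	Highest accuracy among quantized models
IQ3_XS	Low	Very low	Ultra-low-memory devices	Maximum memory efficiency, low accuracy
IQ3_S	Low	Very low	Low-memory devices	More usable than IQ3_XS
IQ3_M	Medium-low	Low	Low-memory devices	Better accuracy than IQ3_S
Q4_0	Low	Low	ARM-based/embedded devices	llama.cpp automatically optimizes for ARM inference
Ultra-low-bit (IQ1_*/IQ2_*)	Very low	Extremely low	Small edge/embedded devices	Fitting the model into extremely tight memory; low accuracy
Hybrid (for example, `bf16_q8_0`)	Medium-high	Moderate	Hardware with mixed-precision support	Balanced performance and memory, near-FP accuracy in critical layers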
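One way to apply the table is to walk the formats from most to least precise and take the first one that both runs on the hardware and fits the memory budget. The sketch below does exactly that. The preference order mirrors the precision column, while `pick_format`, its parameters, and the sizes in `example_sizes_gb` are hypothetical: the sizes are placeholder numbers, so substitute the real file sizes from the repository listing and leave headroom for run-time buffers.

```python
# Formats ordered from most to least precise, following the table's precision column.
PREFERENCE = ["BF16", "F16", "Q8_0", "Q6_K", "Q4_K", "IQ3_M", "IQ3_S", "IQ3_XS"]


def pick_format(file_sizes_gb, budget_gb, has_bf16=False, has_fp16=False):
    """Return the most precise format that is supported and fits, or None."""
    for fmt in PREFERENCE:
        if fmt == "BF16" and not has_bf16:
            continue
        if fmt == "F16" and not has_fp16:
            continue
        size = file_sizes_gb.get(fmt)  # formats without a known size are skipped
        if size is not None and size <= budget_gb:
            return fmt
    return None


# Placeholder sizes in GB (assumption, not measured file sizes).
example_sizes_gb = {"BF16": 16.0, "F16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6, "Q4_K": 4.5}

print(pick_format(example_sizes_gb, budget_gb=24, has_bf16=True))  # BF16
print(pick_format(example_sizes_gb, budget_gb=24))                 # Q8_0
print(pick_format(example_sizes_gb, budget_gb=6))                  # Q4_K
print(pick_format(example_sizes_gb, budget_gb=2))                  # None
```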

Qwen3 Embedding series model list

Model type	Model	Size	Layers	Sequence length	Embedding dimension	MRL support	Instruction aware
Text embedding	Qwen3-Embedding-0.6B	0.6B	28	32K	1024	Yes	Yes
Text embedding	Qwen3-Embedding-4B	4B	36	32K	2560	Yes	Yes
Text embedding	Qwen3-Embedding-8B	8B	36	32K	4096	Yes	Yes
Text reranking	Qwen3-Reranker-0.6B	0.6B	28	32K	-	-	Yes
Text reranking	Qwen3-Reranker-4B	4B	36	32K	-	-	Yes
Text reranking	Qwen3-Reranker-8B	8B	36	32K	-	-	Yes

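Notes:

MRL support indicates whether the embedding model supports custom dimensions for the final embedding.

Instruction aware indicates whether the embedding or reranking model supports customizing the input instruction for different tasks.

Our evaluation shows that, for most downstream tasks, using an instruction (instruct) typically improves performance by 1% to 5% over not using one. Developers are therefore encouraged to write instructions tailored to their own tasks and scenarios. In multilingual settings, we also advise writing instructions in English, because most of the instructions used during training were originally written in English.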