Qwen2.5-7B-embed-base: Open-Source Text Embedding Model - Generate High-Quality Text Vectors for Free

Qwen2.5 7B Embed Base

Developed by ssmits

Qwen2.5-7B-embed-base is a pretrained language model based on the Transformer architecture, designed to generate high-quality text embeddings.

Text Embedding

Safetensors

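Language: English

License: Apache-2.0

Tags: #Multilingual embedding #Large-model fine-tuning adaptation #High-dimensional semantic encoding

Downloads: 85

Release date: 11/24/2024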

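Model Overview

This model is part of the Qwen2.5 series with the 'lm_head' layer removed, so it outputs hidden states rather than next-token logits. That makes it suitable for generating text embeddings for tasks such as text similarity and information retrieval.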

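Model Features

Improved tokenizer

The tokenizer adapts to multiple natural languages and to code, which improves processing efficiency.

Efficient attention mechanism

Uses grouped-query attention and related mechanisms to reduce compute and memory cost.

Embedding generation

Optimized for generating high-quality text embeddings and suitable for fine-tuning on downstream tasks.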
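As a rough illustration of the grouped-query attention feature, the sketch below maps query heads onto shared key/value heads and compares how many key/value entries must be cached per token. The head counts (28 query heads, 4 key/value heads) and the head dimension (128) are assumptions based on the published Qwen2.5-7B configuration, not values stated on this page, and the helper names are hypothetical.

```python
# Illustrative sketch of head sharing in grouped-query attention (GQA).
# The three constants are assumptions, not values taken from this page.
NUM_QUERY_HEADS = 28
NUM_KV_HEADS = 4
HEAD_DIM = 128


def kv_head_for_query_head(query_head: int) -> int:
    """Return the index of the key/value head shared by a query head."""
    group_size = NUM_QUERY_HEADS // NUM_KV_HEADS  # query heads per KV head
    return query_head // group_size


def kv_cache_values_per_token(num_kv_heads: int) -> int:
    """Cached values per token per layer (keys plus values)."""
    return 2 * num_kv_heads * HEAD_DIM


mapping = [kv_head_for_query_head(h) for h in range(NUM_QUERY_HEADS)]
gqa_cache = kv_cache_values_per_token(NUM_KV_HEADS)
mha_cache = kv_cache_values_per_token(NUM_QUERY_HEADS)  # one KV head per query head

print(mapping[:8])             # query heads 0-6 share KV head 0, head 7 uses KV head 1
print(mha_cache // gqa_cache)  # how many times smaller the GQA cache is
```

Under these assumed head counts, every group of seven query heads reads the same keys and values, which is where the memory saving comes from.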

Model Capabilities

Text embedding generation

Text similarity computation

Semantic search

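Use Cases

Information retrieval

Document similarity matching

Computes the semantic similarity between different documents.

Can accurately identify semantically similar document pairs.

Recommendation systems

Content recommendation

Produces personalized recommendations from a user's history and content embeddings.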
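The retrieval and recommendation use cases both come down to ranking items by how close their embeddings are to a query embedding. The following is a minimal sketch of that ranking step in plain Python. The three-dimensional vectors are made-up stand-ins for real model embeddings, and `rank_documents` is a hypothetical helper, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scores = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors; real embeddings from the model would be much longer.
doc_vecs = {
    "weather": [0.9, 0.1, 0.0],
    "sports": [0.1, 0.9, 0.2],
    "cooking": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]

ranking = rank_documents(query_vec, doc_vecs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

In a real system the document vectors would be computed once and stored, and only the query would be embedded at request time.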

🚀 Qwen2.5-7B-embed-base

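Qwen2.5-7B-embed-base is an embedding model built on the Qwen2.5 language model series. It converts text into vector representations that downstream natural language processing tasks, such as text classification, can build on.

🚀 Quick Start

Install Dependencies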

The Qwen2.5 code is integrated into recent releases of Hugging Face transformers. Install transformers>=4.37.0; with older versions you may hit the following error:

KeyError: 'qwen2'
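One way to catch the version problem before loading the model is to compare the installed transformers version against the 4.37.0 minimum. The sketch below uses only the standard library; `parse_version` and `is_supported` are hypothetical helpers written for this example, and the parsing is deliberately simple (it keeps the leading integers and ignores suffixes such as `.dev0`).

```python
from importlib import metadata

MIN_VERSION = (4, 37, 0)


def parse_version(text):
    """Turn '4.37.0' or '4.38.0.dev0' into a 3-tuple of integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)  # pad '4.37' to (4, 37, 0)
    return tuple(parts[:3])


def is_supported(version_text, minimum=MIN_VERSION):
    """True if the version is at least the minimum."""
    return parse_version(version_text) >= minimum


try:
    installed = metadata.version("transformers")
    status = "OK" if is_supported(installed) else "too old, upgrade needed"
    print(f"transformers {installed}: {status}")
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```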

Model Inference

Using the sentence-transformers library

from sentence_transformers import SentenceTransformer
import torch

# 1. Load the pretrained Sentence Transformer model
model = SentenceTransformer("ssmits/Qwen2.5-7B-embed-base") # with <= 24 GB of VRAM, consider passing device="cpu"

# Sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 2. Compute the embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 3584)

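# 3. Compute the similarities between the embeddings
# model.encode() returns a NumPy array by default; convert it to a torch tensor
embeddings_tensor = torch.tensor(embeddings)

# Compute the pairwise cosine similarity matrix with torch (broadcasting yields a 3 x 3 result)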
similarities = torch.nn.functional.cosine_similarity(embeddings_tensor.unsqueeze(0), embeddings_tensor.unsqueeze(1), dim=2)

print(similarities)
# tensor([[1.0000, 0.8608, 0.6609],
#         [0.8608, 1.0000, 0.7046],
#         [0.6609, 0.7046, 1.0000]])
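To make the matrix above easier to read, the short sketch below copies the printed values into a plain Python list and picks out the most similar pair of distinct sentences. `most_similar_pair` is a hypothetical helper written for this example; it needs neither the model nor torch.

```python
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# Values copied from the example output printed above.
similarities = [
    [1.0000, 0.8608, 0.6609],
    [0.8608, 1.0000, 0.7046],
    [0.6609, 0.7046, 1.0000],
]


def most_similar_pair(matrix):
    """Return (i, j, score) for the highest off-diagonal entry with i < j."""
    best = None
    for i in range(len(matrix)):
        for j in range(i + 1, len(matrix)):
            if best is None or matrix[i][j] > best[2]:
                best = (i, j, matrix[i][j])
    return best


i, j, score = most_similar_pair(similarities)
print(f"{sentences[i]!r} <-> {sentences[j]!r}: {score}")
```

The two weather sentences score highest, while the unrelated sentence about the stadium scores lower against both. The diagonal is always 1 because each sentence is compared with itself.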

Without the sentence-transformers library

from transformers import AutoTokenizer, AutoModel
import torch

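# Mean pooling: use the attention mask so padding tokens are excluded from the average
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] # the first element of the model output holds all token embeddings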
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load the model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('ssmits/Qwen2.5-7B-embed-base')
model = AutoModel.from_pretrained('ssmits/Qwen2.5-7B-embed-base') # with <= 24 GB of VRAM, consider running on the CPU

# Tokenize the sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute the token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Pool the token embeddings; mean pooling is used here
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
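The masked averaging in mean_pooling can be checked by hand on a tiny example. The sketch below reimplements the same idea in plain Python for a single sentence, with made-up two-dimensional token vectors. `masked_mean` is a hypothetical helper for illustration, not a replacement for the torch version.

```python
def masked_mean(token_vectors, mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    for vec, keep in zip(token_vectors, mask):
        if keep:
            for k in range(dim):
                totals[k] += vec[k]
    # Guard against division by zero, like torch.clamp(..., min=1e-9).
    count = max(sum(mask), 1e-9)
    return [t / count for t in totals]


# Two real tokens followed by one padding token with arbitrary values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

Without the mask the padding vector would pull the average far away from the real tokens, which is why the attention mask is passed into the pooling step.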

Enabling multiple GPUs

from transformers import AutoModel
from torch.nn import DataParallel

model = AutoModel.from_pretrained("ssmits/Qwen2.5-7B-embed-base")
for module_key, module in model._modules.items():
    model._modules[module_key] = DataParallel(module)