Jina-embeddings-v2-base-de open-source model - free German-English text feature extraction and similarity computation

Jina Embeddings V2 Base De

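Developed by jinaai

Jina Embeddings V2 Base German is a Transformer-based sentence embedding model focused on feature extraction and sentence similarity computation for German and English text.

Text embeddings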

Transformers

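Supports multiple languages

License: Apache-2.0

Tags: #German-English bilingual sentence embeddings #High-precision text similarity computation #MTEB benchmark optimization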

Downloads: 124.30k

Released: 1/12/2024

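Model overview

The model is mainly used to generate high-quality sentence embeddings for German and English text, and is suited to tasks such as sentence similarity computation, text classification, and information retrieval.

Model features

Multilingual support

Generates embeddings for German and English text, making it suitable for cross-lingual applications.

High-performance feature extraction

Performs strongly on the MTEB benchmark, particularly on German and English classification and retrieval tasks.

Flexible applicability

Applies to a range of natural language processing tasks, including text classification, clustering, and information retrieval.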

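Model capabilities

Sentence embedding generation

Text classification

Sentence similarity computation

Information retrieval

Text clustering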

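Use cases

E-commerce

Product review classification

Classify German and English product reviews to identify positive and negative opinions.

Reaches 77.52% accuracy on the MTEB AmazonPolarityClassification test.

Information retrieval

Cross-lingual document retrieval

Retrieve relevant content from German and English documents.

Reaches 98.98% accuracy on the MTEB BUCC (de-en) test.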

🚀 jina-embeddings-v2-base-de

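jina-embeddings-v2-base-de is a German-English bilingual text embedding model trained by Jina AI that supports sequence lengths of up to 8192 tokens. It performs well in both monolingual and cross-lingual applications and is designed to process mixed German-English input without bias.

🚀 Quick start

The easiest way to use jina-embeddings-v2-base-de is through Jina AI's Embedding API.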
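
As a rough sketch of what a call to the hosted API might look like: the endpoint URL, the JSON field names, the response shape, and the JINA_API_KEY environment variable below are assumptions that this page does not state, so check the Embedding API documentation before relying on them. The helper only builds the request; sending it requires a valid key.

```python
import json
import os
import urllib.request

# Assumed endpoint and payload shape; verify against the official API docs.
API_URL = "https://api.jina.ai/v1/embeddings"


def build_embedding_request(texts, api_key, model="jina-embeddings-v2-base-de"):
    """Build (but do not send) a POST request asking for embeddings of `texts`."""
    payload = {"model": model, "input": list(texts)}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


if __name__ == "__main__":
    key = os.environ.get("JINA_API_KEY")  # hypothetical variable name
    if key:
        req = build_embedding_request(["Wie ist das Wetter heute?"], key)
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        # Assumed response shape: {"data": [{"embedding": [...]}, ...]}
        print(len(body["data"][0]["embedding"]))
```

Keeping request construction separate from sending makes the payload easy to inspect without network access.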

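✨ Key features

Bilingual support: handles both German and English and is designed to process mixed German-English input without bias.
Long-sequence processing: supports sequence lengths of up to 8192 tokens.
High performance: built on a BERT architecture (JinaBERT) that uses the symmetric bidirectional variant of ALiBi, and performs well in monolingual and cross-lingual applications.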
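
To illustrate what a symmetric bidirectional ALiBi bias looks like, the sketch below builds the bias that is added to the attention scores of one head: a penalty proportional to the absolute distance |i - j| between query and key positions, so tokens to the left and to the right are treated alike. The slope schedule (the geometric sequence from the original ALiBi formulation, for a power-of-two head count) and the function names are illustrative assumptions, not JinaBERT's actual code.

```python
def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8*k/num_heads) for k = 1..num_heads (power-of-two head counts)."""
    return [2.0 ** (-8.0 * (k + 1) / num_heads) for k in range(num_heads)]


def symmetric_alibi_bias(seq_len, slope):
    """Bias matrix with entry [i][j] = -slope * |i - j|, to be added to attention logits."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slopes = alibi_slopes(8)                    # one slope per attention head
bias = symmetric_alibi_bias(4, slopes[0])   # bias for the first head, 4 positions
for row in bias:
    print(row)
```

Because the bias depends only on relative distance, it can be computed for any sequence length at inference time, which is what lets ALiBi-style models extrapolate to long inputs.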

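The following embedding models are also available:

jina-embeddings-v2-small-en: 33 million parameters.
jina-embeddings-v2-base-en: 137 million parameters.
jina-embeddings-v2-base-zh: 161 million parameters, Chinese-English bilingual embeddings.
jina-embeddings-v2-base-de: 161 million parameters, German-English bilingual embeddings (this model).
jina-embeddings-v2-base-es: Spanish-English bilingual embeddings (coming soon).
jina-embeddings-v2-base-code: 161 million parameters, code embeddings.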

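📦 Installation

Install the transformers library before use (the leading ! runs the command from a notebook cell; drop it in a regular shell):

!pip install transformers

To use sentence-transformers, install it or update it to the latest version:

!pip install -U sentence-transformers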

💻 Usage examples

Basic usage

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # The first element of the model output holds the per-token embeddings
    token_embeddings = model_output[0]
    # Expand the attention mask over the embedding dimension so padding tokens contribute zero
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum the unmasked token embeddings and divide by the number of real tokens
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['How is the weather today?', 'What is the current weather like today?']

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-de')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True, torch_dtype=torch.bfloat16)

# Tokenize the batch, padding to a common length
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Inference only, so gradients are not needed
with torch.no_grad():
    model_output = model(**encoded_input)

# Pool token embeddings into one vector per sentence, then L2-normalize
embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)

Advanced usage

import torch
from transformers import AutoModel
from numpy.linalg import norm

def cos_sim(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms
    return (a @ b.T) / (norm(a) * norm(b))

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True, torch_dtype=torch.bfloat16)
# encode() handles tokenization, the forward pass, and mean pooling in one call
embeddings = model.encode(['How is the weather today?', 'Wie ist das Wetter heute?'])
print(cos_sim(embeddings[0], embeddings[1]))
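
The cosine similarity computed above is the dot product of two vectors divided by the product of their lengths. A dependency-free sketch with made-up 3-dimensional vectors (real embeddings from this model have many more dimensions) shows the calculation and how it can rank candidates against a query. The vectors and function names are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for sentence embeddings (illustrative values only).
query = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
print(rank_by_similarity(query, candidates))  # [1, 2, 0]
```

The second candidate points in the same direction as the query, so it ranks first even though its length differs; cosine similarity ignores vector magnitude.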

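Handling shorter sequences

To cap the input length, for example at 2048 tokens, pass max_length to encode: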

embeddings = model.encode(
    ['Very long ... document'],
    max_length=2048
)

Using `sentence-transformers`

# Install first: pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-de",  # switch to en/zh for the English or Chinese model
    trust_remote_code=True
)

# Control the input sequence length, up to 8192
model.max_seq_length = 1024

embeddings = model.encode([
    'How is the weather today?',
    'Wie ist das Wetter heute?'
])
print(cos_sim(embeddings[0], embeddings[1]))

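📚 Documentation

Data and parameters

Data and training details are described in the technical report (arXiv:2402.17016; see the citation at the end).

Alternative ways to use the model

Managed SaaS: get a free key on Jina AI's Embedding API to get started.
Private, high-performance deployment: pick a model from the model suite and deploy it on AWS SageMaker.

Benchmark results

The bilingual model was evaluated on all available German and English tasks in the MTEB benchmark. It was also compared against several other German, English, and multilingual models on additional German evaluation tasks.

[Figure: benchmark results]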

Using Jina Embeddings for RAG

According to a recent LlamaIndex blog post:

In summary, to achieve peak performance in both hit rate and MRR, combining OpenAI or JinaAI-Base embeddings with the CohereRerank/bge-reranker-large reranker works best.
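
Hit rate and MRR (mean reciprocal rank) are the retrieval metrics named in the quote. As a minimal sketch under the usual definitions (hit rate: the fraction of queries whose relevant document appears in the top k results; MRR: the mean of 1/rank of the relevant result, counting 0 when it is not retrieved), using made-up document IDs:

```python
def hit_rate(results, relevant, k):
    """Fraction of queries whose relevant doc appears among the top-k retrieved docs."""
    hits = sum(1 for docs, rel in zip(results, relevant) if rel in docs[:k])
    return hits / len(results)


def mean_reciprocal_rank(results, relevant):
    """Average of 1/rank of the relevant doc per query (0 if it was not retrieved)."""
    total = 0.0
    for docs, rel in zip(results, relevant):
        if rel in docs:
            total += 1.0 / (docs.index(rel) + 1)
    return total / len(results)


# Hypothetical retrieval output for three queries, best match first.
results = [["d1", "d2", "d3"], ["d5", "d4", "d6"], ["d7", "d8", "d9"]]
relevant = ["d1", "d4", "d0"]  # one relevant doc per query; "d0" is never retrieved

print(hit_rate(results, relevant, k=2))         # 2 of 3 queries hit
print(mean_reciprocal_rank(results, relevant))  # (1 + 1/2 + 0) / 3 = 0.5
```

Hit rate only asks whether the relevant document was found; MRR also rewards placing it near the top, which is where a reranker helps.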

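🔧 Technical details

Why mean pooling?

Mean pooling takes all token embeddings from the model output and averages them at the sentence or paragraph level. It has proven in practice to be the most effective way to produce high-quality sentence embeddings. The model provides an encode function that handles this step.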

If you do not use the built-in encode function, apply mean pooling yourself with the mean_pooling helper from the first usage example above, followed by L2 normalization.
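
To make the role of the attention mask concrete, here is a dependency-free toy version of masked mean pooling for a single sentence. The token vectors are invented; the point is that padding positions (mask 0) are excluded from the average.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors where mask == 1; padding positions are ignored."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for d in range(dim):
                summed[d] += vec[d]
    count = max(sum(attention_mask), 1e-9)  # guard against an all-padding input
    return [s / count for s in summed]


# Three real tokens followed by one padding token (illustrative values only).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
print(masked_mean_pool(tokens, mask))  # [3.0, 4.0]
```

Without the mask, the padding vector [9.0, 9.0] would pull the average away from the real tokens, and the result would depend on how much padding the batch happened to need.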

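📄 License

This project is licensed under the Apache 2.0 license.

Contact

Join our Discord community to exchange ideas with other community members.

Citation

If you find Jina Embeddings useful in your research, please cite the following paper: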

@article{mohr2024multi,
  title={Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings},
  author={Mohr, Isabelle and Krimmel, Markus and Sturua, Saba and Akram, Mohammad Kalim and Koukounas, Andreas and G{\"u}nther, Michael and Mastrapas, Georgios and Ravishankar, Vinit and Mart{\'\i}nez, Joan Fontanals and Wang, Feng and others},
  journal={arXiv preprint arXiv:2402.17016},
  year={2024}
}