gte-multilingual-base: Open-Source Multilingual Sentence Embedding Model - Free to Deploy, Similarity Computation in Over 50 Languages


Gte Multilingual Base

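Developed by Alibaba-NLP

GTE Multilingual Base is a multilingual sentence embedding model that supports more than 50 languages and is suited to tasks such as sentence similarity computation.

Text Embedding

Transformers

Multilingual
License: Apache-2.0
#Multilingual sentence similarity #Dense vector retrieval #Cross-lingual text matching

Downloads: 1.2M

Release date: July 20, 2024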

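Model Overview

This is a multilingual sentence embedding model based on the Transformer architecture. It maps sentences from different languages into a shared vector space, which makes cross-lingual sentence similarity computation and information retrieval straightforward.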

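Model Features

Multilingual support

Produces sentence embeddings for more than 50 languages, enabling cross-lingual semantic understanding

Multi-task suitability

Suitable for a range of natural language processing tasks, including sentence similarity, clustering, classification, and retrieval

Strong performance

Performs well on multiple benchmarks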

Model Capabilities

Sentence similarity computation

Text clustering

Text classification

Information retrieval

Text reranking

Bitext mining

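Use Cases

Information retrieval

Cross-lingual document retrieval

Retrieve relevant documents from collections written in different languages

NDCG@10 of 53.638 on AlloprofRetrieval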
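
For readers unfamiliar with the metric, NDCG@10 can be sketched as below. This is a minimal illustration using linear gain, not the evaluation code behind the reported figure; the function names and the toy relevance labels are made up for the example. The reported value appears to be on a 0 to 100 scale, so the sketch multiplies by 100 before printing.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels of the documents a retriever returned, in rank order.
ranked_relevances = [0, 1, 0, 0, 1]
score = ndcg_at_k(ranked_relevances, k=10)
print(round(100 * score, 3))
```

A ranking that places every relevant document first scores 1.0 (100 on the percentage scale); relevant documents pushed further down the list are discounted logarithmically.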

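Text classification

Product review classification

Classify the sentiment of multilingual product reviews

Accuracy of 80.72% on AmazonPolarityClassification

Sentence similarity

Cross-lingual sentence matching

Compute the semantic similarity between sentences in different languages

Spearman correlation of 81.21 on BIOSSES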

🚀 gte-multilingual-base

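The gte-multilingual-base model is the latest member of the GTE (General Text Embedding) family of models, with the following key characteristics:

High performance: Achieves state-of-the-art (SOTA) results on multilingual retrieval tasks and multi-task representation model evaluations when compared with models of similar size.
Training architecture: Trained with an encoder-only Transformer architecture, resulting in a smaller model. Unlike earlier models based on decoder-only large language model (LLM) architectures (such as gte-qwen2-1.5b-instruct), it has lower hardware requirements for inference and offers a 10x increase in inference speed.
Long context: Supports text lengths of up to 8192 tokens.
Multilingual capability: Supports over 70 languages.
Elastic dense embedding: Supports elastic output dense representations while maintaining the effectiveness of downstream tasks, which significantly reduces storage costs and improves execution efficiency.
Sparse vectors: In addition to dense representations, it can also generate sparse vectors.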
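
The elastic dense embedding idea can be illustrated without loading the model: keep the leading components of each vector and re-normalize before comparing. The following is a minimal sketch under that assumption; `truncate_and_normalize`, `cosine`, and the random stand-in vectors are hypothetical and only show the mechanics, not the model's actual quality at each size.

```python
import math
import random

def truncate_and_normalize(vector, dimension):
    """Keep the first `dimension` components, then rescale to unit L2 norm."""
    if not 1 <= dimension <= len(vector):
        raise ValueError("dimension must be between 1 and the full vector length")
    head = vector[:dimension]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def cosine(a, b):
    """Dot product of two unit-length vectors, i.e. their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

# Random stand-ins for two 768-dimensional model outputs (not real embeddings).
random.seed(0)
doc_a = [random.gauss(0.0, 1.0) for _ in range(768)]
doc_b = [random.gauss(0.0, 1.0) for _ in range(768)]

for dimension in (768, 256, 128):
    a = truncate_and_normalize(doc_a, dimension)
    b = truncate_and_normalize(doc_b, dimension)
    print(dimension, len(a), round(cosine(a, b), 4))
```

Storing 128 components instead of 768 cuts the per-vector footprint to one sixth, which is where the storage saving comes from.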

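🚀 Quick Start

Model Information

Property	Details
Model size	305M parameters
Embedding dimension	768
Max input tokens	8192

Usage Notes

It is recommended to install xformers and enable unpadding to speed up inference; see enable-unpadding-and-xformers.
Offline usage: new-impl/discussions/2
Usage with TEI: refs/pr/7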

Code Examples

💻 Get dense embeddings with Transformers

# Requires transformers>=4.36.0

import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "北京",
    "快排算法介紹"
]

model_name_or_path = 'Alibaba-NLP/gte-multilingual-base'
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True)

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)

dimension = 768  # Output embedding dimension; should be in [128, 768]
# Take the first-token (CLS position) vector of every text and keep its first `dimension` components.
embeddings = outputs.last_hidden_state[:, 0, :dimension]

# L2-normalize so that the dot product below is the cosine similarity.
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = embeddings[:1] @ embeddings[1:].T
print(scores.tolist())

# [[0.3016996383666992, 0.7503870129585266, 0.3203084468841553]]

Using sentence-transformers

# Requires sentence-transformers>=3.0.0

from sentence_transformers import SentenceTransformer

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "北京",
    "快排算法介紹"
]

model_name_or_path="Alibaba-NLP/gte-multilingual-base"
model = SentenceTransformer(model_name_or_path, trust_remote_code=True)
embeddings = model.encode(input_texts, normalize_embeddings=True) # embeddings.shape (4, 768)

# sim scores
scores = model.similarity(embeddings[:1], embeddings[1:])

print(scores.tolist())
# [[0.301699697971344, 0.7503870129585266, 0.32030850648880005]]

Using infinity

Usage via Docker and infinity, an MIT-licensed project.

docker run --gpus all -v $PWD/data:/app/.cache -p "7997":"7997" \
michaelf34/infinity:0.0.69 \
v2 --model-id Alibaba-NLP/gte-multilingual-base --revision "main" --dtype float16 --batch-size 32 --device cuda --engine torch --port 7997
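
Once the container is running, it can be queried over HTTP. The sketch below assumes the server exposes an OpenAI-style `POST /embeddings` route on port 7997 and returns a `data` list of objects with `index` and `embedding` fields; check the infinity documentation for the version you run, since the route and response shape are assumptions here. The helper names `build_request`, `parse_embeddings`, and `embed` are made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7997"  # port taken from the docker command above
MODEL_ID = "Alibaba-NLP/gte-multilingual-base"

def build_request(texts, base_url=BASE_URL, model=MODEL_ID):
    """Build a POST request for the (assumed) OpenAI-style /embeddings route."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_embeddings(body):
    """Return the vectors in input order from an OpenAI-style response body."""
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]

def embed(texts):
    """Send the texts to the running container and return their embeddings."""
    with urllib.request.urlopen(build_request(texts), timeout=30) as response:
        return parse_embeddings(json.load(response))
```

With the server up, `embed(["what is the capital of China?", "北京"])` should return one vector per input text.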

Get dense embeddings and sparse token weights with custom code

# You can find the script gte_embedding.py in https://huggingface.co/Alibaba-NLP/gte-multilingual-base/blob/main/scripts/gte_embedding.py

from gte_embedding import GTEEmbeddidng

model_name_or_path = 'Alibaba-NLP/gte-multilingual-base'
model = GTEEmbeddidng(model_name_or_path)
query = "中國的首都在哪兒"

docs = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "北京",
    "快排算法介紹"
]

embs = model.encode(docs, return_dense=True,return_sparse=True)
print('dense_embeddings vecs', embs['dense_embeddings'])
print('token_weights', embs['token_weights'])
pairs = [(query, doc) for doc in docs]
dense_scores = model.compute_scores(pairs, dense_weight=1.0, sparse_weight=0.0)
sparse_scores = model.compute_scores(pairs, dense_weight=0.0, sparse_weight=1.0)
hybrid_scores = model.compute_scores(pairs, dense_weight=1.0, sparse_weight=0.3)

print('dense_scores', dense_scores)
print('sparse_scores', sparse_scores)
print('hybrid_scores', hybrid_scores)

# dense_scores [0.85302734375, 0.257568359375, 0.76953125, 0.325439453125]
# sparse_scores [0.0, 0.0, 4.600879669189453, 1.570279598236084]
# hybrid_scores [0.85302734375, 0.257568359375, 2.1497951507568356, 0.7965233325958252]
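
The printed numbers are consistent with the hybrid score being a weighted sum of the dense and sparse scores. The sketch below checks that against the output above; `hybrid_score` is a hypothetical helper, and `sparse_overlap` shows one common way to score sparse token weights (summing weight products over shared tokens), which is an assumption about the script rather than a copy of it.

```python
import math

def hybrid_score(dense, sparse, dense_weight=1.0, sparse_weight=0.3):
    """Weighted sum of a dense similarity and a sparse lexical score."""
    return dense_weight * dense + sparse_weight * sparse

def sparse_overlap(query_weights, doc_weights):
    """Sum of weight products over tokens present in both texts (assumed form)."""
    return sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)

# Scores copied from the printed output above.
dense_scores = [0.85302734375, 0.257568359375, 0.76953125, 0.325439453125]
sparse_scores = [0.0, 0.0, 4.600879669189453, 1.570279598236084]
reported = [0.85302734375, 0.257568359375, 2.1497951507568356, 0.7965233325958252]

for dense, sparse, expected in zip(dense_scores, sparse_scores, reported):
    # Each reported hybrid score equals 1.0 * dense + 0.3 * sparse.
    assert math.isclose(hybrid_score(dense, sparse), expected, rel_tol=1e-9)

# Toy token weights (hypothetical) showing why texts with no shared tokens score 0.0.
print(sparse_overlap({"a": 2.0, "b": 1.0}, {"a": 3.0, "c": 5.0}))
print(sparse_overlap({"a": 2.0}, {"c": 5.0}))
```

Because the sparse term is not bounded to [0, 1], the `sparse_weight` controls how strongly exact token matches can outweigh the dense similarity; here the third document overtakes the first once the sparse term is added.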

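📚 Detailed Documentation

Evaluation

We validated the performance of the gte-multilingual-base model on multiple downstream tasks, including multilingual retrieval, cross-lingual retrieval, long-text retrieval, and general text representation evaluation on the MTEB leaderboard.

Retrieval Tasks

Retrieval results on MIRACL and MLDR (multilingual), MKQA (cross-lingual), and BEIR and LoCo (English).

Detailed results on MLDR

Detailed results on LoCo

MTEB Evaluation

Results on the MTEB English, Chinese, French, and Polish tasks.

More detailed experimental results can be found in the paper.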

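Cloud API Services

In addition to the open-source GTE series models, GTE series models are also available as commercial API services on Alibaba Cloud.

Embedding models: Three versions of the text embedding model are available: text-embedding-v1/v2/v3, with v3 being the latest API service.
Reranking models: The gte-rerank model service is available.

Note that the models behind the commercial APIs are not entirely identical to the open-source models.

🔧 Technical Details

Citation

If you find our paper or models helpful, please consider citing: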

@inproceedings{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  pages={1393--1412},
  year={2024}
}