gte-base-zh: open-source Chinese text embedding model for semantic similarity, classification, and information retrieval

Gte Base Zh

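Developed by thenlper

gte-base-zh is a general-purpose text embedding model optimized for Chinese. It supports a range of natural language processing tasks, including semantic similarity computation, text classification, and information retrieval.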

Text embedding

Safetensors

English

License: MIT

Tags: #Chinese semantic similarity #Financial text matching #Medical QA reranking

Downloads: 22.03k

Released: 11/8/2023

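Model overview

The model focuses on producing high-quality sentence embeddings for semantic understanding and representation of Chinese text. It performs well on several Chinese benchmarks (an average of 65.92 across the 35 CMTEB datasets reported below), particularly on semantic similarity and information retrieval tasks.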

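Model features

Multi-task support

Supports a range of natural language processing tasks, including semantic similarity computation, text classification, and information retrieval.

Chinese optimization

Optimized specifically for Chinese text, with strong results on several Chinese benchmarks.

Efficient retrieval

Performs well on information retrieval tasks, notably medical question retrieval.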

Model capabilities

Semantic textual similarity

Text classification

Information retrieval

Text clustering

Reranking

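Use cases

Finance

Financial question matching

Matches similar financial questions, as in the AFQMC dataset task

Pearson correlation of cosine similarity reaches 44.46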
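
To make the metric concrete, here is a minimal sketch of how a cosine-similarity Pearson score can be computed for a pair-matching task. The 2-d vectors and labels are made up for illustration and stand in for real model embeddings and AFQMC labels; the helper names `cosine` and `pearson` are hypothetical, and the 0-100 scaling is an assumption about how the figure is reported.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy 2-d "embeddings" for three question pairs (illustrative only).
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),  # near-duplicate questions
    ([1.0, 0.0], [0.0, 1.0]),  # unrelated questions
    ([0.5, 0.5], [0.6, 0.4]),  # near-duplicate questions
]
gold = [1, 0, 1]  # 1 = same meaning, 0 = different meaning

similarities = [cosine(u, v) for u, v in pairs]
# Assumption: the score is the correlation multiplied by 100.
score = 100 * pearson(similarities, gold)
print(f"{score:.2f}")
```

A higher score means the model's similarity ranking tracks the human labels more closely.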

Healthcare

Medical question-answer retrieval

Retrieves answers relevant to medical questions

MAP reaches 86.79 on CMedQAv1 and 87.20 on CMedQAv2
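
For readers unfamiliar with MAP (mean average precision), the sketch below shows the standard computation on two toy queries. The document IDs and relevance sets are invented, and `average_precision` and `mean_average_precision` are hypothetical helper names rather than part of any benchmark toolkit.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k taken at each rank k where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(runs):
    """Mean of per-query average precision over (ranking, relevant set) pairs."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)

# Two toy queries: the model's ranked answers and the truly relevant answers.
runs = [
    (["a1", "a2", "a3"], ["a1", "a3"]),  # relevant answers at ranks 1 and 3
    (["b2", "b1", "b3"], ["b1"]),        # relevant answer at rank 2
]
map_score = mean_average_precision(runs)
print(f"{100 * map_score:.2f}")
```

In a reranking setup, `ranked_ids` would come from sorting candidate answers by their embedding similarity to the question.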

E-commerce

Product review classification

Classifies Chinese Amazon product reviews

Accuracy reaches 45.82%

🚀 gte-base-zh

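General Text Embeddings (GTE) is a family of text embedding models trained by Alibaba DAMO Academy on the BERT framework. The models are trained on a large-scale corpus of relevant text pairs and suit downstream tasks such as information retrieval, semantic textual similarity, and text reranking.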

🚀 Quick start

Install dependencies

Install the required libraries with:

pip install transformers torch sentence-transformers

Code example

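import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# The first text is the query; the remaining three are candidates.
input_texts = [
    "中國的首都是哪裡",  # "Where is the capital of China?"
    "你喜歡去哪裡旅遊",  # "Where do you like to travel?"
    "北京",  # "Beijing"
    "今天中午吃什麼",  # "What's for lunch today?"
]

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-base-zh")
model = AutoModel.from_pretrained("thenlper/gte-base-zh")

# Tokenize the input texts, truncating anything beyond 512 tokens
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**batch_dict)

# Use the hidden state of the first ([CLS]) token as the sentence embedding
embeddings = outputs.last_hidden_state[:, 0]

# (Optionally) normalize embeddings so the dot product equals cosine similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
# Score the query against the three candidates, scaled by 100
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())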

Using sentence-transformers

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['中國的首都是哪裡', '中國的首都是北京']

model = SentenceTransformer('thenlper/gte-base-zh')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
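
Once embeddings are available, retrieval reduces to ranking candidates by cosine similarity to the query. The sketch below shows that ranking step on toy 3-d vectors that stand in for `model.encode` output; `normalize` and `top_k` are hypothetical helpers, not part of sentence-transformers.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, corpus_vecs, k=2):
    """Return (index, cosine similarity) for the k most similar corpus vectors."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(corpus_vecs):
        d = normalize(vec)
        # With unit-norm vectors the dot product is the cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for real sentence embeddings (illustrative only).
corpus = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.3, 0.1]]
query = [1.0, 0.0, 0.0]

results = top_k(query, corpus, k=2)
for idx, similarity in results:
    print(idx, round(similarity, 4))
```

For a large corpus, a vectorized matrix product or an approximate nearest-neighbor index would replace the Python loop.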

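✨ Key features

Multilingual support: models of different sizes are currently available for Chinese and English.
Broad applicability: trained on a large-scale corpus of relevant text pairs, the models apply to downstream tasks such as information retrieval, semantic textual similarity, and text reranking.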

📚 Detailed documentation

Model list

Model	Language	Max sequence length	Dimension	Model size
GTE-large-zh	Chinese	512	1024	0.67GB
GTE-base-zh	Chinese	512	768	0.20GB
GTE-small-zh	Chinese	512	512	0.10GB
GTE-large	English	512	1024	0.67GB
GTE-base	English	512	768	0.22GB
GTE-small	English	512	384	0.07GB

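Evaluation

On the MTEB benchmark (CMTEB for Chinese), the GTE models were compared with other popular text embedding models. For more detailed comparison results, see the MTEB leaderboard.

Model	Model size (GB)	Embedding dimension	Sequence length	Average (35 datasets)	Classification (9 datasets)	Clustering (4 datasets)	Pair classification (2 datasets)	Reranking (4 datasets)	Retrieval (8 datasets)	STS (8 datasets)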
gte-large-zh	0.65	1024	512	66.72	71.34	53.07	81.14	67.42	72.49	57.82
gte-base-zh	0.20	768	512	65.92	71.26	53.86	80.44	67.00	71.71	55.96
stella-large-zh-v2	0.65	1024	1024	65.13	69.05	49.16	82.68	66.41	70.14	58.66
stella-large-zh	0.65	1024	1024	64.54	67.62	48.65	78.72	65.98	71.02	58.3
bge-large-zh-v1.5	1.3	1024	512	64.53	69.13	48.99	81.6	65.84	70.46	56.25
stella-base-zh-v2	0.21	768	1024	64.36	68.29	49.4	79.96	66.1	70.08	56.92
stella-base-zh	0.21	768	1024	64.16	67.77	48.7	76.09	66.95	71.07	56.54
piccolo-large-zh	0.65	1024	512	64.11	67.03	47.04	78.38	65.98	70.93	58.02
piccolo-base-zh	0.2	768	512	63.66	66.98	47.12	76.61	66.68	71.2	55.9
gte-small-zh	0.1	512	512	60.08	64.49	48.95	69.99	66.21	65.50	49.72
bge-small-zh-v1.5	0.1	512	512	57.82	63.96	44.18	70.4	60.92	61.77	49.1
m3e-base	0.41	768	512	57.79	67.52	47.68	63.99	59.54	56.91	50.47
text-embedding-ada-002(openai)	-	1536	8192	53.02	64.31	45.68	69.56	54.28	52.0	43.35
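
As a sanity check on how the Average column relates to the per-category columns, the sketch below takes the gte-base-zh row and computes the mean of its six category scores weighted by dataset count. That the average is computed this way is an assumption, not something the table states. Because the category scores are already rounded, the result only approximates the reported 65.92, and other rows need not match as closely.

```python
# gte-base-zh category scores and dataset counts, copied from the table above.
# Assumption: the "Average" column is the dataset-count-weighted mean of these.
gte_base_zh = {
    "classification": (71.26, 9),
    "clustering": (53.86, 4),
    "pair_classification": (80.44, 2),
    "reranking": (67.00, 4),
    "retrieval": (71.71, 8),
    "sts": (55.96, 8),
}

total_datasets = sum(count for _, count in gte_base_zh.values())
weighted_mean = (
    sum(score * count for score, count in gte_base_zh.values()) / total_datasets
)
print(total_datasets, round(weighted_mean, 2))
```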

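🔧 Technical details

The General Text Embeddings (GTE) models are trained by Alibaba DAMO Academy, mainly on the BERT framework. The training corpus consists of large-scale relevant text pairs covering a wide range of domains and scenarios, which lets the models serve a variety of downstream text embedding tasks.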
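
The cited paper's title mentions multi-stage contrastive learning, but this page does not give the training objective. As a rough illustration of what contrastive training on text pairs optimizes, here is a generic in-batch InfoNCE loss. The loss form, the temperature value, and the function name `info_nce` are assumptions for illustration, not the authors' exact recipe.

```python
import math

def info_nce(sim_matrix, temperature=0.05):
    """Mean in-batch InfoNCE loss.

    sim_matrix[i][j] is the similarity between query i and document j;
    document i is the positive for query i, all other documents are negatives.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        # Log-sum-exp with the max subtracted for numerical stability.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy similarity matrices for a batch of two text pairs (illustrative only).
well_aligned = [[0.9, 0.1], [0.2, 0.8]]   # positives score highest
misaligned = [[0.1, 0.9], [0.8, 0.2]]     # negatives score highest

print(info_nce(well_aligned) < info_nce(misaligned))
```

The loss falls as each query's paired text scores higher than the other texts in the batch, which is what pushes related pairs together in the embedding space.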

📄 License

This project is released under the MIT license.

⚠️ Important note

This model works only on Chinese text, and any input longer than 512 tokens is truncated to 512 tokens.
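
One common workaround for the 512-token limit, not described in the source, is to split a long input into overlapping windows, embed each window, and average the results. The sketch below covers only the windowing and averaging steps. `window_token_ids` and `mean_pool` are hypothetical helpers, and the default of 510 assumes a BERT-style tokenizer that reserves two positions for the [CLS] and [SEP] tokens.

```python
def window_token_ids(token_ids, max_len=510, stride=255):
    """Split a token-id list into overlapping windows of at most max_len ids."""
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        # Stop once the current window reaches the end of the input.
        if start + max_len >= len(token_ids):
            break
        start += stride
    return windows

def mean_pool(vectors):
    """Average several equal-length vectors into one."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

# A fake 1000-token document (ids are placeholders, not real vocabulary ids).
token_ids = list(range(1000))
windows = window_token_ids(token_ids)
print(len(windows), [len(w) for w in windows])
```

Each window would then be embedded separately and the resulting vectors combined with `mean_pool`. Whether averaging preserves enough meaning depends on the task, so it is worth checking against a held-out set.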

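💡 Usage tips

Choose the model size that fits the task at hand. For example, per the CMTEB table, gte-small-zh (0.1 GB, average 60.08) trades accuracy for size against gte-large-zh (0.65 GB, average 66.72).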
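
One concrete factor in that choice is how much memory the stored embeddings need. The sketch below estimates it from the embedding dimensions listed in the evaluation table, assuming vectors are stored as float32 (4 bytes per value) with no index overhead; `index_size_mib` is a hypothetical helper.

```python
# Embedding dimensions as listed in the evaluation table.
EMBEDDING_DIMS = {
    "gte-large-zh": 1024,
    "gte-base-zh": 768,
    "gte-small-zh": 512,
}

def index_size_mib(num_vectors, dim, bytes_per_value=4):
    """Raw storage in MiB for num_vectors embeddings (float32 by default)."""
    return num_vectors * dim * bytes_per_value / 2**20

for name, dim in EMBEDDING_DIMS.items():
    size = index_size_mib(1_000_000, dim)
    print(f"{name}: {size:.0f} MiB for 1M vectors")
```

Storage grows linearly with dimension, so the smaller models also reduce the cost of the vector index, not only the cost of inference.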

Citation

If you find our paper or models helpful, please consider citing them as follows:

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}