gte-large: Open-Source Sentence Transformer Model - Free to Use, Accurate Sentence Similarity and Text Embedding

Gte Large

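Developed by thenlper

GTE-Large is a sentence-transformer model focused on sentence similarity and text embedding tasks, with strong results across multiple benchmarks.

Text Embedding, English, License: MIT #Multi-task text embedding #High-precision semantic similarity #Cross-domain classification

Downloads: 1.5M

Released: 7/27/2023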

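Model Overview

GTE-Large is a general-purpose text embedding model that converts sentences into high-dimensional vector representations for similarity computation, classification, clustering, and information retrieval.

Model Features

Strong multi-task performance

Balanced performance across NLP tasks such as classification, retrieval, and clustering

High-quality sentence embeddings

The generated sentence embeddings capture semantic information effectively

Extensively benchmarked

Evaluated comprehensively on MTEB and other benchmark suites

Model Capabilities

Sentence similarity computation

Text classification

Information retrieval

Text clustering

Sentence vectorization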

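Use Cases

E-commerce

Product review sentiment analysis

Analyzes the sentiment polarity of Amazon product reviews

92.5% accuracy on the AmazonPolarity dataset

Counterfactual review detection

Identifies counterfactual reviews on Amazon

72.6% accuracy on the AmazonCounterfactual dataset

Academic research

Paper clustering

Clusters arXiv and bioRxiv papers by topic

V-measure of 48.6 on the arXiv P2P clustering task

Customer service

Banking query classification

Automatically classifies banking customer queries

86.1% accuracy on the Banking77 dataset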

🚀 gte-large

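The General Text Embeddings (GTE) model converts text into vector representations and performs well on natural language processing tasks such as information retrieval and semantic similarity. It is trained with multi-stage contrastive learning on a large-scale corpus of relevance text pairs, which gives it broad applicability and strong generalization.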
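
The idea behind contrastive training can be illustrated with a generic in-batch InfoNCE loss. This is only a minimal sketch on toy vectors, not the exact objective or training code used for GTE; the function names and the temperature value are assumptions chosen for illustration.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def info_nce(queries, docs, temperature=0.05):
    """Mean in-batch contrastive loss (generic InfoNCE, assumed form).

    queries[i] and docs[i] form a positive pair; every other docs[j]
    in the batch serves as a negative for queries[i].
    Vectors are assumed to be L2-normalized already.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy unit vectors: aligned pairs should give a lower loss than swapped pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(queries, aligned)
loss_swapped = info_nce(queries, swapped)
print(loss_aligned < loss_swapped)  # True
```

Minimizing a loss of this shape pulls each query toward its paired text and pushes it away from the other texts in the batch.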

🚀 Quick Start

Code Example

import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large")
model = AutoModel.from_pretrained("thenlper/gte-large")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
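
To make the pooling step concrete, here is a dependency-free sketch of what `average_pool` computes for a single sequence: padded positions are excluded and the remaining token vectors are averaged. The helper name `masked_mean` and the toy hidden states and mask are invented for illustration.

```python
def masked_mean(hidden_states, attention_mask):
    """Average token vectors where attention_mask is 1, ignoring padding.

    hidden_states: list of token vectors for one sequence
    attention_mask: list of 0/1 flags, one per token
    """
    dim = len(hidden_states[0])
    totals = [0.0] * dim
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            for k in range(dim):
                totals[k] += vector[k]
    count = sum(attention_mask)
    return [t / count for t in totals]

# Three token positions; the last one is padding and must not affect the result.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean(hidden_states, attention_mask)
print(pooled)  # [2.0, 3.0]
```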

Using with sentence-transformers

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['That is a happy person', 'That is a very happy person']

model = SentenceTransformer('thenlper/gte-large')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
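
`cos_sim` returns the cosine of the angle between two embeddings. A plain-Python equivalent for a single pair, shown on made-up vectors rather than real model output, might look like this:

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for two sentence embeddings.
u = [3.0, 4.0]
v = [4.0, 3.0]
similarity = cosine_similarity(u, v)
print(round(similarity, 2))  # 0.96
```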

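✨ Key Features

Multiple sizes: The GTE models are trained by Alibaba DAMO Academy on the BERT framework and come in three sizes, GTE-large, GTE-base, and GTE-small, so you can pick the one that fits your needs.
Broad applicability: Trained on a large-scale corpus of relevance text pairs covering a wide range of domains and scenarios, the models suit a variety of downstream text-embedding tasks, including information retrieval, semantic textual similarity, and text reranking.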
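
For retrieval and reranking, the usual pattern is to embed the query and the candidate documents, then sort the candidates by similarity. The sketch below shows only the ranking step on hypothetical pre-computed vectors; `rank_documents` and the toy data are illustrative names, not part of any library.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rank_documents(query_vec, doc_vecs, top_k=2):
    """Return (doc_index, score) pairs sorted by descending similarity.

    With unit-length vectors, the dot product equals cosine similarity.
    """
    q = normalize(query_vec)
    scored = []
    for index, doc in enumerate(doc_vecs):
        d = normalize(doc)
        scored.append((index, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical embeddings; in practice these would come from the model.
query = [1.0, 0.0, 0.0]
documents = [
    [0.0, 1.0, 0.0],   # unrelated
    [0.9, 0.1, 0.0],   # close match
    [0.5, 0.5, 0.0],   # partial match
]
ranking = rank_documents(query, documents)
print([index for index, _ in ranking])  # [1, 2]
```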

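📚 Detailed Documentation

Metrics

We compared the GTE models with other popular text embedding models on the MTEB benchmark. For more detailed comparison results, see the MTEB leaderboard.

Model Name	Model Size (GB)	Dimension	Sequence Length	Average (56)	Clustering (11)	Pair Classification (3)	Reranking (4)	Retrieval (15)	STS (10)	Summarization (1)	Classification (12)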
gte-large	0.67	1024	512	63.13	46.84	85.00	59.13	52.22	83.35	31.66	73.33
gte-base	0.22	768	512	62.39	46.2	84.57	58.61	51.14	82.3	31.17	73.01
e5-large-v2	1.34	1024	512	62.25	44.49	86.03	56.61	50.56	82.05	30.19	75.24
e5-base-v2	0.44	768	512	61.5	43.80	85.73	55.91	50.29	81.05	30.28	73.84
gte-small	0.07	384	512	61.36	44.89	83.54	57.7	49.46	82.07	30.42	72.31
text-embedding-ada-002	-	1536	8192	60.99	45.9	84.89	56.32	49.25	80.97	30.8	70.93
e5-small-v2	0.13	384	512	59.93	39.92	84.67	54.32	49.04	80.39	31.16	72.94
sentence-t5-xxl	9.73	768	512	59.51	43.72	85.06	56.42	42.24	82.63	30.08	73.42
all-mpnet-base-v2	0.44	768	514	57.78	43.69	83.04	59.36	43.81	80.28	27.49	65.07
sgpt-bloom-7b1-msmarco	28.27	4096	2048	57.59	38.93	81.9	55.65	48.22	77.74	33.6	66.19
all-MiniLM-L12-v2	0.13	384	512	56.53	41.81	82.41	58.44	42.69	79.8	27.9	63.21
all-MiniLM-L6-v2	0.09	384	512	56.26	42.35	82.37	58.04	41.95	78.9	30.81	63.05
contriever-base-msmarco	0.44	768	512	56.00	41.1	82.54	53.14	41.88	76.51	30.36	66.68
sentence-t5-base	0.22	768	512	55.27	40.21	85.18	53.09	33.63	81.14	31.39	69.81
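
The Average (56) column appears consistent with a mean over all 56 datasets, which is the same as weighting each category score by its dataset count. A quick check using the gte-large row follows; the category scores are rounded, so the match is approximate.

```python
# (dataset count, gte-large score) per task category, taken from the table above.
categories = {
    "Clustering": (11, 46.84),
    "Pair Classification": (3, 85.00),
    "Reranking": (4, 59.13),
    "Retrieval": (15, 52.22),
    "STS": (10, 83.35),
    "Summarization": (1, 31.66),
    "Classification": (12, 73.33),
}

total_datasets = sum(count for count, _ in categories.values())
weighted_sum = sum(count * score for count, score in categories.values())
average = weighted_sum / total_datasets

print(total_datasets)      # 56
print(round(average, 2))   # 63.13
```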

Limitations

This model supports English text only, and any longer text is truncated to a maximum of 512 tokens.
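
One common workaround for the 512-token limit, which is not described in the source, is to split a long token sequence into overlapping windows, embed each window separately, and combine the results. Below is a minimal sketch of the windowing step alone, with a hypothetical `split_into_windows` helper and an arbitrary overlap value.

```python
def split_into_windows(token_ids, max_length=512, overlap=64):
    """Split token_ids into windows of at most max_length tokens.

    Consecutive windows share `overlap` tokens so that text cut at a
    boundary still appears with some context in the next window.
    """
    if overlap >= max_length:
        raise ValueError("overlap must be smaller than max_length")
    step = max_length - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break
    return windows

# 1200 fake token ids stand in for a tokenized long document.
token_ids = list(range(1200))
windows = split_into_windows(token_ids)
print([len(w) for w in windows])  # [512, 512, 304]
```

In practice the tokenizer also adds special tokens to each sequence, so the usable window is slightly smaller than 512.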

Citation

If you find our paper or models helpful, please consider citing them as follows:

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}