gte-small: An Open-Source Text Embedding Model for Sentence Similarity, Text Classification, and Retrieval

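GTE Small

Developed by thenlper

GTE-small is a small general-purpose text embedding model for natural language processing tasks such as sentence similarity, text classification, and retrieval.

Text embedding · English · License: MIT · #sentence-similarity #multi-task-text-embedding #high-accuracy-classification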

Downloads: 450.86k

Released: 7/27/2023

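Model Overview

GTE-small is a BERT-based text embedding model that can be loaded with the sentence-transformers library. It produces sentence-level embeddings for a range of downstream NLP tasks.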

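Model Features

Multi-task support

Supports several natural language processing tasks, including classification, retrieval, and clustering.

Strong performance for its size

Scores 61.36 on average across the 56 MTEB tasks with a 0.07 GB model, including 72.31 on the classification subset.

General-purpose text embeddings

Produces sentence-level embeddings that can be reused across many downstream applications.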

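Model Capabilities

Sentence similarity

Text classification

Information retrieval

Text clustering

Semantic textual similarity (STS) evaluation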

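Use Cases

E-commerce

Product review classification

Classifies the sentiment polarity of Amazon product reviews

Reaches 91.8% accuracy on the AmazonPolarity classification task

Counterfactual review detection

Identifies counterfactual reviews on Amazon

Reaches 73.2% accuracy on the AmazonCounterfactual classification task

Academic research

Paper clustering

Clusters arXiv and bioRxiv papers by topic

Reaches a V-measure of 47.9 on the arXiv paper clustering task

Question answering

Duplicate question detection

Identifies duplicate questions on the AskUbuntu forum

Reaches a mean average precision (MAP) of 61.7 on the reranking task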
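
For readers unfamiliar with the reranking metric quoted above, here is a minimal sketch of how average precision can be computed for one ranked list, and how it is averaged over queries. It assumes binary relevance labels and that every relevant item appears somewhere in the ranked list. The function names and the toy data are illustrative, not taken from the MTEB code.

```python
from typing import Sequence


def average_precision(ranked_relevance: Sequence[int]) -> float:
    """Average precision for a single query.

    ranked_relevance[i] is 1 if the item at rank i + 1 is relevant, else 0.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries: Sequence[Sequence[int]]) -> float:
    """Mean of the per-query average precision values."""
    return sum(average_precision(q) for q in queries) / len(queries)


# Toy ranking: relevant items at ranks 1 and 3.
ap = average_precision([1, 0, 1, 0])  # (1/1 + 2/3) / 2
map_score = mean_average_precision([[1, 0], [0, 1]])  # (1.0 + 0.5) / 2
```

A ranking that places duplicates of the query question near the top yields a value close to 1; benchmark tables usually report the value multiplied by 100.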

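🚀 gte-small

The General Text Embeddings (GTE) models are a family of models trained by Alibaba DAMO Academy. They are based mainly on the BERT framework and currently come in three sizes: GTE-large, GTE-base, and GTE-small. The models are trained on a large corpus of relevance text pairs covering a wide range of domains and scenarios, so they can be applied to downstream text embedding tasks such as information retrieval, semantic textual similarity, and text reranking. The approach is described in the paper Towards General Text Embeddings with Multi-stage Contrastive Learning.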

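🚀 Quick Start

GTE models are based mainly on the BERT framework and trained on a large corpus of relevance text pairs. They suit downstream text embedding tasks such as information retrieval, semantic textual similarity, and text reranking.

Code Example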

import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-small")
model = AutoModel.from_pretrained("thenlper/gte-small")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
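
To make the pooling step concrete, the following is a plain-Python sketch of what a masked mean pool followed by L2 normalization computes for a single sequence. It is an illustration on toy numbers, not the library implementation: the helper names `masked_mean_pool` and `l2_normalize` and the two-dimensional vectors are assumptions made for readability.

```python
import math
from typing import List


def masked_mean_pool(hidden: List[List[float]], mask: List[int]) -> List[float]:
    """Average the token vectors whose mask value is 1, skipping padding."""
    count = sum(mask)
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    dim = len(hidden[0])
    total = [0.0] * dim
    for vec, keep in zip(hidden, mask):
        if keep:
            for i in range(dim):
                total[i] += vec[i]
    return [t / count for t in total]


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Three token positions; the last one is padding and must not affect the mean.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(hidden, mask)  # [2.0, 3.0]
unit = l2_normalize(pooled)
```

Because the padding row is excluded, the result depends only on the real tokens, which is why the batch can be padded to a common length without changing the embeddings.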

Use with sentence-transformers

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['That is a happy person', 'That is a very happy person']

model = SentenceTransformer('thenlper/gte-small')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
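
Once sentences are encoded, a common next step is to rank candidate texts against a query by cosine similarity. The sketch below shows that ranking in plain Python on toy vectors. The names `cosine_similarity` and `top_k` are hypothetical helpers, and the two-dimensional vectors stand in for real model embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          docs: Sequence[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query, d)) for i, d in enumerate(docs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for embeddings produced by model.encode(...).
query_vec = [1.0, 0.0]
doc_vecs = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
best = top_k(query_vec, doc_vecs, k=2)  # document 1 first, then document 2
```

For large collections a vector index is usually preferable to this linear scan; the scan is shown only because it makes the scoring explicit.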

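✨ Main Features

Multiple model sizes: three sizes are available (GTE-large, GTE-base, and GTE-small), so you can choose according to your requirements.
Broad applicability: the models are trained on a large corpus of relevance text pairs covering a wide range of domains and scenarios, and can be applied to downstream text embedding tasks including information retrieval, semantic textual similarity, and text reranking.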

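📚 Detailed Documentation

Metrics

We compared the GTE models with other popular text embedding models on the MTEB benchmark. For more detailed comparison results, see the MTEB leaderboard.

Model Name	Model Size (GB)	Dimension	Sequence Length	Average (56)	Clustering (11)	Pair Classification (3)	Reranking (4)	Retrieval (15)	STS (10)	Summarization (1)	Classification (12)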
gte-large	0.67	1024	512	63.13	46.84	85.00	59.13	52.22	83.35	31.66	73.33
gte-base	0.22	768	512	62.39	46.2	84.57	58.61	51.14	82.3	31.17	73.01
e5-large-v2	1.34	1024	512	62.25	44.49	86.03	56.61	50.56	82.05	30.19	75.24
e5-base-v2	0.44	768	512	61.5	43.80	85.73	55.91	50.29	81.05	30.28	73.84
gte-small	0.07	384	512	61.36	44.89	83.54	57.7	49.46	82.07	30.42	72.31
text-embedding-ada-002	-	1536	8192	60.99	45.9	84.89	56.32	49.25	80.97	30.8	70.93
e5-small-v2	0.13	384	512	59.93	39.92	84.67	54.32	49.04	80.39	31.16	72.94
sentence-t5-xxl	9.73	768	512	59.51	43.72	85.06	56.42	42.24	82.63	30.08	73.42
all-mpnet-base-v2	0.44	768	514	57.78	43.69	83.04	59.36	43.81	80.28	27.49	65.07
sgpt-bloom-7b1-msmarco	28.27	4096	2048	57.59	38.93	81.9	55.65	48.22	77.74	33.6	66.19
all-MiniLM-L12-v2	0.13	384	512	56.53	41.81	82.41	58.44	42.69	79.8	27.9	63.21
all-MiniLM-L6-v2	0.09	384	512	56.26	42.35	82.37	58.04	41.95	78.9	30.81	63.05
contriever-base-msmarco	0.44	768	512	56.00	41.1	82.54	53.14	41.88	76.51	30.36	66.68
sentence-t5-base	0.22	768	512	55.27	40.21	85.18	53.09	33.63	81.14	31.39	69.81
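
The Dimension column also determines how much space stored embeddings take. The following back-of-the-envelope sketch estimates the raw size of a vector store for the three GTE sizes. It assumes uncompressed float32 values (4 bytes each) and a hypothetical collection of one million vectors, and it ignores any index overhead.

```python
def index_size_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for num_vectors dense vectors of the given dimension."""
    return num_vectors * dim * bytes_per_value


# Dimensions taken from the table above; one million vectors is an assumption.
for name, dim in [("gte-small", 384), ("gte-base", 768), ("gte-large", 1024)]:
    size_gb = index_size_bytes(1_000_000, dim) / 1e9
    print(f"{name}: {size_gb:.2f} GB")
# gte-small: 1.54 GB
# gte-base: 3.07 GB
# gte-large: 4.10 GB
```

Under these assumptions gte-small needs well under half the storage of gte-large, which is one reason to weigh the smaller model when the score gap in the table is acceptable.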

Limitations

This model works only for English text, and any longer input is truncated to at most 512 tokens.
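
One common way to work around the 512-token limit is to split a long document into overlapping windows, embed each window, and combine the results (for example by averaging). The sketch below shows only the windowing step. It is an assumption-laden illustration: `chunk_tokens` is a hypothetical helper, the overlap of 64 is arbitrary, and the input is a plain list of tokens rather than output from the model's tokenizer, which also reserves positions for special tokens.

```python
from typing import List


def chunk_tokens(tokens: List[str], max_len: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that text near a boundary
    appears in two windows.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the current window already reaches the end of the text
    return chunks


# Toy input: 1000 placeholder tokens.
tokens = [str(i) for i in range(1000)]
chunks = chunk_tokens(tokens)
lengths = [len(c) for c in chunks]  # [512, 512, 104]
```

Each window can then be passed to the model separately. Whether averaged window embeddings represent the full document well depends on the task, so this is worth validating on your own data.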

Citation

If you find our paper or models helpful, please consider citing them as follows:

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}