gte-small: an open-source text embedding model for sentence similarity, text classification, and retrieval

Gte Small

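Developed by thenlper

GTE-small is a small general-purpose text embedding model suited to a range of natural language processing tasks, including sentence similarity, text classification, and retrieval.

Text embedding | English | License: MIT | #sentence-similarity #multi-task-text-embedding #high-accuracy-classification

Downloads: 450.86k

Released: 7/27/2023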

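Model overview

GTE-small is a BERT-based text embedding model that can also be loaded through sentence-transformers. It produces sentence-level embeddings for a range of downstream NLP tasks.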

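Model features

Multi-task support

Supports a range of natural language processing tasks, including classification, retrieval, and clustering.

Strong benchmark results

Scores an average of 61.36 across the 56 MTEB datasets, including 72.31 on the classification tasks (see the metrics table further down).

General-purpose text embeddings

Produces sentence-level embeddings that can be reused across downstream applications.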

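Model capabilities

Sentence similarity

Text classification

Information retrieval

Text clustering

Semantic textual similarity (STS) evaluation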

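Use cases

E-commerce

Product review classification

Classifies the sentiment polarity of Amazon product reviews.

Reaches 91.8% accuracy on the AmazonPolarity classification task.

Counterfactual review detection

Identifies counterfactual statements in Amazon reviews.

Reaches 73.2% accuracy on the AmazonCounterfactual classification task.

Academic research

Paper clustering

Clusters arXiv and bioRxiv papers by topic.

Reaches a V-measure of 47.9 on the arXiv paper clustering task.

Question answering

Duplicate question detection

Identifies duplicate questions on the AskUbuntu forum.

Reaches a mean average precision (MAP) of 61.7 on the AskUbuntu reranking task.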

🚀 gte-small

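The General Text Embeddings (GTE) models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and currently come in three sizes: GTE-large, GTE-base, and GTE-small. The models are trained on a large-scale corpus of relevant text pairs covering a wide range of domains and scenarios, so they can be applied to downstream text embedding tasks such as information retrieval, semantic textual similarity, and text reranking. For details, see the paper "Towards General Text Embeddings with Multi-stage Contrastive Learning".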

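🚀 Quick start

GTE models are mainly based on the BERT framework and trained on a large-scale corpus of relevant text pairs. They suit downstream text embedding tasks such as information retrieval, semantic textual similarity, and text reranking.

Code example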

import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-small")
model = AutoModel.from_pretrained("thenlper/gte-small")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
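
The average_pool function above averages the token vectors of each sequence while ignoring padding positions. As a minimal sketch of the same idea in plain Python, without torch, the hypothetical helper masked_mean_pool below works on nested lists; the toy numbers are made up for illustration and are not model outputs.

```python
from typing import List


def masked_mean_pool(hidden: List[List[List[float]]],
                     mask: List[List[int]]) -> List[List[float]]:
    """Average token vectors per sequence, skipping positions where mask == 0."""
    pooled = []
    for token_vectors, token_mask in zip(hidden, mask):
        dim = len(token_vectors[0])
        total = [0.0] * dim
        for vector, keep in zip(token_vectors, token_mask):
            if keep:  # padding positions (mask 0) contribute nothing
                for i, value in enumerate(vector):
                    total[i] += value
        count = sum(token_mask)  # number of real (non-padding) tokens
        pooled.append([value / count for value in total])
    return pooled


# Toy batch: 2 sequences, 3 token positions, hidden size 2.
# The last position of the second sequence is padding and must be ignored.
hidden = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],
]
mask = [
    [1, 1, 1],
    [1, 1, 0],
]
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # [[3.0, 4.0], [3.0, 1.0]]
```

Dividing by the number of real tokens rather than the padded length keeps the embedding of a short text from shrinking when it is batched with longer ones.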

Usage with sentence-transformers

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['That is a happy person', 'That is a very happy person']

model = SentenceTransformer('thenlper/gte-small')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
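
For retrieval or reranking, the same cosine score is typically computed between one query embedding and many document embeddings, and the documents are then sorted by score. The sketch below shows that step in plain Python. The helper names cosine_similarity and rank_by_similarity are hypothetical, and the 3-dimensional vectors are made up, standing in for real 384-dimensional gte-small embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: Sequence[float],
                       documents: Sequence[Sequence[float]]) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, most similar first."""
    scored = [(i, cosine_similarity(query, doc)) for i, doc in enumerate(documents)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up vectors for illustration only.
query = [1.0, 0.0, 0.0]
documents = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query, different length
    [1.0, 1.0, 0.0],  # 45 degrees away from the query
]
ranking = rank_by_similarity(query, documents)
print([index for index, _ in ranking])  # [1, 2, 0]
```

Because cosine similarity ignores vector length, document 1 ranks first even though it is twice as long as the query. This is also why normalizing embeddings first, as in the transformers example, lets a plain dot product give the same ordering.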

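✨ Key features

Multiple model sizes: three sizes are available (GTE-large, GTE-base, and GTE-small), so you can pick the one that fits your requirements.
Broad applicability: the models are trained on a large-scale corpus of relevant text pairs covering a wide range of domains and scenarios, and can be applied to downstream text embedding tasks including information retrieval, semantic textual similarity, and text reranking.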

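📚 Detailed documentation

Metrics

We compared the GTE models with other popular text embedding models on the MTEB benchmark. For more detailed comparison results, see the MTEB leaderboard.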

Model name	Model size (GB)	Dimension	Sequence length	Average (56)	Clustering (11)	Pair classification (3)	Reranking (4)	Retrieval (15)	STS (10)	Summarization (1)	Classification (12)
gte-large	0.67	1024	512	63.13	46.84	85.00	59.13	52.22	83.35	31.66	73.33
gte-base	0.22	768	512	62.39	46.2	84.57	58.61	51.14	82.3	31.17	73.01
e5-large-v2	1.34	1024	512	62.25	44.49	86.03	56.61	50.56	82.05	30.19	75.24
e5-base-v2	0.44	768	512	61.5	43.80	85.73	55.91	50.29	81.05	30.28	73.84
gte-small	0.07	384	512	61.36	44.89	83.54	57.7	49.46	82.07	30.42	72.31
text-embedding-ada-002	-	1536	8192	60.99	45.9	84.89	56.32	49.25	80.97	30.8	70.93
e5-small-v2	0.13	384	512	59.93	39.92	84.67	54.32	49.04	80.39	31.16	72.94
sentence-t5-xxl	9.73	768	512	59.51	43.72	85.06	56.42	42.24	82.63	30.08	73.42
all-mpnet-base-v2	0.44	768	514	57.78	43.69	83.04	59.36	43.81	80.28	27.49	65.07
sgpt-bloom-7b1-msmarco	28.27	4096	2048	57.59	38.93	81.9	55.65	48.22	77.74	33.6	66.19
all-MiniLM-L12-v2	0.13	384	512	56.53	41.81	82.41	58.44	42.69	79.8	27.9	63.21
all-MiniLM-L6-v2	0.09	384	512	56.26	42.35	82.37	58.04	41.95	78.9	30.81	63.05
contriever-base-msmarco	0.44	768	512	56.00	41.1	82.54	53.14	41.88	76.51	30.36	66.68
sentence-t5-base	0.22	768	512	55.27	40.21	85.18	53.09	33.63	81.14	31.39	69.81
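
The number in parentheses in each column header is the number of datasets in that category, and the seven counts add up to 56. Assuming each category column is the plain mean over its datasets, the Average (56) column should equal the mean of the category scores weighted by dataset count. The sketch below checks this for two rows of the table. The helper name overall_average is hypothetical, and small differences from rounding in the published category scores are expected.

```python
from typing import Dict

# Dataset counts per category, taken from the table header.
category_counts: Dict[str, int] = {
    "Clustering": 11,
    "Pair classification": 3,
    "Reranking": 4,
    "Retrieval": 15,
    "STS": 10,
    "Summarization": 1,
    "Classification": 12,
}

# Category scores copied from the gte-small and gte-large rows of the table.
category_scores: Dict[str, Dict[str, float]] = {
    "gte-small": {
        "Clustering": 44.89, "Pair classification": 83.54, "Reranking": 57.7,
        "Retrieval": 49.46, "STS": 82.07, "Summarization": 30.42,
        "Classification": 72.31,
    },
    "gte-large": {
        "Clustering": 46.84, "Pair classification": 85.00, "Reranking": 59.13,
        "Retrieval": 52.22, "STS": 83.35, "Summarization": 31.66,
        "Classification": 73.33,
    },
}


def overall_average(scores: Dict[str, float], counts: Dict[str, int]) -> float:
    """Mean of the category scores, weighted by the number of datasets per category."""
    total_datasets = sum(counts.values())
    weighted_sum = sum(scores[name] * count for name, count in counts.items())
    return weighted_sum / total_datasets


for model_name, scores in category_scores.items():
    print(model_name, round(overall_average(scores, category_counts), 2))
```

Both results match the Average (56) column (61.36 for gte-small and 63.13 for gte-large). Retrieval, with 15 datasets, therefore moves the overall score far more than Summarization, which has a single dataset.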

Limitations

This model works only with English text, and any longer input is truncated to at most 512 tokens.
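
One common workaround for the truncation limit, which is not part of the GTE documentation, is to split a long input into overlapping windows, embed each window separately, and then combine or search over the window embeddings. The sketch below shows only the splitting step. The helper split_into_windows and the overlap of 64 are assumptions, and the token list is a placeholder for real tokenizer output. A BERT-style tokenizer also adds special tokens such as [CLS] and [SEP], so the usable window size in practice is slightly below 512.

```python
from typing import List


def split_into_windows(tokens: List[str],
                       max_len: int = 512,
                       overlap: int = 64) -> List[List[str]]:
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that text cut at a
    window boundary still appears whole in one of the windows.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    if not tokens:
        return []
    step = max_len - overlap
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the input
        start += step
    return windows


# Placeholder tokens standing in for the output of a real tokenizer.
tokens = [f"tok{i}" for i in range(1200)]
windows = split_into_windows(tokens)
print([len(window) for window in windows])  # [512, 512, 304]
```

Each window can then be passed to the model as its own input. How to combine the resulting embeddings (for example averaging them, or keeping the best-scoring window per document) depends on the application.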

Citation

If you find our paper or models helpful, please consider citing them as follows:

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}