NoInstruct small Embedding v0: a free, open-source embedding model for better retrieval performance

Noinstruct Small Embedding V0

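Developed by avsolatorio

NoInstruct small Embedding v0 is an improved embedding model that targets better retrieval performance without requiring hand-crafted instructions when encoding a query.

Text embedding

Transformers

Language: English
License: MIT
Tags: #asymmetric-pooling #retrieval-optimization #instruction-free

Downloads: 90.76k

Released: 5/1/2024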

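Model overview

The model improves retrieval performance through an asymmetric pooling strategy: queries are embedded with mean pooling, while sentence/document embeddings use the [CLS] representation. It retrieves better than GIST-small-Embedding-v0.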
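
To make the two pooling rules concrete, here is a minimal, dependency-free sketch on plain Python lists. It is an illustration only, not the model's implementation: the helper names `mean_pool`, `cls_pool`, and `pool` are hypothetical, and the toy "hidden states" stand in for the transformer's per-token outputs.

```python
from typing import List

Vector = List[float]


def mean_pool(token_vectors: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the vectors of real tokens (mask == 1), skipping padding."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cls_pool(token_vectors: List[Vector]) -> Vector:
    """Return the vector at the first position, i.e. the [CLS] token."""
    return list(token_vectors[0])


def pool(token_vectors: List[Vector], attention_mask: List[int], mode: str = "sentence") -> Vector:
    """Asymmetric pooling: mean pooling for queries, [CLS] for sentences/documents."""
    if mode not in ("query", "sentence"):
        raise ValueError(f"unsupported mode: {mode}")
    if mode == "query":
        return mean_pool(token_vectors, attention_mask)
    return cls_pool(token_vectors)


# Toy hidden states (assumed values): 3 token positions, 2 dimensions.
# The last position is padding, so its mask entry is 0.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

query_vec = pool(hidden, mask, mode="query")    # mean of the first two rows: [2.0, 3.0]
doc_vec = pool(hidden, mask, mode="sentence")   # first row only: [1.0, 2.0]
```

The same input therefore yields different vectors depending on whether it is treated as a query or as a document, which is the asymmetry the model relies on.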

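Model features

Asymmetric pooling strategy

Queries use mean pooling, while sentence/document embeddings use the [CLS] representation, so each side of a retrieval pair gets the pooling suited to it.

No instruction dependence

Queries are encoded without hand-crafted instructions, unlike the instruction-based paradigm that is popular among current embedding models for retrieval.

Optimized retrieval performance

Performs better on retrieval tasks than the GIST-small-Embedding-v0 model.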

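Model capabilities

Text embedding generation

Semantic similarity computation

Information retrieval

Use cases

Information retrieval

Document retrieval

Retrieve relevant content from a large collection of documents given a query.

Improved retrieval performance compared with GIST-small-Embedding-v0.

Semantic similarity computation

Compute the semantic similarity between different texts.

The asymmetric pooling strategy is intended to yield more accurate similarity scores.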
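
Similarity between two embeddings is typically scored with cosine similarity, the dot product of the vectors divided by the product of their lengths. The sketch below spells that out in plain Python; the function name `cosine_similarity` and the example vectors are illustrative assumptions, not outputs of the model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score close to 1, regardless of length.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors score 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because the score ignores vector length, it compares only direction, which is why it is a common choice for ranking embeddings.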

🚀 NoInstruct small Embedding v0

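NoInstruct Embedding: Asymmetric Pooling is All You Need

This model improves on the retrieval performance of the avsolatorio/GIST-small-Embedding-v0 model.

The GIST family of models has shown lackluster performance on retrieval tasks. We propose a method that improves retrieval performance while remaining independent of crafting arbitrary instructions when encoding a query. Crafting such instructions is a popular paradigm in current embedding models for retrieval tasks.

Technical details of the model will be published shortly.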

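🚀 Quick start

Requirements

The example depends on the transformers and torch libraries, which you can install with:

pip install transformers torch

Running the code

The following example shows how to use the model: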

from typing import Union
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("avsolatorio/NoInstruct-small-Embedding-v0")
tokenizer = AutoTokenizer.from_pretrained("avsolatorio/NoInstruct-small-Embedding-v0")


def get_embedding(text: Union[str, list[str]], mode: str = "sentence"):
    model.eval()

    assert mode in ("query", "sentence"), f"mode={mode} was passed but only `query` and `sentence` are the supported modes."

    if isinstance(text, str):
        text = [text]

    inp = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

    with torch.no_grad():
        output = model(**inp)

    # The model is optimized to use the mean pooling for queries,
    # while the sentence / document embedding uses the [CLS] representation.

    if mode == "query":
        vectors = output.last_hidden_state * inp["attention_mask"].unsqueeze(2)
        vectors = vectors.sum(dim=1) / inp["attention_mask"].sum(dim=-1).view(-1, 1)
    else:
        vectors = output.last_hidden_state[:, 0, :]

    return vectors


texts = [
    "Illustration of the REaLTabFormer model. The left block shows the non-relational tabular data model using GPT-2 with a causal LM head. In contrast, the right block shows how a relational dataset's child table is modeled using a sequence-to-sequence (Seq2Seq) model. The Seq2Seq model uses the observations in the parent table to condition the generation of the observations in the child table. The trained GPT-2 model on the parent table, with weights frozen, is also used as the encoder in the Seq2Seq model.",
    "Predicting human mobility holds significant practical value, with applications ranging from enhancing disaster risk planning to simulating epidemic spread. In this paper, we present the GeoFormer, a decoder-only transformer model adapted from the GPT architecture to forecast human mobility.",
    "As the economies of Southeast Asia continue adopting digital technologies, policy makers increasingly ask how to prepare the workforce for emerging labor demands. However, little is known about the skills that workers need to adapt to these changes"
]

# Compute embeddings
embeddings = get_embedding(texts, mode="sentence")

# Compute cosine-similarity for each pair of sentences
scores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
print(scores.cpu().numpy())

# Test the retrieval performance.
query = get_embedding("Which sentence talks about concept on jobs?", mode="query")

scores = F.cosine_similarity(query, embeddings, dim=-1)
print(scores.cpu().numpy())

Upcoming support

Support for the Sentence Transformers library is planned.

💻 Usage examples

Basic usage

The basic usage is the script shown under Quick start: it obtains text embeddings with get_embedding and computes the cosine similarity between texts.

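Advanced usage

You can adjust the arguments of get_embedding to fit your use case. For example, the mode parameter selects whether a query embedding or a sentence embedding is returned:

# Get a query embedding (mean pooling)
query_embedding = get_embedding("This is an example query", mode="query")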
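
The `mode` argument matters most in a retrieval loop, where documents and the query must be embedded differently. The following is a hedged sketch of such a loop. `search`, `EmbedFn`, and `toy_embed` are hypothetical names introduced here; `toy_embed` is a keyword-counting stand-in so the example runs without downloading the model, and it says nothing about the real model's scores.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
# Assumed signature: (texts, mode) -> one vector per text.
EmbedFn = Callable[[List[str], str], List[Vector]]


def _cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def search(query: str, documents: List[str], embed_fn: EmbedFn, top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank documents against a query, embedding each side with its own mode."""
    doc_vectors = embed_fn(documents, "sentence")   # [CLS] pooling side
    query_vector = embed_fn([query], "query")[0]    # mean pooling side
    scored = [(doc, _cosine(query_vector, vec)) for doc, vec in zip(documents, doc_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_embed(texts: List[str], mode: str) -> List[Vector]:
    """Hypothetical stand-in for get_embedding: counts two keywords per text."""
    return [
        [t.lower().count("job") + 0.1, t.lower().count("model") + 0.1]
        for t in texts
    ]


docs = ["A model of human mobility.", "Skills workers need for new jobs."]
results = search("Which text is about jobs?", docs, toy_embed, top_k=1)
top_document = results[0][0]  # the document mentioning jobs
```

To run this against the real model, one option is to adapt the document's function with a small wrapper such as `lambda texts, mode: get_embedding(texts, mode=mode).tolist()`, which converts the returned tensor into nested lists.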

📚 Detailed documentation

Model performance

The model was evaluated on multiple datasets. Metrics for some of the tasks and datasets are listed below:

Task type	Dataset	Accuracy	Average precision (AP)	F1
Classification	MTEB AmazonCounterfactualClassification (en)	75.76119402985074	39.03628777559392	69.85860402259618
Classification	MTEB AmazonPolarityClassification	93.29920000000001	90.03479490717608	93.28554395248467
Classification	MTEB AmazonReviewsClassification (en)	49.98799999999999	-	49.46151232451642
...	...	...	...	...