NoInstruct small Embedding v0 released as open source: a free model for improving retrieval performance

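NoInstruct small Embedding v0

Developed by avsolatorio

NoInstruct small Embedding v0 is an improved embedding model that focuses on retrieval performance while staying independent of any instruction used for encoding.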

Text embedding

Transformers

Language: English | License: MIT | #asymmetric-pooling #retrieval-optimization #instruction-free

Downloads: 90.76k

Released: 5/1/2024

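Model overview

The model improves retrieval performance through an asymmetric pooling strategy: queries are embedded with mean pooling, while sentences and documents are embedded with the [CLS] representation. It retrieves better than GIST-small-Embedding-v0.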
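The two pooling modes described above can be illustrated without any deep learning framework. The following is a minimal sketch on hand-written toy values, not real model output; the helper names `mean_pool` and `cls_pool` are hypothetical and are not part of the model's API.

```python
from typing import List

Matrix = List[List[float]]  # one row per token, one column per hidden dimension


def mean_pool(token_vectors: Matrix, attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask value is 1 (padding is ignored)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cls_pool(token_vectors: Matrix) -> List[float]:
    """Return the vector at position 0, where BERT-style tokenizers place [CLS]."""
    return list(token_vectors[0])


# Toy hidden states for one sequence: [CLS], two real tokens, one padding token.
hidden = [
    [1.0, 0.0],  # [CLS]
    [3.0, 2.0],  # token 1
    [2.0, 4.0],  # token 2
    [9.0, 9.0],  # padding, masked out
]
mask = [1, 1, 1, 0]

query_vector = mean_pool(hidden, mask)  # query side: masked mean
document_vector = cls_pool(hidden)      # sentence/document side: [CLS]
print(query_vector)     # [2.0, 2.0]
print(document_vector)  # [1.0, 0.0]
```

The same input produces two different vectors depending on which side of the retrieval pair it is encoded for, which is what makes the pooling asymmetric.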

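Model features

Asymmetric pooling strategy

Queries use mean pooling, while sentence and document embeddings use the [CLS] representation, so each side of a retrieval pair gets the pooling that suits it.

Instruction independence

Encoding does not depend on any arbitrary instruction, unlike the instruction-based paradigm that is popular among current retrieval embedding models.

Retrieval performance optimization

Performs better on retrieval tasks than the GIST-small-Embedding-v0 model.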

Model capabilities

Text embedding generation

Semantic similarity computation

Information retrieval

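Use cases

Information retrieval

Document retrieval

Retrieve relevant content from a large document collection given a query.

Achieves higher retrieval accuracy than GIST-small-Embedding-v0.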

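Semantic similarity computation

Compute the semantic similarity between texts.

The asymmetric pooling strategy yields more accurate similarity scores.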
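The document retrieval use case above comes down to scoring each document vector against a query vector and keeping the best matches. The following is a minimal sketch with toy two-dimensional vectors standing in for real embeddings; `cosine_similarity` and `rank_documents` are hypothetical helper names introduced only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, highest score first."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors; in practice these would be query-mode and sentence-mode embeddings.
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vector = [0.9, 0.1]

ranking = rank_documents(query_vector, doc_vectors, top_k=2)
print([index for index, _ in ranking])  # [0, 2]
```

Document 0 points in almost the same direction as the query, so it ranks first; document 2 shares only part of that direction and comes second.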

🚀 NoInstruct small Embedding v0

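NoInstruct Embedding: Asymmetric Pooling is All You Need

Compared with the avsolatorio/GIST-small-Embedding-v0 model, this model improves retrieval performance.

The GIST family of models falls short on retrieval tasks. We propose a method that improves retrieval performance when encoding queries while remaining independent of the arbitrary instructions crafted for retrieval embedding models, which is a popular paradigm among current embedding models.

Technical details of the model will be published soon.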

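🚀 Quick Start

Dependencies

This project depends on the transformers and torch libraries, which can be installed with:

pip install transformers torch

Running the code

The following example shows how to use the model: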

from typing import Union
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("avsolatorio/NoInstruct-small-Embedding-v0")
tokenizer = AutoTokenizer.from_pretrained("avsolatorio/NoInstruct-small-Embedding-v0")


def get_embedding(text: Union[str, list[str]], mode: str = "sentence"):
    model.eval()

    assert mode in ("query", "sentence"), f"mode={mode} was passed but only `query` and `sentence` are the supported modes."

    if isinstance(text, str):
        text = [text]

    inp = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

    with torch.no_grad():
        output = model(**inp)

    # The model is optimized to use the mean pooling for queries,
    # while the sentence / document embedding uses the [CLS] representation.

    if mode == "query":
        vectors = output.last_hidden_state * inp["attention_mask"].unsqueeze(2)
        vectors = vectors.sum(dim=1) / inp["attention_mask"].sum(dim=-1).view(-1, 1)
    else:
        vectors = output.last_hidden_state[:, 0, :]

    return vectors


texts = [
    "Illustration of the REaLTabFormer model. The left block shows the non-relational tabular data model using GPT-2 with a causal LM head. In contrast, the right block shows how a relational dataset's child table is modeled using a sequence-to-sequence (Seq2Seq) model. The Seq2Seq model uses the observations in the parent table to condition the generation of the observations in the child table. The trained GPT-2 model on the parent table, with weights frozen, is also used as the encoder in the Seq2Seq model.",
    "Predicting human mobility holds significant practical value, with applications ranging from enhancing disaster risk planning to simulating epidemic spread. In this paper, we present the GeoFormer, a decoder-only transformer model adapted from the GPT architecture to forecast human mobility.",
    "As the economies of Southeast Asia continue adopting digital technologies, policy makers increasingly ask how to prepare the workforce for emerging labor demands. However, little is known about the skills that workers need to adapt to these changes"
]

# Compute embeddings
embeddings = get_embedding(texts, mode="sentence")

# Compute cosine-similarity for each pair of sentences
scores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
print(scores.cpu().numpy())

# Test the retrieval performance.
query = get_embedding("Which sentence talks about concept on jobs?", mode="query")

scores = F.cosine_similarity(query, embeddings, dim=-1)
print(scores.cpu().numpy())
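The pairwise score matrix printed by the script above compares every embedding with every other embedding. Cosine similarity equals the dot product of L2-normalized vectors, so the same matrix can be sketched in plain Python. The vectors below are toy values rather than model output, and `l2_normalize` and `similarity_matrix` are hypothetical helper names.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)


def similarity_matrix(vectors: List[List[float]]) -> List[List[float]]:
    """Entry [i][j] is the cosine similarity between vectors i and j."""
    unit = [l2_normalize(v) for v in vectors]
    return [[sum(x * y for x, y in zip(a, b)) for b in unit] for a in unit]


toy_embeddings = [[3.0, 4.0], [4.0, 3.0], [-3.0, -4.0]]
matrix = similarity_matrix(toy_embeddings)
for row in matrix:
    print([round(score, 2) for score in row])
```

The diagonal is 1 up to floating-point rounding because each vector is compared with itself, and the matrix is symmetric because cosine similarity does not depend on argument order.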

Upcoming support

Support for the Sentence Transformers library is planned.

💻 Usage Examples

Basic usage

The Quick Start script above covers basic usage: it obtains text embeddings with get_embedding and computes the cosine similarity between texts.

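Advanced usage

You can adapt the arguments passed to get_embedding to suit different scenarios. For example, set the mode parameter to choose between a query embedding and a sentence embedding:

# Obtain a query embedding
query_embedding = get_embedding("This is an example query", mode="query")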
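When a corpus is too large to encode in one call, the texts can be split into batches before they are embedded. The following is a minimal sketch, assuming an embedding callable that maps a list of strings to a list of vectors; `iter_batches`, `embed_corpus`, and the stand-in `fake_embed` are hypothetical names that do not come from the model card.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_corpus(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[Vector]],
    batch_size: int = 32,
) -> List[Vector]:
    """Embed texts batch by batch and return the vectors in input order."""
    vectors: List[Vector] = []
    for batch in iter_batches(texts, batch_size):
        vectors.extend(embed_fn(list(batch)))
    return vectors


def fake_embed(batch: List[str]) -> List[Vector]:
    """Stand-in embedder so the sketch runs without downloading the model."""
    return [[float(len(text)), float(text.count(" "))] for text in batch]


corpus = ["a b", "cc", "d e f", "gggg", "h"]
vectors = embed_corpus(corpus, fake_embed, batch_size=2)
print(len(vectors))  # 5
```

With the real model, `embed_fn` could be something like `lambda batch: get_embedding(batch, mode="sentence").tolist()`, which keeps memory use bounded by the batch size rather than the corpus size.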

📚 Detailed Documentation

Model performance

The model was evaluated on multiple datasets. Metrics for some of the tasks and datasets are listed below:

Task type	Dataset	Accuracy	Average precision	F1
Classification	MTEB AmazonCounterfactualClassification (en)	75.76119402985074	39.03628777559392	69.85860402259618
Classification	MTEB AmazonPolarityClassification	93.29920000000001	90.03479490717608	93.28554395248467
Classification	MTEB AmazonReviewsClassification (en)	49.98799999999999	-	49.46151232451642
...	...	...	...	...