roberta-large-InBedder: Open-Source Text Embedder - Captures Instruction-Specified Text Features by Answering Questions

Roberta Large InBedder

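Developed by BrandonZYW

InBedder is a text embedder designed to follow instructions. It captures the text features specified by a user instruction by answering that instruction as a question.

Text Embedding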

Transformers

Language: English | License: MIT | #Instruction-aware embedding #Dynamic text representation #Question-answering encoding

Downloads: 17

Released: 2/15/2024

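Model Overview

InBedder treats the instruction as a question about the input text and encodes the expected answer to obtain the representation. This makes its embeddings instruction-aware across different evaluation tasks.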

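Model Features

Instruction following

Understands the instruction supplied by the user and extracts the text features that the instruction specifies

Question-answering-style embedding

Turns the instruction into a question and obtains the text representation by encoding the expected answer

Multi-task adaptability

Recognizes and adapts to the instructions used in different evaluation tasks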

Model Capabilities

Instruction-aware text embedding

Semantic similarity computation

Sentiment analysis

Entity recognition

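Use Cases

Semantic analysis

Animal identification

Identifies the animals mentioned in a text

Can accurately distinguish texts that mention different animals

Sentiment analysis

Identifies the emotion expressed in a text

Can distinguish texts with different sentiment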
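
As a rough illustration of how the instruction changes which texts count as similar, the sketch below takes the cosine similarities reported in the usage example of this document (each measured against "I love cat!") and picks the closest text under each instruction. The `nearest` helper and the dictionary layout are illustrative assumptions, not part of InBedder.

```python
# Cosine similarities to "I love cat!", copied from the usage example in this document.
reported = {
    "What is the animal mentioned here?": {"I love dog!": 0.9374, "I dislike cat!": 0.9917},
    "What is emotion expressed here?": {"I love dog!": 0.9859, "I dislike cat!": 0.8537},
}


def nearest(scores):
    """Hypothetical helper: return the candidate text with the highest similarity."""
    return max(scores, key=scores.get)


for instruction, scores in reported.items():
    print(instruction, "->", nearest(scores))
# What is the animal mentioned here? -> I dislike cat!
# What is emotion expressed here? -> I love dog!
```

Under the animal question the nearest text shares the animal, while under the emotion question the nearest text shares the sentiment.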

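🚀 [ACL2024] Answer is All You Need: Instruction-following Text Embedding via Answering the Question

InBedder🛌 is a text embedder designed to follow instructions. An instruction-following text embedder captures the characteristics of a text that the user instruction specifies. InBedder offers a novel viewpoint: it treats the instruction as a question about the input text and encodes the expected answer to obtain the corresponding representation. We show that InBedder is aware of instructions across different evaluation tasks.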
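
To make the "instruction as a question" idea concrete, here is a minimal sketch of how a prompt with answer slots can be built, mirroring the string handling in the usage example of this document. `build_prompts` is a hypothetical helper name, and the hard-coded `"<mask>"` is RoBERTa's mask token; in real use it would be read from `tokenizer.mask_token`.

```python
# RoBERTa's mask token; in real use, read it from tokenizer.mask_token instead.
MASK_TOKEN = "<mask>"


def build_prompts(instruction, n_texts, n_mask, mask_token=MASK_TOKEN):
    """Hypothetical helper: one question-plus-answer-slots prompt per input text."""
    if isinstance(instruction, str):
        # A single instruction is shared by every input text.
        instructions = [instruction] * n_texts
    else:
        # Otherwise expect one instruction per input text.
        instructions = list(instruction)
        if len(instructions) != n_texts:
            raise ValueError("need exactly one instruction per input text")
    # The n_mask mask tokens are the slots where the expected answer is encoded.
    return [inst + mask_token * n_mask for inst in instructions]


prompts = build_prompts("What is the animal mentioned here?", n_texts=2, n_mask=3)
print(prompts[0])
# What is the animal mentioned here?<mask><mask><mask>
```

Each prompt is then paired with its input text, and the hidden states at the mask positions become the embedding.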

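🚀 Quick Start

InBedder is an instruction-following text embedder that captures text features according to the user instruction. The example below shows how to embed texts with InBedder and compare them by cosine similarity.

💻 Usage Example

Basic Usage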

import torch
from torch.nn.functional import gelu, cosine_similarity
from transformers import AutoTokenizer, AutoModelForMaskedLM

class InBedder():
    
    def __init__(self, path='KomeijiForce/inbedder-roberta-large', device='cuda:0'):
        
        model = AutoModelForMaskedLM.from_pretrained(path)
    
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        self.model = model.roberta
        self.dense = model.lm_head.dense
        self.layer_norm = model.lm_head.layer_norm
        
        self.device = torch.device(device)
        self.model = self.model.to(self.device)
        self.dense = self.dense.to(self.device)
        self.layer_norm = self.layer_norm.to(self.device)
        
        self.vocab = self.tokenizer.get_vocab()
        self.vocab = {self.vocab[key]:key for key in self.vocab}
        
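    def encode(self, input_texts, instruction, n_mask):
        # Append n_mask mask tokens to the instruction; they act as the answer slots.
        if isinstance(instruction, str):
            # One shared instruction for every input text.
            prompts = [instruction + self.tokenizer.mask_token * n_mask for _ in input_texts]
        elif isinstance(instruction, list):
            # One instruction per input text.
            prompts = [inst + self.tokenizer.mask_token * n_mask for inst in instruction]
        else:
            raise TypeError("instruction must be a str or a list of str")

        # Encode each (text, prompt) pair as a single sequence.
        inputs = self.tokenizer(input_texts, prompts, padding=True, truncation=True, return_tensors='pt').to(self.device)

        # Boolean mask that marks the answer-slot positions.
        mask = inputs.input_ids.eq(self.tokenizer.mask_token_id)

        outputs = self.model(**inputs)

        # Hidden states at the mask positions, passed through the LM head's dense layer and layer norm.
        logits = outputs.last_hidden_state[mask]
        logits = self.layer_norm(gelu(self.dense(logits)))

        # Average over the n_mask slots to get one vector per text.
        logits = logits.reshape(len(input_texts), n_mask, -1)
        logits = logits.mean(1)

        # Standardize each vector to zero mean and unit standard deviation.
        logits = (logits - logits.mean(1, keepdim=True)) / logits.std(1, keepdim=True)

        return logits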

inbedder = InBedder(path='KomeijiForce/inbedder-roberta-large', device='cpu')

texts = ["I love cat!", "I love dog!", "I dislike cat!"]
instruction = "What is the animal mentioned here?"
embeddings = inbedder.encode(texts, instruction, 3)

# Similarity of "I love cat!" to each of the other two texts.
cosine_similarity(embeddings[:1], embeddings[1:], dim=1)
# tensor([0.9374, 0.9917], grad_fn=<SumBackward1>)

texts = ["I love cat!", "I love dog!", "I dislike cat!"]
instruction = "What is emotion expressed here?"
embeddings = inbedder.encode(texts, instruction, 3)

# Same comparison, now under the emotion instruction.
cosine_similarity(embeddings[:1], embeddings[1:], dim=1)
# tensor([0.9859, 0.8537], grad_fn=<SumBackward1>)
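
The last steps of `encode` average the vectors at the mask positions and then standardize each embedding before it is compared by cosine similarity. The following is a minimal pure-Python sketch of that arithmetic on toy numbers. `mean_pool`, `standardize`, and `cosine` are hypothetical helper names, and the vectors are made up for illustration; they are not model outputs.

```python
import math
import statistics


def mean_pool(slot_vectors):
    """Average the vectors of the mask positions, dimension by dimension."""
    n = len(slot_vectors)
    return [sum(column) / n for column in zip(*slot_vectors)]


def standardize(vec):
    """Shift to zero mean and scale to unit sample standard deviation.

    statistics.stdev divides by N - 1, which matches the default of torch.std.
    """
    mu = statistics.fmean(vec)
    sigma = statistics.stdev(vec)
    return [(x - mu) / sigma for x in vec]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


# Toy data: two texts, two mask slots each, four dimensions (illustrative only).
slots_a = [[1.0, 2.0, 3.0, 6.0], [3.0, 2.0, 1.0, 2.0]]
slots_b = [[2.0, 4.0, 6.0, 12.0], [6.0, 4.0, 2.0, 4.0]]  # slots_a scaled by 2

emb_a = standardize(mean_pool(slots_a))
emb_b = standardize(mean_pool(slots_b))
print(round(cosine(emb_a, emb_b), 6))
# 1.0
```

Because standardization removes each vector's offset and scale, the two toy embeddings coincide even though the raw slot vectors differ by a factor of two.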