contra-bottleneck-t5-xl-wikipedia: an open-source model for text encoding, reconstruction, semantic editing, and interpolation

Contra Bottleneck T5 XL Wikipedia

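Developed by thesephist

Bottleneck T5 is a text autoencoder: it encodes text into an embedding vector and reconstructs the original text from that vector. Its embedding space supports semantic editing and interpolation.

Text Embedding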

Transformers

Language: English

License: MIT

Tags: #text-autoencoding #latent-space-editing #semantic-interpolation

Downloads: 95

Released: September 30, 2023

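Model Overview

The model is a T5-based text autoencoder designed to encode text into an embedding vector and reconstruct the text from it. Its embedding space supports semantic editing and text interpolation, and it works best on encyclopedic text.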

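Model Features

Text autoencoding

Encodes text of up to 512 tokens into an embedding vector and reconstructs the original text from it.

Semantic editing

Edits the semantics of a text (for example its length, tone, structure, or topic) through vector arithmetic in the latent space.

Text interpolation

Interpolates semantically between text passages to generate transitional text.

Normalized embeddings

Embeddings are always normalized to length 1, which simplifies vector arithmetic and comparison.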

Model Capabilities

Text encoding

Text reconstruction

Semantic editing

Text interpolation

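Use Cases

Text processing

Semantic text editing

Edit attributes of a text, such as tone and length, by modifying its embedding in the latent space.

Produces text variants that are semantically similar but differ in the edited attribute.

Text interpolation

Interpolate semantically between two texts to generate transitional text.

Produces coherent intermediate texts that show a gradual semantic shift.

Latent space exploration

Latent space analysis

Analyze how texts are distributed and structured in the latent space.

Helps reveal how the model organizes and represents text semantics.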

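🚀 Bottleneck T5 ⏳

The Bottleneck T5 model powers many of my experiments and demos exploring interfaces for inspecting and editing text in latent space. The model is an autoencoder for text: it can encode text of up to 512 tokens into an embedding, then reconstruct the original text from that embedding. The structure of the embedding space also allows semantic edits to text through vector arithmetic in latent space.

🚀 Quick Start

The model is currently at a prototype stage, implemented on top of the T5 language model, so we need a small wrapper class around it to use it for embedding and generating text: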

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

class BottleneckT5Autoencoder:
    def __init__(self, model_path: str, device='cpu'):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_length=512)
        # trust_remote_code=True loads the custom model code shipped with the checkpoint
        self.model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(self.device)
        self.model.eval()

    @torch.no_grad()
    def embed(self, text: str) -> torch.FloatTensor:
        """Return the embedding of `text`."""
        inputs = self.tokenizer(text, return_tensors='pt').to(self.device)
        decoder_inputs = self.tokenizer('', return_tensors='pt').to(self.device)
        # encode_only is a keyword handled by the checkpoint's custom model code
        return self.model(
            **inputs,
            decoder_input_ids=decoder_inputs['input_ids'],
            encode_only=True,
        )[0]

    @torch.no_grad()
    def generate_from_latent(self, latent: torch.FloatTensor, max_length=512, temperature=1.0) -> str:
        """Sample a text from the embedding `latent`."""
        # Generate from a dummy input, passing the model the offset between
        # the target latent and the dummy's own embedding as perturb_vector
        dummy_text = '.'
        dummy = self.embed(dummy_text)
        perturb_vector = latent - dummy
        self.model.perturb_vector = perturb_vector
        input_ids = self.tokenizer(dummy_text, return_tensors='pt').to(self.device).input_ids
        output = self.model.generate(
            input_ids=input_ids,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            num_return_sequences=1,
        )
        return self.tokenizer.decode(output[0], skip_special_tokens=True)

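Then we can initialize the autoencoder class from a model checkpoint:

device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Note: this snippet loads the `large` checkpoint. For the XL model this page describes,
# the path would presumably be 'thesephist/contra-bottleneck-t5-xl-wikipedia'.
autoencoder = BottleneckT5Autoencoder(model_path='thesephist/contra-bottleneck-t5-large-wikipedia', device=device)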

Use .embed(text: str) and .generate_from_latent(embedding: torch.FloatTensor) to embed text and to reconstruct text from an embedding:

texts = [
    'The quick brown fox jumps over the lazy dog',
    'Hi there! My name is Linus, and I spend a lot of my time thinking about latent spaces of neural network models.',
    'Notion is a single space where you can think, write, and plan. Capture thoughts, manage projects, or even run an entire company — and do it exactly the way you want.',
]

for t in texts:
    embedding = autoencoder.embed(t)
    reconstruction = autoencoder.generate_from_latent(embedding)
    print(reconstruction)

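The code above prints output like the following. Because generate_from_latent samples with do_sample=True, results vary between runs, and reconstructions of longer inputs are approximate: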

The quick brown fox jumps over the lazy dog
I'm named after Linus, and I spend a lot of my time thinking about neural networks of latent space models.
Notion is a single place where you can think, plan, and spend time. Capture ideas, manage projects, and even do your own writing — or plan it exactly the way you want.
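
The reconstructions above are close to the inputs but not identical. One rough way to put a number on that is a word-level similarity score between each input and its reconstruction. The sketch below is an illustration only: the helper name `reconstruction_similarity` and the choice of `difflib.SequenceMatcher` are assumptions, not part of the model or its tooling. It uses the first two input/output pairs shown above.

```python
from difflib import SequenceMatcher


def reconstruction_similarity(original: str, reconstruction: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means identical word sequences.

    Hypothetical helper for illustration, not part of the model's code.
    """
    return SequenceMatcher(None, original.split(), reconstruction.split()).ratio()


# Input/reconstruction pairs copied from the example above
pairs = [
    (
        "The quick brown fox jumps over the lazy dog",
        "The quick brown fox jumps over the lazy dog",
    ),
    (
        "Hi there! My name is Linus, and I spend a lot of my time thinking "
        "about latent spaces of neural network models.",
        "I'm named after Linus, and I spend a lot of my time thinking "
        "about neural networks of latent space models.",
    ),
]

scores = [reconstruction_similarity(o, r) for o, r in pairs]
for score in scores:
    print(f"{score:.2f}")
```

A score well below 1.0 on your own text may be a sign that the input is long or falls outside the kind of text the model handles well.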

For more examples of how to use the model to compute interpolations and semantic edits with Contra, see this Google Colab notebook.
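
As a rough picture of what interpolating between two embeddings can look like, the sketch below blends two unit vectors linearly and rescales the result to length 1. This is an assumption about one reasonable approach, not necessarily what the notebook does. The helper names `normalize` and `interpolate` are hypothetical, and the 3-dimensional toy vectors stand in for real embeddings. Plain Python lists keep the example dependency-free.

```python
import math


def normalize(v):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def interpolate(a, b, t):
    """Blend a and b linearly at fraction t, then rescale to unit length.

    Hypothetical helper: t=0 returns a's direction, t=1 returns b's.
    """
    mixed = [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    return normalize(mixed)


# Toy stand-ins for two text embeddings
a = normalize([1.0, 0.0, 0.0])
b = normalize([0.0, 1.0, 0.0])

steps = [interpolate(a, b, t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
for step in steps:
    print([round(x, 3) for x in step])
```

With the wrapper class above, the two endpoints would come from `autoencoder.embed(...)`, and each intermediate vector would be passed to `autoencoder.generate_from_latent(...)`. The same arithmetic can be written directly on torch tensors.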

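✨ Key Features

Semantic editing: using the embeddings this model produces, we can interpolate semantically between pieces of text and edit sentences along their latent attributes, such as length, tone, structure, or topic.
Normalization: Bottleneck T5 embeddings are always normalized to length 1. The encoder produces embeddings of length 1, and any input to the decoder is also normalized to length 1.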
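
One common way to edit along a latent attribute is to estimate a direction for that attribute and move an embedding along it. The sketch below takes the direction to be the difference between the mean embedding of texts that have the attribute and the mean of texts that do not. That recipe is an assumption for illustration; the source only states that edits are made with vector arithmetic. The names `attribute_direction` and `apply_edit` are hypothetical, and the toy vectors stand in for real embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def attribute_direction(with_attr, without_attr):
    """Difference of means: points from 'without' examples toward 'with' examples."""
    pos = mean_vector(with_attr)
    neg = mean_vector(without_attr)
    return [p - q for p, q in zip(pos, neg)]


def apply_edit(embedding, direction, strength):
    """Move the embedding along the direction, then rescale to unit length."""
    return normalize([e + strength * d for e, d in zip(embedding, direction)])


# Toy stand-ins for embeddings of texts with and without some attribute
with_attr = [normalize([1.0, 1.0, 0.0]), normalize([0.0, 1.0, 1.0])]
without_attr = [normalize([1.0, 0.0, 0.0]), normalize([0.0, 0.0, 1.0])]

direction = attribute_direction(with_attr, without_attr)
embedding = normalize([1.0, 0.0, 0.0])
edited = apply_edit(embedding, direction, strength=1.0)

print(dot(embedding, direction) < dot(edited, direction))
```

The `strength` parameter controls how far the embedding moves. Since the decoder normalizes its input anyway, rescaling after the edit mainly keeps the vectors comparable with each other.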

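📚 Documentation

Model Details

Using the embeddings this model produces, we can interpolate semantically between pieces of text and edit sentences along their latent attributes, such as length, tone, structure, or topic.

All Bottleneck T5 models are trained on a filtered subset of the English Wikipedia and perform best at encoding and decoding encyclopedic and other similar kinds of text. Highly technical, conversational, or otherwise unconventional text may be out of distribution for the model, and the model may not perform as well on such inputs.

Bottleneck T5 embeddings are always normalized to length 1. The encoder produces embeddings of length 1, and any input to the decoder is also normalized to length 1.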
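
A practical consequence of unit-length embeddings is that the dot product of two embeddings equals their cosine similarity, so comparisons need no extra division by the norms. The sketch below checks this on toy 2-dimensional vectors. The helper names are hypothetical and the numbers are illustrative, not model outputs.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def normalize(v):
    """Scale a vector to unit (L2) length."""
    norm = l2_norm(v)
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity for vectors of any nonzero length."""
    return dot(a, b) / (l2_norm(a) * l2_norm(b))


# Toy vectors that are not unit length
a = [3.0, 4.0]
b = [4.0, 3.0]

unit_a = normalize(a)
unit_b = normalize(b)

# After normalization, a plain dot product matches the cosine similarity
print(abs(dot(unit_a, unit_b) - cosine_similarity(a, b)) < 1e-12)
```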

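Property	Details
Developed by	Linus Lee
Model type	T5-style encoder-decoder transformer with an attention-pooled bottleneck and gated cross-attention
Language(s) (NLP)	English
License	MIT
Finetuned from model	LM-adapted T5 v1.1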