contra-bottleneck-t5-xl-wikipedia: an open-source model for encoding and reconstructing text, with semantic editing and interpolation

Contra Bottleneck T5 Xl Wikipedia

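Developed by thesephist

Bottleneck T5 is a text autoencoder: it encodes text into an embedding vector, reconstructs the original text from that vector, and supports semantic editing and interpolation.

Text embedding

Transformers

Language: English. Open-source license: MIT. Tags: #text-autoencoding #latent-space-editing #semantic-interpolation

Downloads: 95

Released: 9/30/2023

Model overview

This model is a T5-based text autoencoder designed specifically to encode text into embedding vectors and reconstruct it. The resulting embedding space supports semantic editing and text interpolation, and the model is best suited to encyclopedia-style text.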

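Model features

Text autoencoding

Encodes text of up to 512 tokens into an embedding vector and reconstructs the original text from it.

Semantic editing

Vector arithmetic in the latent space edits attributes of a text such as its length, tone, structure, or topic.

Text interpolation

Interpolates semantically between text passages to generate transitional text.

Normalized embeddings

Embedding vectors are always normalized to length 1, which simplifies vector arithmetic and comparison.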

Model capabilities

Text encoding

Text reconstruction

Semantic editing

Text interpolation

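Use cases

Text processing

Semantic text editing

Modify the embedding vector in latent space to edit attributes of a text such as tone or length.

Produces text variants that are semantically similar but differ in the targeted attribute.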
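
One common way to edit an attribute with vector arithmetic is to estimate a direction for it from example embeddings and shift a text's embedding along that direction. The sketch below illustrates the idea on toy 2-D lists so that it runs without downloading a model. The helper names (`attribute_direction`, `apply_edit`), the difference-of-means recipe, and the toy data are assumptions for illustration, not necessarily how the author's demos compute edits.

```python
import math

def normalize(vec):
    """Scale a vector to Euclidean length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def attribute_direction(with_attr, without_attr):
    """Difference of means: points from the 'without' examples toward the 'with' examples."""
    return [p - q for p, q in zip(mean(with_attr), mean(without_attr))]

def apply_edit(embedding, direction, strength):
    """Shift an embedding along a direction, then rescale to unit length."""
    shifted = [e + strength * d for e, d in zip(embedding, direction)]
    return normalize(shifted)

# Toy data (hypothetical): pretend the second axis encodes a formal tone.
formal = [normalize([1.0, 1.0]), normalize([0.5, 1.0])]
casual = [normalize([1.0, 0.0]), normalize([1.0, 0.2])]
direction = attribute_direction(formal, casual)

original = normalize([1.0, 0.1])
edited = apply_edit(original, direction, strength=0.5)
```

With real embeddings the same arithmetic would be applied to the tensors returned by the model, and the edited vector would be decoded back into text.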

Text interpolation

Interpolate semantically between two texts to generate transitional text.

Produces coherent intermediate texts that show the gradual shift in meaning.
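
Interpolation between two texts can be sketched as blending their embeddings and rescaling the result to unit length. The example below uses linear blending on toy 2-D lists. The `interpolate` helper and the choice of linear rather than spherical blending are assumptions for illustration; the source does not state which scheme the author uses.

```python
import math

def normalize(vec):
    """Scale a vector to Euclidean length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def interpolate(a, b, t):
    """Blend embeddings a and b (t=0 gives a, t=1 gives b), rescaled to unit length."""
    mixed = [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    return normalize(mixed)

# Toy stand-ins for the embeddings of two texts.
start = normalize([1.0, 0.0])
end = normalize([0.0, 1.0])

# Five evenly spaced points from `start` to `end`.
path = [interpolate(start, end, t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

Each point on `path` stands for one intermediate embedding; decoding the points in order would yield the transitional texts.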

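Latent space exploration

Latent space analysis

Analyze how texts are distributed and structured in the latent space.

Helps reveal how the model organizes and represents the meaning of text.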

🚀 Bottleneck T5 ⏳

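The Bottleneck T5 model powers many experiments and demos that explore interfaces for inspecting and editing text in latent space. It is an autoencoder for text: it encodes text of up to 512 tokens into an embedding vector and then reconstructs the original text from that embedding. The structure of the embedding space the model produces also makes it possible to apply semantic edits to text through vector arithmetic in latent space.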
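
Text longer than 512 tokens has to be shortened or split before it can be embedded. A minimal sketch of splitting a list of token ids into chunks that fit the limit follows. The `chunk_tokens` helper and the `reserve=1` default (leaving room for one end-of-sequence token) are assumptions for illustration, not part of the model's code.

```python
def chunk_tokens(token_ids, max_len=512, reserve=1):
    """Split token ids into consecutive chunks of at most max_len - reserve ids.

    `reserve` leaves room for special tokens the tokenizer may append
    (assumed here to be one end-of-sequence token).
    """
    size = max_len - reserve
    if size <= 0:
        raise ValueError("max_len must be larger than reserve")
    return [token_ids[i:i + size] for i in range(0, len(token_ids), size)]

# Toy input: 1200 fake token ids standing in for a long tokenized document.
ids = list(range(1200))
chunks = chunk_tokens(ids)
```

Each chunk would then be embedded separately, giving one embedding per chunk rather than one for the whole document.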

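✨ Key features

With the embeddings this model produces, you can interpolate semantically between texts and edit sentences along latent attributes such as length, tone, structure, or topic.

All Bottleneck T5 models are trained on a filtered subset of English Wikipedia and perform best at encoding and decoding encyclopedic and similar text. Highly technical text, conversational text, and other unconventional text may be out of distribution for the model, and performance may degrade on such inputs.

Bottleneck T5 embeddings are always normalized to length 1: the encoder produces embeddings of length 1, and the input to the decoder is likewise normalized to length 1.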
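
One practical consequence of unit-length embeddings is that the dot product of two embeddings equals their cosine similarity, so no extra division by the norms is needed when comparing texts. A minimal sketch on toy 3-D lists follows; the helper names `normalize` and `dot` are illustrative and not part of the model's code.

```python
import math

def normalize(vec):
    """Scale a vector to Euclidean length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy 3-dimensional stand-ins for real embeddings (which have 512 to 2048 dimensions).
a = normalize([3.0, 4.0, 0.0])
b = normalize([4.0, 3.0, 0.0])

length_a = math.sqrt(dot(a, a))  # length after normalization
cosine_ab = dot(a, b)            # for unit vectors, this is the cosine similarity
```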

Property	Details
Developed by	Linus Lee
Model Type	T5-style encoder-decoder transformer with an attention pooled bottleneck and gated cross-attention
Language(s) (NLP)	English
License	MIT
Finetuned from model	LM-adapted T5 v1.1

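📦 Installation

The model is currently at the prototype stage, implemented on top of the T5 language model, so using it to produce embeddings or generate text requires a small wrapper class.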

import os
import torch
import torch.nn as nn
import torch.nn.functional as F

from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM

class BottleneckT5Autoencoder:
    def __init__(self, model_path: str, device='cpu'):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_length=512)
        # trust_remote_code=True loads the custom model code shipped with the checkpoint.
        self.model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(self.device)
        self.model.eval()

    @torch.no_grad()
    def embed(self, text: str) -> torch.FloatTensor:
        # Encode `text` into its embedding vector.
        inputs = self.tokenizer(text, return_tensors='pt').to(self.device)
        decoder_inputs = self.tokenizer('', return_tensors='pt').to(self.device)
        # encode_only is a flag handled by the checkpoint's custom code.
        return self.model(
            **inputs,
            decoder_input_ids=decoder_inputs['input_ids'],
            encode_only=True,
        )[0]

    @torch.no_grad()
    def generate_from_latent(self, latent: torch.FloatTensor, max_length=512, temperature=1.0) -> str:
        # Decode by embedding a dummy text and passing the model the offset
        # from that dummy embedding to the target `latent`.
        dummy_text = '.'
        dummy = self.embed(dummy_text)
        perturb_vector = latent - dummy
        self.model.perturb_vector = perturb_vector
        input_ids = self.tokenizer(dummy_text, return_tensors='pt').to(self.device).input_ids
        # Sampling (do_sample=True) means repeated calls can return different text.
        output = self.model.generate(
            input_ids=input_ids,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            num_return_sequences=1,
        )
        return self.tokenizer.decode(output[0], skip_special_tokens=True)

Next, instantiate the autoencoder wrapper with a model checkpoint.

device = 'cuda' if torch.cuda.is_available() else 'cpu'
autoencoder = BottleneckT5Autoencoder(model_path='thesephist/contra-bottleneck-t5-large-wikipedia', device=device)

💻 Usage examples

Basic usage

texts = [
    'The quick brown fox jumps over the lazy dog',
    'Hi there! My name is Linus, and I spend a lot of my time thinking about latent spaces of neural network models.',
    'Notion is a single space where you can think, write, and plan. Capture thoughts, manage projects, or even run an entire company — and do it exactly the way you want.',
]

for t in texts:
    embedding = autoencoder.embed(t)
    reconstruction = autoencoder.generate_from_latent(embedding)
    print(reconstruction)

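This code generates text like the following. Because generate_from_latent samples with do_sample=True, the exact output can differ between runs.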

The quick brown fox jumps over the lazy dog
I'm named after Linus, and I spend a lot of my time thinking about neural networks of latent space models.
Notion is a single place where you can think, plan, and spend time. Capture ideas, manage projects, and even do your own writing — or plan it exactly the way you want.

For more detailed examples of how to use the model to compute interpolations and semantic edits with Contra, see this Google Colab notebook.

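🔧 Technical details

Contra was initialized from the language-modeling-adapted T5 v1.1 checkpoint and trained for one epoch on a subset of the English Wikipedia dataset filtered by length. It was trained as a denoising autoencoder with 30% of tokens randomly masked, using the Adafactor optimizer.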
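
The corruption step of such a denoising setup can be sketched as follows: a random 30% of the input tokens are replaced by a mask, and the model is trained to reconstruct the original text. The source does not say how the masked positions are chosen or what replaces them, so the per-token sampling, the `<mask>` placeholder, and the `mask_tokens` helper below are all assumptions for illustration.

```python
import random

def mask_tokens(tokens, mask_rate=0.3, mask_token="<mask>", seed=None):
    """Replace a random `mask_rate` fraction of tokens with `mask_token`."""
    rng = random.Random(seed)
    n_mask = round(len(tokens) * mask_rate)
    # Choose distinct positions so exactly n_mask tokens are masked.
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_token if i in masked_positions else tok
            for i, tok in enumerate(tokens)]

# Whitespace-split words stand in for real tokenizer output.
tokens = "the quick brown fox jumps over the lazy dog today".split()
noisy = mask_tokens(tokens, seed=0)
```

The training pair would then be (`noisy`, `tokens`): the corrupted sequence as input and the original sequence as the reconstruction target.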

Model family and checkpoints

I recommend experimenting first with thesephist/contra-bottleneck-t5-large-wikipedia, which strikes a good balance between model size and output quality. I have trained four variants, ranging from 60M to 3B parameters:

thesephist/contra-bottleneck-t5-small-wikipedia: 60M params, 512 embedding dimensions
thesephist/contra-bottleneck-t5-base-wikipedia: 220M params, 768 embedding dimensions
thesephist/contra-bottleneck-t5-large-wikipedia: 770M params, 1024 embedding dimensions
thesephist/contra-bottleneck-t5-xl-wikipedia: 3B params, 2048 embedding dimensions
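
When choosing among these checkpoints, a rough estimate of weight memory can help. The sketch below copies the parameter counts and embedding sizes from the list above and assumes 4 bytes per parameter (32-bit floats), which the source does not confirm. The estimate covers the weights only, and the helper names are hypothetical.

```python
# Parameter counts and embedding dimensions copied from the checkpoint list.
CHECKPOINTS = {
    "thesephist/contra-bottleneck-t5-small-wikipedia": (60_000_000, 512),
    "thesephist/contra-bottleneck-t5-base-wikipedia": (220_000_000, 768),
    "thesephist/contra-bottleneck-t5-large-wikipedia": (770_000_000, 1024),
    "thesephist/contra-bottleneck-t5-xl-wikipedia": (3_000_000_000, 2048),
}

def fp32_weight_gb(n_params):
    """Approximate size of the weights alone, assuming 4 bytes per parameter."""
    return n_params * 4 / 1e9

def largest_checkpoint_within(budget_gb):
    """Return the largest checkpoint whose estimated weights fit in `budget_gb`."""
    fitting = [(params, name) for name, (params, _dim) in CHECKPOINTS.items()
               if fp32_weight_gb(params) <= budget_gb]
    if not fitting:
        raise ValueError("no checkpoint fits in the given budget")
    return max(fitting)[1]

choice = largest_checkpoint_within(4.0)
embedding_dim = CHECKPOINTS[choice][1]
```

Embeddings from different checkpoints have different dimensions, so vectors produced by one variant cannot be decoded or compared with another.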

📄 License

This model is released under the MIT license.
