Open-source contra-bottleneck-t5-large-wikipedia model: text reconstruction from embeddings, semantic editing, and interpolation

Contra Bottleneck T5 Large Wikipedia

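Developed by thesephist

The Bottleneck T5 model is a text autoencoder: it encodes text into an embedding vector, reconstructs the original text from that vector, and supports semantic editing and interpolation.

Text embedding

Transformers

Language: English | Open-source license: MIT | #text-autoencoding #latent-space-editing #semantic-interpolation

Downloads: 1,719

Released: 9/30/2023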

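Model overview

This model is based on a T5-style encoder-decoder architecture with an attention-pooled bottleneck and gated cross-attention. It is used mainly for latent-space representation and semantic editing of text.

Model features

Text autoencoding

Encodes text of up to 512 tokens into an embedding vector and reconstructs the original text from it.

Semantic editing

Edits the tone, length, or topic of a text through vector arithmetic in the latent space.

Normalized embeddings

Embeddings are always normalized to length 1, which simplifies vector arithmetic and comparison.

High-quality reconstruction

Performs best on encyclopedic text, where it reconstructs the original content with high quality.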

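Model capabilities

Text encoding

Text reconstruction

Semantic interpolation

Text editing

Use cases

Text processing

Semantic text editing

Modify the latent-space vector to edit the tone, length, or topic of a text.

Result: text that is semantically similar but differs in style.

Text interpolation

Interpolate semantically between two pieces of text to generate intermediate texts.

Result: a sequence of texts that transitions smoothly from one to the other.

Content generation

Text reconstruction

Reconstruct the original text from its embedding vector.

Result: high-quality reconstructed text.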

🚀 Bottleneck T5 ⏳

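The Bottleneck T5 model powers many experiments and demos that explore interfaces for inspecting and editing text in latent space. It is an autoencoder for text: it encodes text of up to 512 tokens into an embedding vector and then reconstructs the original text from that embedding. The structure of the embedding space it produces also allows semantic edits to text through vector arithmetic in the latent space.

🚀 Quick start

The steps below outline how to load the model, embed text, and reconstruct text from an embedding.

First, define a wrapper class for the model.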

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

class BottleneckT5Autoencoder:
    """Thin wrapper that embeds text and decodes text from an embedding."""

    def __init__(self, model_path: str, device='cpu'):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_length=512)
        # trust_remote_code=True loads the custom model code shipped with the checkpoint.
        self.model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(self.device)
        self.model.eval()

    @torch.no_grad()
    def embed(self, text: str) -> torch.FloatTensor:
        """Encode a text into its bottleneck embedding."""
        inputs = self.tokenizer(text, return_tensors='pt').to(self.device)
        decoder_inputs = self.tokenizer('', return_tensors='pt').to(self.device)
        return self.model(
            **inputs,
            decoder_input_ids=decoder_inputs['input_ids'],
            encode_only=True,  # return the embedding instead of running the decoder
        )[0]

    @torch.no_grad()
    def generate_from_latent(self, latent: torch.FloatTensor, max_length=512, temperature=1.0) -> str:
        """Decode text from an embedding. Sampling is enabled, so output can vary between runs."""
        # Embed a dummy input, then steer the model by the offset between
        # the target latent and the dummy embedding.
        dummy_text = '.'
        dummy = self.embed(dummy_text)
        perturb_vector = latent - dummy
        self.model.perturb_vector = perturb_vector
        input_ids = self.tokenizer(dummy_text, return_tensors='pt').to(self.device).input_ids
        output = self.model.generate(
            input_ids=input_ids,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            num_return_sequences=1,
        )
        return self.tokenizer.decode(output[0], skip_special_tokens=True)

Next, initialize the model.

device = 'cuda' if torch.cuda.is_available() else 'cpu'
autoencoder = BottleneckT5Autoencoder(model_path='thesephist/contra-bottleneck-t5-large-wikipedia', device=device)

You can now embed text and reconstruct it from the embedding.

texts = [
    'The quick brown fox jumps over the lazy dog',
    'Hi there! My name is Linus, and I spend a lot of my time thinking about latent spaces of neural network models.',
    'Notion is a single space where you can think, write, and plan. Capture thoughts, manage projects, or even run an entire company — and do it exactly the way you want.',
]

for t in texts:
    embedding = autoencoder.embed(t)
    reconstruction = autoencoder.generate_from_latent(embedding)
    print(reconstruction)

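This prints text like the following. Because generate_from_latent samples (do_sample=True), the exact output can vary between runs.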

The quick brown fox jumps over the lazy dog
I'm named after Linus, and I spend a lot of my time thinking about neural networks of latent space models.
Notion is a single place where you can think, plan, and spend time. Capture ideas, manage projects, and even do your own writing — or plan it exactly the way you want.
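
The sample output shows that reconstructions can be exact for short sentences and paraphrase-like for longer ones. One rough way to quantify this is a word-level similarity ratio. The sketch below uses Python's standard `difflib`; the helper name `word_similarity` and the choice of metric are illustrative assumptions, not part of the model or its documentation.

```python
from difflib import SequenceMatcher

def word_similarity(original: str, reconstruction: str) -> float:
    """Return a 0..1 similarity ratio over lowercased word sequences."""
    a = original.lower().split()
    b = reconstruction.lower().split()
    return SequenceMatcher(None, a, b).ratio()

# Pairs taken from the example inputs and outputs above.
pairs = [
    ("The quick brown fox jumps over the lazy dog",
     "The quick brown fox jumps over the lazy dog"),
    ("Hi there! My name is Linus, and I spend a lot of my time thinking "
     "about latent spaces of neural network models.",
     "I'm named after Linus, and I spend a lot of my time thinking "
     "about neural networks of latent space models."),
]

scores = [word_similarity(o, r) for o, r in pairs]
for score in scores:
    print(f"{score:.2f}")
```

A ratio of 1.0 means the word sequences match exactly. Lower values indicate reordering or substitution, which this metric cannot distinguish from a change in meaning.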

For more detailed examples of interpolation and semantic editing with this model, see this Google Colab notebook.

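✨ Key features

Text encoding and decoding: encodes text of up to 512 tokens into an embedding vector and reconstructs the original text from that embedding.
Semantic editing in latent space: the structure of the embedding space allows semantic edits to text through vector arithmetic in the latent space.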
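
As a rough illustration of editing by vector arithmetic, the sketch below estimates an attribute direction as the difference between the mean embeddings of two groups of texts, adds it to an embedding, and renormalizes to length 1. This is one common recipe and an assumption here, not necessarily the procedure used in the author's notebook. The toy 3-dimensional vectors stand in for real embeddings, and all function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def attribute_direction(with_attr, without_attr):
    """Direction from the 'without' group mean to the 'with' group mean."""
    pos = mean_vector(with_attr)
    neg = mean_vector(without_attr)
    return [p - q for p, q in zip(pos, neg)]

def apply_edit(embedding, direction, strength=1.0):
    """Shift an embedding along a direction, then renormalize to length 1."""
    shifted = [e + strength * d for e, d in zip(embedding, direction)]
    return normalize(shifted)

# Toy stand-ins for embeddings of "formal" and "casual" texts (assumed data).
formal = [normalize([1.0, 0.2, 0.0]), normalize([0.9, 0.3, 0.1])]
casual = [normalize([0.1, 0.2, 1.0]), normalize([0.0, 0.3, 0.9])]

direction = attribute_direction(formal, casual)
edited = apply_edit(casual[0], direction, strength=0.5)
print([round(x, 3) for x in edited])
```

With the real model, the embeddings would come from `autoencoder.embed(...)` and the edited vector would be passed to `autoencoder.generate_from_latent(...)`. The same arithmetic can be done directly on tensors.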

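📚 Documentation

Model details

The embeddings this model produces can be used to interpolate semantically between texts and to edit sentences along latent attributes such as length, tone, structure, and topic.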
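
A minimal sketch of interpolation between two embeddings, assuming simple linear interpolation followed by renormalization to length 1. The source does not state which interpolation scheme is used, so treat this as one plausible approach. The function names and the toy vectors are illustrative.

```python
import math

def normalize(v):
    """Scale a vector to unit length; reject the zero vector."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def interpolate(a, b, t):
    """Blend a and b with weight t in [0, 1], then renormalize."""
    mixed = [(1 - t) * x + t * y for x, y in zip(a, b)]
    return normalize(mixed)

# Toy unit vectors standing in for the embeddings of two texts.
start = normalize([1.0, 0.0, 0.0])
end = normalize([0.0, 1.0, 0.0])

# Five evenly spaced points from start (t=0) to end (t=1).
path = [interpolate(start, end, i / 4) for i in range(5)]
for point in path:
    print([round(x, 3) for x in point])
```

Decoding each point on the path with `generate_from_latent` would then yield the intermediate texts. Note that the blend is undefined when the two vectors point in exactly opposite directions and `t` is 0.5, which is why `normalize` rejects a zero vector.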

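All Bottleneck T5 models are trained on a filtered subset of English Wikipedia and perform best when encoding and decoding encyclopedic and similar kinds of text. Highly technical, conversational, or otherwise unconventional text may be out of distribution, and the model may perform worse on such inputs.

Bottleneck T5 embeddings are always normalized to length 1: the encoder produces embeddings of length 1, and any input to the decoder is normalized to length 1.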
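
One practical consequence of unit-length embeddings is that cosine similarity reduces to a plain dot product. The sketch below checks this on toy 2-dimensional vectors; the helper names are illustrative and the vectors are not real model outputs.

```python
import math

def l2_normalize(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity for vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

raw_a, raw_b = [3.0, 4.0], [4.0, 3.0]
a, b = l2_normalize(raw_a), l2_normalize(raw_b)

# For unit vectors, the dot product equals the cosine similarity.
print(round(dot(a, b), 6), round(cosine_similarity(raw_a, raw_b), 6))
```

Sums and differences of unit vectors are generally not unit length, so renormalize after any arithmetic if you need a length-1 vector for comparison.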

Attribute	Details
Developed by	Linus Lee
Model type	T5-style encoder-decoder transformer with an attention-pooled bottleneck and gated cross-attention
Language (NLP)	English
License	MIT
Finetuned from model	LM-adapted T5 v1.1

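Training details

Contra was initialized from the language modeling-adapted T5 v1.1 checkpoint and trained for one epoch on a length-filtered subset of the English Wikipedia dataset. It was trained as a denoising autoencoder, with 30% of tokens randomly masked, using the Adafactor optimizer.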
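
To make the denoising setup concrete, the sketch below masks roughly 30% of the tokens in an input. The source does not specify the masking scheme, so independent per-token masking, whitespace tokenization, and the `<mask>` placeholder are all assumptions for illustration. The actual training code may differ, for example by masking spans.

```python
import random

def mask_tokens(tokens, mask_rate=0.3, mask_token="<mask>", seed=None):
    """Replace each token with mask_token independently with probability mask_rate."""
    rng = random.Random(seed)
    return [mask_token if rng.random() < mask_rate else tok for tok in tokens]

# Whitespace splitting stands in for the model's real tokenizer.
tokens = "the quick brown fox jumps over the lazy dog".split()
noisy = mask_tokens(tokens, seed=0)
print(" ".join(noisy))

# Over many tokens, the masked fraction approaches the configured rate.
many = mask_tokens(["x"] * 10_000, seed=1)
masked_fraction = many.count("<mask>") / len(many)
print(f"{masked_fraction:.3f}")
```

In a denoising autoencoder, the model receives the noisy sequence as input and is trained to reproduce the original, unmasked sequence.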

Model family and checkpoints

A good starting point is thesephist/contra-bottleneck-t5-large-wikipedia, which balances model size and output quality well. Four variants have been trained, ranging from 60M to 3B parameters:

thesephist/contra-bottleneck-t5-small-wikipedia: 60M parameters, 512-dimensional embeddings
thesephist/contra-bottleneck-t5-base-wikipedia: 220M parameters, 768-dimensional embeddings
thesephist/contra-bottleneck-t5-large-wikipedia: 770M parameters, 1024-dimensional embeddings
thesephist/contra-bottleneck-t5-xl-wikipedia: 3B parameters, 2048-dimensional embeddings
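
If you pick a checkpoint programmatically, a small lookup table can help. The sketch below records the four checkpoints listed above and selects the largest one that fits a parameter budget. The `Checkpoint` class and `largest_within` helper are hypothetical conveniences, and the parameter counts are the rounded figures from the list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Checkpoint:
    name: str
    parameters: int      # approximate parameter count
    embedding_dim: int   # size of the embedding vector

CHECKPOINTS = [
    Checkpoint("thesephist/contra-bottleneck-t5-small-wikipedia", 60_000_000, 512),
    Checkpoint("thesephist/contra-bottleneck-t5-base-wikipedia", 220_000_000, 768),
    Checkpoint("thesephist/contra-bottleneck-t5-large-wikipedia", 770_000_000, 1024),
    Checkpoint("thesephist/contra-bottleneck-t5-xl-wikipedia", 3_000_000_000, 2048),
]

def largest_within(max_parameters: int) -> Checkpoint:
    """Return the largest checkpoint whose parameter count fits the budget."""
    candidates = [c for c in CHECKPOINTS if c.parameters <= max_parameters]
    if not candidates:
        raise ValueError("no checkpoint fits the given parameter budget")
    return max(candidates, key=lambda c: c.parameters)

choice = largest_within(1_000_000_000)
print(choice.name, choice.embedding_dim)
# prints: thesephist/contra-bottleneck-t5-large-wikipedia 1024
```

The embedding dimension matters when you store or compare vectors: embeddings from different checkpoints have different sizes and are not interchangeable.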

📄 License

This model is released under the MIT license.