Contra-bottleneck-t5-base-wikipedia Open-source Text Processing Model - Text Encoding Reconstruction and Semantic Editing

Contra Bottleneck T5 Base Wikipedia

Developed by thesephist

A text autoencoder based on the T5 architecture that encodes text into embedding vectors and reconstructs it, supporting latent space semantic editing

Large Language Model

Transformers

Language: English

Open Source License: MIT

#Text Autoencoding #Latent Space Editing #Semantic Interpolation

Release Time : 9/30/2023

Model Overview

This model is a text autoencoder capable of encoding text up to 512 tokens into embedding vectors and reconstructing the original text from them. The generated embedding space structure allows for semantic editing of text through vector operations.

Model Features

Latent Space Semantic Editing

Supports editing text semantic attributes (e.g., length, tone, topic) through embedding vector operations

Normalized Embedding Space

All embedding vectors are automatically normalized to unit length, facilitating vector operations and comparisons

Encyclopedia Optimization

Specially trained on Wikipedia data, making it most suitable for processing encyclopedia-like text
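As a rough illustration of the "Normalized Embedding Space" feature, the sketch below shows what unit-length normalization means for vector operations. It uses plain Python lists rather than real model outputs, and the helper names `normalize` and `l2_norm` are illustrative, not part of the model's API.

```python
import math

def l2_norm(vec):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in vec))

def normalize(vec):
    """Scale a vector to unit length. Hypothetical helper, not the model's own code."""
    norm = l2_norm(vec)
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in vec]

# Two toy "embeddings", each scaled to length 1.
a = normalize([3.0, 4.0, 0.0])
b = normalize([0.0, 0.0, 2.0])

# Arithmetic on unit vectors generally leaves the unit sphere...
summed = [x + y for x, y in zip(a, b)]

# ...so the result is rescaled before being treated as an embedding again.
rescaled = normalize(summed)

print(l2_norm(a), l2_norm(summed), l2_norm(rescaled))
```

Because every embedding has length 1, the dot product of two embeddings is also their cosine similarity, which keeps comparisons simple.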

Model Capabilities

Encode text into embedding vectors

Reconstruct text from embedding vectors

Text semantic interpolation

Latent space text editing

Use Cases

Text Processing

Text Style Transfer

Modify text tone or style through latent space vector operations

Can convert formal text into colloquial expressions or adjust text sentiment

Text Summarization

Generate more concise versions of text through latent space operations

Maintains core semantics while shortening text length

Semantic Analysis

Text Similarity Calculation

Evaluate text semantic similarity by comparing embedding vectors

Can be used for document retrieval or clustering analysis
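A minimal sketch of the similarity use case, assuming embeddings have already been converted to plain lists of floats. The vectors here are made-up toy values, and `cosine_similarity` and `rank_by_similarity` are hypothetical helper names rather than anything shipped with the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, candidates):
    """Return (name, score) pairs, most similar to the query first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional vectors standing in for real embeddings.
query = [1.0, 0.0, 0.0]
docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.5, 0.5, 0.0],
}

ranking = rank_by_similarity(query, docs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

For retrieval, the same ranking would be applied to the embeddings of a document collection; for clustering, the pairwise scores could feed any standard clustering routine.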

🚀 Bottleneck T5 ⏳

The Bottleneck T5 model serves as the backbone for numerous experiments and demos that explore interfaces for inspecting and editing text in latent space. As a text autoencoder, it can encode text up to 512 tokens into an embedding and then reconstruct the original text from it. The structure of the embedding space generated by this model enables semantic text edits through vector arithmetic in the latent space.

🚀 Quick Start

The model is currently in a prototype state implemented on top of the T5 language model. To use it for embedding and generating text, we need a small wrapper class around it.

Basic Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

class BottleneckT5Autoencoder:
    def __init__(self, model_path: str, device='cpu'):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_length=512)
        # trust_remote_code=True is needed because the checkpoint ships its own model code.
        self.model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(self.device)
        self.model.eval()

    @torch.no_grad()
    def embed(self, text: str) -> torch.FloatTensor:
        """Encode text into its latent embedding."""
        inputs = self.tokenizer(text, return_tensors='pt').to(self.device)
        # The forward pass expects decoder inputs even when only the encoder output is used.
        decoder_inputs = self.tokenizer('', return_tensors='pt').to(self.device)
        return self.model(
            **inputs,
            decoder_input_ids=decoder_inputs['input_ids'],
            encode_only=True,
        )[0]

    @torch.no_grad()
    def generate_from_latent(self, latent: torch.FloatTensor, max_length=512, temperature=1.0) -> str:
        """Decode text from a latent embedding by sampling."""
        # Encode a dummy input, then store the offset from its latent to the target latent.
        # The model's custom code reads perturb_vector during generation, presumably
        # so that the decoder ends up conditioned on `latent`.
        dummy_text = '.'
        dummy = self.embed(dummy_text)
        perturb_vector = latent - dummy
        self.model.perturb_vector = perturb_vector
        input_ids = self.tokenizer(dummy_text, return_tensors='pt').to(self.device).input_ids
        # Sampling is enabled, so repeated calls can return different text.
        output = self.model.generate(
            input_ids=input_ids,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            num_return_sequences=1,
        )
        return self.tokenizer.decode(output[0], skip_special_tokens=True)

Advanced Usage

device = 'cuda' if torch.cuda.is_available() else 'cpu'
autoencoder = BottleneckT5Autoencoder(model_path='thesephist/contra-bottleneck-t5-large-wikipedia', device=device)

texts = [
    'The quick brown fox jumps over the lazy dog',
    'Hi there! My name is Linus, and I spend a lot of my time thinking about latent spaces of neural network models.',
    'Notion is a single space where you can think, write, and plan. Capture thoughts, manage projects, or even run an entire company — and do it exactly the way you want.',
]

for t in texts:
    embedding = autoencoder.embed(t)
    reconstruction = autoencoder.generate_from_latent(embedding)
    print(reconstruction)

The above code produces text like the following. Because generation uses sampling, the exact output varies between runs:

The quick brown fox jumps over the lazy dog
I'm named after Linus, and I spend a lot of my time thinking about neural networks of latent space models.
Notion is a single place where you can think, plan, and spend time. Capture ideas, manage projects, and even do your own writing — or plan it exactly the way you want.

For more examples on how to use the model to compute interpolations and semantic edits with Contra, see this Google Colab notebook.
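As a rough idea of what interpolation between two embeddings can look like, here is a minimal sketch using spherical linear interpolation (slerp) on plain Python lists. Slerp is one reasonable choice for unit-length vectors; it is an assumption here, and the notebook may use a different method. The `slerp` helper is hypothetical, not part of the model's API.

```python
import math

def slerp(u, v, t):
    """Spherical interpolation between unit vectors u and v, with t in [0, 1]."""
    # Clamp the dot product to guard against floating-point drift outside [-1, 1].
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(dot)      # angle between the two vectors
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        if dot > 0:
            return list(u)      # vectors are (nearly) identical
        raise ValueError("slerp is undefined for opposite vectors")
    weight_u = math.sin((1.0 - t) * omega) / sin_omega
    weight_v = math.sin(t * omega) / sin_omega
    return [weight_u * a + weight_v * b for a, b in zip(u, v)]

# Toy 2-dimensional unit vectors standing in for two text embeddings.
start = [1.0, 0.0]
end = [0.0, 1.0]

for step in range(5):
    t = step / 4
    point = slerp(start, end, t)
    print(f"t={t:.2f} -> {[round(x, 3) for x in point]}")
```

With the wrapper class above, each interpolated vector would be converted back to a tensor and passed to `generate_from_latent` to decode the in-between text. Since decoder inputs are normalized anyway, a plain linear mix of the two embeddings would likely work as well.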

✨ Features

Semantic Interpolation and Editing: Using embeddings produced by this model, we can semantically interpolate between pieces of text and edit sentences using their latent attributes like length, tone, structure, or topic.
Normalized Embeddings: Bottleneck T5 embeddings are always normalized to length 1; the encoder produces embeddings of length 1, and any inputs to the decoder will be normalized to length 1.
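One common way to edit a latent attribute is to estimate a direction from example embeddings and move along it. The sketch below uses a difference of means for this. That is an assumption for illustration, not necessarily how the author derives edit directions, and all helper names and toy vectors are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in vec]

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def attribute_direction(with_attr, without_attr):
    """Mean embedding of texts that have the attribute minus the mean of those that lack it."""
    mean_with = mean_vector(with_attr)
    mean_without = mean_vector(without_attr)
    return [a - b for a, b in zip(mean_with, mean_without)]

def apply_edit(embedding, direction, strength=1.0):
    """Move an embedding along a direction, then renormalize to unit length."""
    moved = [e + strength * d for e, d in zip(embedding, direction)]
    return normalize(moved)

# Toy 2-dimensional vectors standing in for embeddings of, say, long and short texts.
long_texts = [[1.0, 1.0], [1.0, 3.0]]
short_texts = [[1.0, 0.0], [1.0, 0.0]]

direction = attribute_direction(long_texts, short_texts)
edited = apply_edit([1.0, 0.0], direction, strength=0.5)
print(direction, edited)
```

The `strength` parameter controls how far the edit goes; decoding the edited vector would then yield text with more (or, for negative values, less) of the attribute.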

📚 Documentation

Model Details

Developed by: Linus Lee
Model type: T5-style encoder-decoder transformer with an attention-pooled bottleneck and gated cross-attention
Language(s) (NLP): English
License: MIT
Finetuned from model: LM-adapted T5 v1.1

All Bottleneck T5 models are trained on a filtered subset of the English Wikipedia and perform best at encoding and decoding encyclopedic and similar kinds of text. Text that's heavily technical, conversational, or otherwise unconventional may be out of distribution for the model, and the model may not perform as well on such inputs.

Training Details

Contra was initialized from the [language modeling-adapted T5 v1.1 checkpoint](https://huggingface.co/models?other=t5-lm-adapt) and trained on a subset of the English Wikipedia dataset filtered for length, for a single epoch, as a denoising autoencoder with 30% of tokens randomly masked, using the Adafactor optimizer.
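To make the denoising setup concrete, the sketch below masks roughly 30% of the tokens in a sequence at random. It is only an illustration under stated assumptions: the `<mask>` placeholder, the word-level tokens, and the choice to mask an exact count rather than each token independently are all hypothetical, and the actual training code may differ (for example by using T5 sentinel tokens).

```python
import random

def mask_tokens(tokens, mask_rate=0.3, mask_token="<mask>", seed=None):
    """Replace a random subset of tokens with a placeholder.

    Hypothetical helper: masks round(len(tokens) * mask_rate) positions
    chosen uniformly at random without replacement.
    """
    rng = random.Random(seed)
    num_to_mask = round(len(tokens) * mask_rate)
    masked_positions = set(rng.sample(range(len(tokens)), num_to_mask))
    return [mask_token if i in masked_positions else tok
            for i, tok in enumerate(tokens)]

# Word-level tokens for readability; a real pipeline would use tokenizer IDs.
tokens = "the quick brown fox jumps over the lazy sleeping dog".split()
noisy = mask_tokens(tokens, mask_rate=0.3, seed=0)

print(" ".join(noisy))
```

During training, the model would receive the noisy sequence as input and be asked to reconstruct the original, unmasked text.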

Model family and checkpoints

I recommend experimenting first with thesephist/contra-bottleneck-t5-large-wikipedia, which strikes a good balance between model size and output quality, but I've trained four variants ranging from 60M to 3B parameters:

[thesephist/contra-bottleneck-t5-small-wikipedia](https://huggingface.co/thesephist/contra-bottleneck-t5-small-wikipedia): 60M params, 512 embedding dimensions
[thesephist/contra-bottleneck-t5-base-wikipedia](https://huggingface.co/thesephist/contra-bottleneck-t5-base-wikipedia): 220M params, 768 embedding dimensions
[thesephist/contra-bottleneck-t5-large-wikipedia](https://huggingface.co/thesephist/contra-bottleneck-t5-large-wikipedia): 770M params, 1024 embedding dimensions
[thesephist/contra-bottleneck-t5-xl-wikipedia](https://huggingface.co/thesephist/contra-bottleneck-t5-xl-wikipedia): 3B params, 2048 embedding dimensions
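When sizing a vector store or choosing a checkpoint programmatically, the list above can be kept as a small lookup table. The sketch below is just one way to arrange it: the `CHECKPOINTS` dictionary and helper names are hypothetical, with values copied from the list.

```python
# Checkpoint sizes and embedding dimensions, copied from the list above.
CHECKPOINTS = {
    "small": {"path": "thesephist/contra-bottleneck-t5-small-wikipedia", "params": "60M", "dim": 512},
    "base": {"path": "thesephist/contra-bottleneck-t5-base-wikipedia", "params": "220M", "dim": 768},
    "large": {"path": "thesephist/contra-bottleneck-t5-large-wikipedia", "params": "770M", "dim": 1024},
    "xl": {"path": "thesephist/contra-bottleneck-t5-xl-wikipedia", "params": "3B", "dim": 2048},
}

def embedding_dim(size):
    """Embedding dimension for a checkpoint size such as 'base'."""
    return CHECKPOINTS[size]["dim"]

def sizes_with_dim_at_most(max_dim):
    """Checkpoint sizes whose embeddings fit within max_dim dimensions."""
    return [name for name, info in CHECKPOINTS.items() if info["dim"] <= max_dim]

print(embedding_dim("base"))
print(sizes_with_dim_at_most(1024))
```

Embeddings from different checkpoints have different dimensions, so vectors produced by one variant should not be mixed with or decoded by another.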

🔧 Technical Details

The model is an autoencoder for text, encoding text up to 512 tokens into an embedding and then reconstructing the original text from the embedding. The structure of the embedding space allows for semantic edits to text through vector arithmetic in latent space. It is trained on a filtered subset of the English Wikipedia as a denoising autoencoder with 30% of tokens randomly masked, using the Adafactor optimizer.