Bottleneck T5
The Bottleneck T5 model is the backbone for numerous experiments and demos that explore interfaces for inspecting and editing text in latent space. As a text autoencoder, it encodes text of up to 512 tokens into an embedding and then reconstructs the original text from that embedding. The structure of the embedding space enables semantic edits to text through vector arithmetic in latent space.
Quick Start
The model is currently in a prototype state implemented on top of the T5 language model. To use it for embedding and generating text, we need a small wrapper class around it.
Basic Usage
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM


class BottleneckT5Autoencoder:
    def __init__(self, model_path: str, device='cpu'):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_length=512)
        # trust_remote_code=True loads the custom model code shipped with the checkpoint.
        self.model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(self.device)
        self.model.eval()

    @torch.no_grad()
    def embed(self, text: str) -> torch.FloatTensor:
        """Encode a piece of text into its latent embedding."""
        inputs = self.tokenizer(text, return_tensors='pt').to(self.device)
        decoder_inputs = self.tokenizer('', return_tensors='pt').to(self.device)
        # encode_only is a flag understood by the checkpoint's custom model code.
        return self.model(
            **inputs,
            decoder_input_ids=decoder_inputs['input_ids'],
            encode_only=True,
        )[0]

    @torch.no_grad()
    def generate_from_latent(self, latent: torch.FloatTensor, max_length=512, temperature=1.0) -> str:
        """Decode a latent embedding back into text by sampling."""
        # Embed a dummy input, then set the offset that moves it to the target latent.
        dummy_text = '.'
        dummy = self.embed(dummy_text)
        perturb_vector = latent - dummy
        self.model.perturb_vector = perturb_vector
        input_ids = self.tokenizer(dummy_text, return_tensors='pt').to(self.device).input_ids
        output = self.model.generate(
            input_ids=input_ids,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            num_return_sequences=1,
        )
        return self.tokenizer.decode(output[0], skip_special_tokens=True)
Advanced Usage
device = 'cuda' if torch.cuda.is_available() else 'cpu'
autoencoder = BottleneckT5Autoencoder(model_path='thesephist/contra-bottleneck-t5-large-wikipedia', device=device)

texts = [
    'The quick brown fox jumps over the lazy dog',
    'Hi there! My name is Linus, and I spend a lot of my time thinking about latent spaces of neural network models.',
    'Notion is a single space where you can think, write, and plan. Capture thoughts, manage projects, or even run an entire company - and do it exactly the way you want.',
]

for t in texts:
    embedding = autoencoder.embed(t)
    reconstruction = autoencoder.generate_from_latent(embedding)
    print(reconstruction)
The above code produces the text:
The quick brown fox jumps over the lazy dog
I'm named after Linus, and I spend a lot of my time thinking about neural networks of latent space models.
Notion is a single place where you can think, plan, and spend time. Capture ideas, manage projects, and even do your own writing - or plan it exactly the way you want.
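The reconstructions above are close paraphrases rather than exact copies, and because decoding samples with `do_sample=True`, they can differ from run to run. One rough way to put a number on how much of the input survives is a word-level similarity score. The sketch below uses Python's standard `difflib`; the `word_similarity` helper is hypothetical and not part of the model or its notebook, and the metric is only a crude proxy for semantic fidelity.

```python
from difflib import SequenceMatcher


def word_similarity(original: str, reconstruction: str) -> float:
    """Return a 0..1 score for how many words line up, ignoring case."""
    a = original.lower().split()
    b = reconstruction.lower().split()
    return SequenceMatcher(None, a, b).ratio()


# Input/output pairs copied from the example run above.
pairs = [
    (
        'The quick brown fox jumps over the lazy dog',
        'The quick brown fox jumps over the lazy dog',
    ),
    (
        'Hi there! My name is Linus, and I spend a lot of my time thinking '
        'about latent spaces of neural network models.',
        "I'm named after Linus, and I spend a lot of my time thinking "
        'about neural networks of latent space models.',
    ),
]

for original, reconstruction in pairs:
    print(f'{word_similarity(original, reconstruction):.2f}')
```

A score of 1.0 means the word sequences match exactly; lower scores flag reconstructions worth reading by hand.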
For more examples on how to use the model to compute interpolations and semantic edits with Contra, see this Google Colab notebook.
Features
- Semantic Interpolation and Editing: Using embeddings produced by this model, we can semantically interpolate between pieces of text and edit sentences using their latent attributes like length, tone, structure, or topic.
- Normalized Embeddings: Bottleneck T5 embeddings are always normalized to length 1; the encoder produces embeddings of length 1, and any inputs to the decoder will be normalized to length 1.
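As a minimal sketch of how interpolation interacts with normalization, the snippet below blends two vectors linearly and then rescales the result to length 1, which is the kind of input the decoder expects. Plain Python lists stand in for real embeddings so the arithmetic is easy to follow; the `normalize` and `interpolate` helpers are illustrative names, not part of the model's code, and linear blending is just one possible choice of interpolation.

```python
import math


def normalize(v):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError('cannot normalize a zero vector')
    return [x / norm for x in v]


def interpolate(a, b, t):
    """Blend a and b linearly (t=0 gives a, t=1 gives b), then renormalize."""
    mixed = [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    return normalize(mixed)


# Toy 3-dimensional stand-ins for real embeddings.
emb_a = normalize([1.0, 0.0, 0.0])
emb_b = normalize([0.0, 1.0, 0.0])
midpoint = interpolate(emb_a, emb_b, 0.5)
```

With the wrapper class from Quick Start, the same arithmetic would be applied to the tensors returned by `embed` before passing the result to `generate_from_latent`.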
Documentation
Model Details
- Developed by: Linus Lee
- Model type: T5-style encoder-decoder transformer with an attention-pooled bottleneck and gated cross-attention
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: LM-adapted T5 v1.1
All Bottleneck T5 models are trained on a filtered subset of English Wikipedia and perform best at encoding and decoding encyclopedic and similar kinds of text. Text that is heavily technical, conversational, or otherwise unconventional may be out of distribution, and the model may not perform as well on such inputs.
Training Details
Contra was initialized from the [language modeling-adapted T5 v1.1 checkpoint](https://huggingface.co/models?other=t5-lm-adapt) and trained on a subset of the English Wikipedia dataset filtered for length, for a single epoch, as a denoising autoencoder with 30% of tokens randomly masked, using the Adafactor optimizer.
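To make the denoising objective concrete, here is a small sketch of corrupting an input by masking 30% of its tokens at random. The training pair would then be the corrupted sequence as input and the original sequence as the reconstruction target. The details are assumptions for illustration: whitespace-split words stand in for tokenizer tokens, and the `<mask>` placeholder and `mask_tokens` helper are hypothetical, so the actual training code may mask differently.

```python
import random


def mask_tokens(tokens, mask_rate=0.3, mask_token='<mask>', seed=None):
    """Replace a random mask_rate fraction of tokens with mask_token."""
    rng = random.Random(seed)
    n_mask = round(len(tokens) * mask_rate)
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    return [
        mask_token if i in masked_positions else tok
        for i, tok in enumerate(tokens)
    ]


# Whitespace-split words stand in for real tokenizer output (an assumption).
tokens = 'the quick brown fox jumps over the lazy dog today'.split()
corrupted = mask_tokens(tokens, seed=0)
print(' '.join(corrupted))
```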
Model family and checkpoints
I recommend experimenting first with thesephist/contra-bottleneck-t5-large-wikipedia, which strikes a good balance between model size and output quality, but I've trained four variants ranging from 60M to 3B parameters:
- [thesephist/contra-bottleneck-t5-small-wikipedia](https://huggingface.co/thesephist/contra-bottleneck-t5-small-wikipedia): 60M params, 512 embedding dimensions
- [thesephist/contra-bottleneck-t5-base-wikipedia](https://huggingface.co/thesephist/contra-bottleneck-t5-base-wikipedia): 220M params, 768 embedding dimensions
- [thesephist/contra-bottleneck-t5-large-wikipedia](https://huggingface.co/thesephist/contra-bottleneck-t5-large-wikipedia): 770M params, 1024 embedding dimensions
- [thesephist/contra-bottleneck-t5-xl-wikipedia](https://huggingface.co/thesephist/contra-bottleneck-t5-xl-wikipedia): 3B params, 2048 embedding dimensions
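Embeddings from different checkpoints have different sizes, so vectors produced by one variant cannot be mixed with or decoded by another. A small lookup like the sketch below can guard against that. The dimensions are copied from the list above; the `EMBEDDING_DIMS` table and `check_latent_size` helper are hypothetical conveniences, not something the model provides.

```python
# Embedding dimensions per checkpoint, copied from the list above.
EMBEDDING_DIMS = {
    'thesephist/contra-bottleneck-t5-small-wikipedia': 512,
    'thesephist/contra-bottleneck-t5-base-wikipedia': 768,
    'thesephist/contra-bottleneck-t5-large-wikipedia': 1024,
    'thesephist/contra-bottleneck-t5-xl-wikipedia': 2048,
}


def check_latent_size(model_path: str, latent_size: int) -> None:
    """Raise if a latent vector's size does not match the chosen checkpoint."""
    expected = EMBEDDING_DIMS[model_path]
    if latent_size != expected:
        raise ValueError(
            f'{model_path} expects {expected}-dimensional embeddings, '
            f'got {latent_size}'
        )


check_latent_size('thesephist/contra-bottleneck-t5-large-wikipedia', 1024)
```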
Technical Details
The model is an autoencoder for text, encoding text up to 512 tokens into an embedding and then reconstructing the original text from the embedding. The structure of the embedding space allows for semantic edits to text through vector arithmetic in latent space. It is trained on a filtered subset of the English Wikipedia as a denoising autoencoder with 30% of tokens randomly masked, using the Adafactor optimizer.
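One common way to perform a semantic edit with vector arithmetic is to estimate a direction for an attribute and add it to an embedding. The sketch below takes the difference between the mean embedding of examples that have the attribute and the mean of examples that lack it, shifts a target embedding along that direction, and rescales the result to length 1. This is an illustrative recipe under stated assumptions, not necessarily how the author's demos compute edits; the helper names are hypothetical, and plain lists stand in for real embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError('cannot normalize a zero vector')
    return [x / norm for x in v]


def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def attribute_direction(with_attr, without_attr):
    """Direction pointing from examples lacking an attribute to those with it."""
    return [p - q for p, q in zip(mean(with_attr), mean(without_attr))]


def apply_edit(embedding, direction, strength=1.0):
    """Shift an embedding along a direction, then renormalize to length 1."""
    shifted = [e + strength * d for e, d in zip(embedding, direction)]
    return normalize(shifted)


# Toy 2-dimensional stand-ins: the attribute lives on the second axis.
with_attr = [[0.0, 1.0], [0.0, 1.0]]
without_attr = [[1.0, 0.0], [1.0, 0.0]]
direction = attribute_direction(with_attr, without_attr)
edited = apply_edit([1.0, 0.0], direction, strength=0.5)
```

The `strength` parameter controls how far the edit moves; larger values push the result closer to the attribute at the cost of drifting from the original text.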
License
The model is released under the MIT license.