🚀 RankGen
RankGen is a suite of encoder models (100M–1.2B parameters) that map prefixes and generations from any pretrained English language model to a shared vector space. It can rerank multiple full-length samples from an LM, improve generation quality when incorporated into beam search, and also serve as a dense retriever with state-of-the-art performance on literary retrieval.
⚡ Quick Start
RankGen can improve both text generation and retrieval. This section walks you through installation, data setup, and a quick sanity check so you can start using it.
✨ Features
- Reranking: RankGen can rerank multiple full-length samples from a language model, enhancing the quality of the output (see the sketch after this list).
- Beam Search Improvement: It can be integrated into beam search as a scoring function, significantly improving generation quality (0.85 vs. 0.77 MAUVE; preferred 75% of the time by human annotators who are English writers).
- Dense Retrieval: Achieves state-of-the-art performance on literary retrieval.
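Conceptually, reranking scores each candidate continuation by the similarity of its suffix vector to the prefix vector in the shared space. Here is a minimal sketch of the idea, assuming the `RankGenEncoder` API shown in the usage examples below (the `rerank_by_score` helper is hypothetical, not part of the library):

```python
import torch

def rerank_by_score(encoder, prefix, candidates):
    # Embed the prefix and every candidate continuation into the shared space.
    prefix_vec = encoder.encode([prefix], vectors_type="prefix")["embeddings"]
    suffix_vecs = encoder.encode(candidates, vectors_type="suffix")["embeddings"]
    # Score each (prefix, candidate) pair by the dot product of their vectors,
    # then return the candidates ordered from best to worst.
    scores = torch.matmul(suffix_vecs, prefix_vec.T).squeeze(-1)
    order = torch.argsort(scores, descending=True)
    return [candidates[i] for i in order.tolist()]
```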
📦 Installation
Requirements
Python 3.7+ is required, along with `torch` (a CUDA-enabled setup is recommended) and `transformers`. `pip` will install these dependencies for you.
Installation Steps
```bash
python3.7 -m virtualenv rankgen-venv
source rankgen-venv/bin/activate
pip install rankgen
```
Data Acquisition
Get the data [here](https://drive.google.com/drive/folders/1DRG2ess7fK3apfB-6KoHb_azMuHbsIv4?usp=sharing) and place the folder in the root directory. Alternatively, use `gdown` as shown below:

```bash
gdown --folder https://drive.google.com/drive/folders/1DRG2ess7fK3apfB-6KoHb_azMuHbsIv4
```
Test the Installation
Run the test script to ensure the RankGen checkpoint has loaded correctly:
```bash
python -m rankgen.test_rankgen_encoder --model_path kalpeshk2011/rankgen-t5-base-all
```
Expected output:

```
0.0009239262409127233
0.0011521980725477804
```
💻 Usage Examples
Basic Usage
Loading RankGen is straightforward; the `RankGenEncoder` wrapper is the most convenient way to do it.
```python
from rankgen import RankGenEncoder, RankGenGenerator

rankgen_encoder = RankGenEncoder("kalpeshk2011/rankgen-t5-xl-all")

# Encode prefixes and suffixes into the shared vector space.
prefix_vectors = rankgen_encoder.encode(["This is a prefix sentence."], vectors_type="prefix")
suffix_vectors = rankgen_encoder.encode(["This is a suffix sentence."], vectors_type="suffix")

# Wrap the encoder and a language model for generation.
generator = RankGenGenerator(rankgen_encoder=rankgen_encoder, language_model="gpt2-medium")

inputs = ["Whatever might be the nature of the tragedy it would be over with long before this, and those moving black spots away yonder to the west, that he had discerned from the bluff, were undoubtedly the departing raiders. There was nothing left for Keith to do except determine the fate of the unfortunates, and give their bodies decent burial. That any had escaped, or yet lived, was altogether unlikely, unless, perchance, women had been in the party, in which case they would have been borne away prisoners."]

# Plain sampling, overgenerate-and-rerank, and RankGen-guided beam search.
print(generator.generate_single(inputs, top_p=0.9)[0][0])
print(generator.overgenerate_rerank(inputs, top_p=0.9, num_samples=10)[0][0])
print(generator.beam_search(inputs, top_p=0.9, num_samples=10, beam_size=2)[0][0])
```
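Here `generate_single` draws a single nucleus-sampled continuation from the language model, `overgenerate_rerank` samples `num_samples` continuations and uses RankGen to pick the best one, and `beam_search` uses RankGen as the scoring function inside beam search.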
Advanced Usage
You can also load the model using the HuggingFace APIs directly.
```python
from transformers import T5Tokenizer, AutoModel

tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-xl")
model = AutoModel.from_pretrained("kalpeshk2011/rankgen-t5-xl-all", trust_remote_code=True)
```
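When loading through the HuggingFace APIs directly, you are responsible for prepending the `pre `/`suffi ` markers and tokenizing yourself. A minimal sketch mirroring what `RankGenEncoder` does internally (see the implementation below):

```python
import torch

# Assumes `tokenizer` and `model` are loaded as above.
batch = tokenizer(["pre This is a prefix sentence."], return_tensors="pt", padding=True)
with torch.inference_mode():
    prefix_vectors = model(**batch)  # the RankGen model returns embedding vectors directly
print(prefix_vectors.shape)
```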
RankGenEncoder Implementation
```python
import torch
import tqdm
from transformers import T5Tokenizer, AutoModel


class RankGenEncoder:
    def __init__(self, model_path, max_batch_size=32, model_size=None, cache_dir=None):
        assert model_path in [
            "kalpeshk2011/rankgen-t5-xl-all",
            "kalpeshk2011/rankgen-t5-xl-pg19",
            "kalpeshk2011/rankgen-t5-base-all",
            "kalpeshk2011/rankgen-t5-large-all",
        ]
        self.max_batch_size = max_batch_size
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        # Infer the model size from the checkpoint name unless given explicitly.
        if model_size is None:
            if "t5-large" in model_path or "t5_large" in model_path:
                self.model_size = "large"
            elif "t5-xl" in model_path or "t5_xl" in model_path:
                self.model_size = "xl"
            else:
                self.model_size = "base"
        else:
            self.model_size = model_size
        # The tokenizer comes from the matching T5 v1.1 checkpoint; the encoder
        # weights themselves are loaded from the RankGen checkpoint.
        self.tokenizer = T5Tokenizer.from_pretrained(f"google/t5-v1_1-{self.model_size}", cache_dir=cache_dir)
        self.model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
        self.model.to(self.device)
        self.model.eval()

    def encode(self, inputs, vectors_type="prefix", verbose=False, return_input_ids=False):
        tokenizer = self.tokenizer
        max_batch_size = self.max_batch_size
        if isinstance(inputs, str):
            inputs = [inputs]
        # Prefixes and suffixes are marked with a special leading token and
        # truncated to different maximum lengths.
        if vectors_type == 'prefix':
            inputs = ['pre ' + input for input in inputs]
            max_length = 512
        else:
            inputs = ['suffi ' + input for input in inputs]
            max_length = 128
        all_embeddings = []
        all_input_ids = []
        for i in tqdm.tqdm(range(0, len(inputs), max_batch_size), total=(len(inputs) // max_batch_size) + 1,
                           disable=not verbose, desc=f"Encoding {vectors_type} inputs:"):
            tokenized_inputs = tokenizer(inputs[i:i + max_batch_size], return_tensors="pt", padding=True)
            for k, v in tokenized_inputs.items():
                tokenized_inputs[k] = v[:, :max_length]
            tokenized_inputs = tokenized_inputs.to(self.device)
            with torch.inference_mode():
                batch_embeddings = self.model(**tokenized_inputs)
            all_embeddings.append(batch_embeddings)
            if return_input_ids:
                all_input_ids.extend(tokenized_inputs.input_ids.cpu().tolist())
        return {
            "embeddings": torch.cat(all_embeddings, dim=0),
            "input_ids": all_input_ids
        }
```
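For example, encoding a single prefix with the encoder above returns a dictionary whose `embeddings` entry holds one vector per input (a usage sketch, with hypothetical input text):

```python
encoder = RankGenEncoder("kalpeshk2011/rankgen-t5-base-all")
out = encoder.encode(["An example prefix."], vectors_type="prefix")
print(out["embeddings"].shape)  # (number of inputs, embedding dimension)
```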
📚 Documentation
Main repository
The main repository for RankGen can be found at https://github.com/martiansideofthemoon/rankgen.
Datasets
RankGen is trained on the following datasets:
- Wikipedia
- PG19
- C4
- RELiC
- ChapterBreak
- HellaSwag
- ROCStories
Metrics
The performance of RankGen is evaluated using the following metrics:
- MAUVE (an automatic quality metric for open-ended generation)
- Human preference evaluation (with annotators who are English writers)
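MAUVE compares a set of model generations against a set of human-written continuations. A sketch using the standalone `mauve-text` package (an assumption; this package is not part of the RankGen repo, and the texts below are hypothetical):

```python
import mauve  # pip install mauve-text

# Hypothetical lists of human-written and model-generated continuations.
human_texts = ["The storm broke just after midnight.", "She never wrote back."]
model_texts = ["Rain began to fall at dusk.", "He waited for a letter that never came."]

out = mauve.compute_mauve(p_text=human_texts, q_text=model_texts, device_id=0, verbose=False)
print(out.mauve)  # a score in (0, 1]; higher means closer to the human distribution
```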
📜 License
This project is licensed under the Apache 2.0 license.