🚀 RankGen
RankGen is a suite of encoder models (100M–1.2B parameters) that map prefixes and generations from any pretrained English language model to a shared vector space. It can rerank multiple full-length samples from an LM, improve generation quality when incorporated into beam search, and also serve as a dense retriever with state-of-the-art performance on literary retrieval.
⚡ Quick Start
RankGen can improve both text generation and retrieval. This section walks you through installation, data setup, and a quick sanity check so you can start using it.
✨ Features
- Reranking: RankGen can rerank multiple full-length samples from a language model, enhancing the quality of the output (see the sketch after this list).
- Beam Search Improvement: It can be integrated into beam search as a scoring function, significantly improving generation quality (0.85 vs. 0.77 MAUVE; preferred 75% of the time by human annotators who are English writers).
- Dense Retrieval: Achieves state-of-the-art performance on literary retrieval.
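Conceptually, reranking scores each candidate continuation by the similarity of its suffix vector to the prefix vector in the shared space. Here is a minimal sketch of the idea, assuming the `RankGenEncoder` API shown in the usage examples below (the `rerank_by_score` helper is hypothetical, not part of the library):

```python
import torch

def rerank_by_score(encoder, prefix, candidates):
    # Embed the prefix and every candidate continuation into the shared space.
    prefix_vec = encoder.encode([prefix], vectors_type="prefix")["embeddings"]
    suffix_vecs = encoder.encode(candidates, vectors_type="suffix")["embeddings"]
    # Score each (prefix, candidate) pair by the dot product of their vectors,
    # then return the candidates ordered from best to worst.
    scores = torch.matmul(suffix_vecs, prefix_vec.T).squeeze(-1)
    order = torch.argsort(scores, descending=True)
    return [candidates[i] for i in order.tolist()]
```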
📦 Installation
Requirements
Python 3.7+ is required, along with `torch` (a CUDA-enabled setup is recommended) and `transformers`. `pip` will install these dependencies for you.
Installation Steps
```bash
python3.7 -m virtualenv rankgen-venv
source rankgen-venv/bin/activate
pip install rankgen
```
Data Acquisition
Get the data [here](https://drive.google.com/drive/folders/1DRG2ess7fK3apfB-6KoHb_azMuHbsIv4?usp=sharing) and place the folder in the root directory. Alternatively, use `gdown` as shown below:

```bash
gdown --folder https://drive.google.com/drive/folders/1DRG2ess7fK3apfB-6KoHb_azMuHbsIv4
```
Test the Installation
Run the test script to ensure the RankGen checkpoint has loaded correctly:
```bash
python -m rankgen.test_rankgen_encoder --model_path kalpeshk2011/rankgen-t5-base-all
```
Expected output:

```
0.0009239262409127233
0.0011521980725477804
```
💻 Usage Examples
Basic Usage
Loading RankGen is straightforward; the `RankGenEncoder` wrapper is the most convenient way to do it.
```python
from rankgen import RankGenEncoder, RankGenGenerator

rankgen_encoder = RankGenEncoder("kalpeshk2011/rankgen-t5-xl-all")

# Encode prefixes and suffixes into the shared vector space.
prefix_vectors = rankgen_encoder.encode(["This is a prefix sentence."], vectors_type="prefix")
suffix_vectors = rankgen_encoder.encode(["This is a suffix sentence."], vectors_type="suffix")

# Wrap the encoder and a language model for generation.
generator = RankGenGenerator(rankgen_encoder=rankgen_encoder, language_model="gpt2-medium")

inputs = ["Whatever might be the nature of the tragedy it would be over with long before this, and those moving black spots away yonder to the west, that he had discerned from the bluff, were undoubtedly the departing raiders. There was nothing left for Keith to do except determine the fate of the unfortunates, and give their bodies decent burial. That any had escaped, or yet lived, was altogether unlikely, unless, perchance, women had been in the party, in which case they would have been borne away prisoners."]

# Plain sampling, overgenerate-and-rerank, and RankGen-guided beam search.
print(generator.generate_single(inputs, top_p=0.9)[0][0])
print(generator.overgenerate_rerank(inputs, top_p=0.9, num_samples=10)[0][0])
print(generator.beam_search(inputs, top_p=0.9, num_samples=10, beam_size=2)[0][0])
```
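Here `generate_single` draws a single nucleus-sampled continuation from the language model, `overgenerate_rerank` samples `num_samples` continuations and uses RankGen to pick the best one, and `beam_search` uses RankGen as the scoring function inside beam search.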
Advanced Usage
You can also load the model using the HuggingFace APIs directly.
```python
from transformers import T5Tokenizer, AutoModel

tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-xl")
model = AutoModel.from_pretrained("kalpeshk2011/rankgen-t5-xl-all", trust_remote_code=True)
```
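When loading through the HuggingFace APIs directly, you are responsible for prepending the `pre `/`suffi ` markers and tokenizing yourself. A minimal sketch mirroring what `RankGenEncoder` does internally (see the implementation below):

```python
import torch

# Assumes `tokenizer` and `model` are loaded as above.
batch = tokenizer(["pre This is a prefix sentence."], return_tensors="pt", padding=True)
with torch.inference_mode():
    prefix_vectors = model(**batch)  # the RankGen model returns embedding vectors directly
print(prefix_vectors.shape)
```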
RankGenEncoder Implementation
```python
import torch
import tqdm
from transformers import T5Tokenizer, AutoModel


class RankGenEncoder:
    def __init__(self, model_path, max_batch_size=32, model_size=None, cache_dir=None):
        assert model_path in [
            "kalpeshk2011/rankgen-t5-xl-all",
            "kalpeshk2011/rankgen-t5-xl-pg19",
            "kalpeshk2011/rankgen-t5-base-all",
            "kalpeshk2011/rankgen-t5-large-all",
        ]
        self.max_batch_size = max_batch_size
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        # Infer the model size from the checkpoint name unless given explicitly.
        if model_size is None:
            if "t5-large" in model_path or "t5_large" in model_path:
                self.model_size = "large"
            elif "t5-xl" in model_path or "t5_xl" in model_path:
                self.model_size = "xl"
            else:
                self.model_size = "base"
        else:
            self.model_size = model_size
        # The tokenizer comes from the matching T5 v1.1 checkpoint; the encoder
        # weights themselves are loaded from the RankGen checkpoint.
        self.tokenizer = T5Tokenizer.from_pretrained(f"google/t5-v1_1-{self.model_size}", cache_dir=cache_dir)
        self.model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
        self.model.to(self.device)
        self.model.eval()

    def encode(self, inputs, vectors_type="prefix", verbose=False, return_input_ids=False):
        tokenizer = self.tokenizer
        max_batch_size = self.max_batch_size
        if isinstance(inputs, str):
            inputs = [inputs]
        # Prefixes and suffixes are marked with a special leading token and
        # truncated to different maximum lengths.
        if vectors_type == 'prefix':
            inputs = ['pre ' + input for input in inputs]
            max_length = 512
        else:
            inputs = ['suffi ' + input for input in inputs]
            max_length = 128
        all_embeddings = []
        all_input_ids = []
        for i in tqdm.tqdm(range(0, len(inputs), max_batch_size), total=(len(inputs) // max_batch_size) + 1,
                           disable=not verbose, desc=f"Encoding {vectors_type} inputs:"):
            tokenized_inputs = tokenizer(inputs[i:i + max_batch_size], return_tensors="pt", padding=True)
            for k, v in tokenized_inputs.items():
                tokenized_inputs[k] = v[:, :max_length]
            tokenized_inputs = tokenized_inputs.to(self.device)
            with torch.inference_mode():
                batch_embeddings = self.model(**tokenized_inputs)
            all_embeddings.append(batch_embeddings)
            if return_input_ids:
                all_input_ids.extend(tokenized_inputs.input_ids.cpu().tolist())
        return {
            "embeddings": torch.cat(all_embeddings, dim=0),
            "input_ids": all_input_ids
        }
```
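For example, encoding a single prefix with the encoder above returns a dictionary whose `embeddings` entry holds one vector per input (a usage sketch, with hypothetical input text):

```python
encoder = RankGenEncoder("kalpeshk2011/rankgen-t5-base-all")
out = encoder.encode(["An example prefix."], vectors_type="prefix")
print(out["embeddings"].shape)  # (number of inputs, embedding dimension)
```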
📚 Documentation
Main repository
The main repository for RankGen can be found at https://github.com/martiansideofthemoon/rankgen.
Datasets
RankGen is trained on the following datasets:
- Wikipedia
- PG19
- C4
- RELiC
- ChapterBreak
- HellaSwag
- ROCStories
Metrics
The performance of RankGen is evaluated using the following metrics:
- MAUVE (an automatic quality metric for open-ended generation)
- Human preference evaluation (with annotators who are English writers)
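MAUVE compares a set of model generations against a set of human-written continuations. A sketch using the standalone `mauve-text` package (an assumption; this package is not part of the RankGen repo, and the texts below are hypothetical):

```python
import mauve  # pip install mauve-text

# Hypothetical lists of human-written and model-generated continuations.
human_texts = ["The storm broke just after midnight.", "She never wrote back."]
model_texts = ["Rain began to fall at dusk.", "He waited for a letter that never came."]

out = mauve.compute_mauve(p_text=human_texts, q_text=model_texts, device_id=0, verbose=False)
print(out.mauve)  # a score in (0, 1]; higher means closer to the human distribution
```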
📜 License
This project is licensed under the Apache 2.0 license.