# Pythia-2.8B Deduped Synthetic Instruct Model
This is a fine-tuned language model based on the Transformer architecture. It takes the pre-trained EleutherAI/pythia-2.8b-deduped model and fine-tunes it on the Dahoas/synthetic-instruct-gptj-pairwise dataset to produce high-quality answers to natural-language questions.
## 🚀 Quick Start
### Prerequisites
Running inference with the model in float16 requires roughly 7 GB of GPU memory.
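If you want to confirm your GPU has enough headroom before loading the model, a quick check is shown below; this snippet is an optional aid and not part of the original instructions:

```python
import torch

# Optional: verify that roughly 7 GB of GPU memory is free before loading the model.
if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"Free GPU memory: {free_bytes / 1e9:.1f} GB / {total_bytes / 1e9:.1f} GB")
else:
    print("No CUDA device found; the example below will run on CPU (slowly).")
```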
### Code Example
```python
import torch
from transformers import AutoTokenizer, pipeline, StoppingCriteria, StoppingCriteriaList

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

model_name = "lambdalabs/pythia-2.8b-deduped-synthetic-instruct"
max_new_tokens = 2048
stop_token = "<|stop|>"


class KeywordsStoppingCriteria(StoppingCriteria):
    """Stop generation as soon as the last generated token is one of the keywords."""

    def __init__(self, keywords_ids: list):
        self.keywords = keywords_ids

    def __call__(
        self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs
    ) -> bool:
        return input_ids[0][-1] in self.keywords


tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_tokens([stop_token])

# Token id of the stop token, used to halt generation.
stop_ids = [tokenizer.encode(w)[0] for w in [stop_token]]
stop_criteria = KeywordsStoppingCriteria(stop_ids)

generator = pipeline(
    "text-generation",
    model=model_name,
    device=device,
    max_new_tokens=max_new_tokens,
    torch_dtype=torch.float16,
    stopping_criteria=StoppingCriteriaList([stop_criteria]),
)

# The model expects prompts in the "Question: ...\nAnswer:" format.
example = "How can I make an omelette."
text = "Question: {}\nAnswer:".format(example)

result = generator(
    text,
    num_return_sequences=1,
)

output = result[0]["generated_text"]
print(output)
```
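The pipeline echoes the prompt and ends the completion with the literal `<|stop|>` marker, as the output below shows. If you only want the answer text, you can strip both after generation; a minimal sketch (this post-processing step is an assumption, not part of the original example):

```python
# Remove the echoed prompt and the trailing stop token to get a clean answer.
answer = output[len(text):].replace(stop_token, "").strip()
print(answer)
```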
### Output
```
Question: How can I make an omelette.
Answer:To make an omelette, start by cracking two eggs into a bowl and whisking them together. Add a splash of milk and a pinch of salt and pepper. Heat a non-stick pan over medium-high heat and add a tablespoon of butter. Once the butter has melted, pour in the egg mixture. As the eggs set, use a spatula to lift the edges and let the uncooked egg run underneath. When the eggs are cooked through and no visible liquid egg remains, top with your desired fillings and fold the omelette in half before sliding it onto a plate.<|stop|>
```
## ✨ Features
- Fine-tuned Model: Based on the pre-trained EleutherAI/pythia-2.8b-deduped model, fine-tuned on the Dahoas/synthetic-instruct-gptj-pairwise dataset for instruction following.
- Text Generation: Capable of generating high-quality text responses to questions posed in the "Question: ...\nAnswer:" format.
## 📦 Installation
Installation mainly involves setting up the necessary Python libraries. The code example above requires both the transformers library and torch, which you can install with:

```bash
pip install transformers torch
```
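After installing, an optional sanity check (not part of the original instructions) confirms that the libraries import and whether a GPU is visible:

```python
import torch
import transformers

# Print library versions and whether CUDA is available.
print(transformers.__version__, torch.__version__, torch.cuda.is_available())
```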
## 📚 Documentation
### Model Details
#### Training
The model was trained on the Dahoas/synthetic-instruct-gptj-pairwise dataset. We split the original dataset into a train subset (the first 32,000 examples) and a validation subset (the remaining 1,144 examples).

We fine-tuned the model for 4 epochs. Training took 5 hours on 8x A100 80GB GPUs, with batch_size_per_gpu set to 2 (for a global batch size of 16) and a learning rate of 0.00001, decayed linearly to zero at the last training step. You can find a Weights and Biases record here.
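For reference, the split described above can be reproduced with the Hugging Face datasets library; a minimal sketch, assuming the dataset ships as a single train split (the select indices follow the sizes stated here):

```python
from datasets import load_dataset

# Load the full dataset and reproduce the train/validation split described above.
ds = load_dataset("Dahoas/synthetic-instruct-gptj-pairwise", split="train")
train_ds = ds.select(range(32000))          # first 32,000 examples
val_ds = ds.select(range(32000, len(ds)))   # remaining 1,144 examples
print(len(train_ds), len(val_ds))
```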
## 📄 License
This project is licensed under the Apache 2.0 license.