🚀 Pythia-2.8B
The Pythia Scaling Suite is a collection of models designed to facilitate interpretability research on large language models.
🚀 Quick Start
The Pythia Scaling Suite consists of two sets of eight models, each with sizes of 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two models: one trained on the Pile, and one trained on the globally deduplicated Pile. All models are trained on the same data in the same order. We also offer 154 intermediate checkpoints per model, hosted on Hugging Face as branches.
You can load and use the Pythia models with the following code, demonstrated here for the third `pythia-70m-deduped` checkpoint:
```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Load the weights saved at training step 3000 (one of the 154 checkpoint branches).
model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m-deduped",
    revision="step3000",
    cache_dir="./pythia-70m-deduped/step3000",
)

# The tokenizer is the same for every checkpoint and model size.
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-70m-deduped",
    revision="step3000",
    cache_dir="./pythia-70m-deduped/step3000",
)

inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs)
tokenizer.decode(tokens[0])
```
Revision/branch `step143000` corresponds exactly to the model checkpoint on the `main` branch of each model. For more information on how to use all Pythia models, see the documentation on GitHub.
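Because every checkpoint is published as a branch of the model repository, you can enumerate them programmatically. The sketch below uses the `huggingface_hub` client and assumes a reasonably recent version of that library:

```python
# List the checkpoint branches (one per saved training step) of a Pythia repository.
from huggingface_hub import list_repo_refs

refs = list_repo_refs("EleutherAI/pythia-70m-deduped")
steps = sorted(
    (branch.name for branch in refs.branches if branch.name.startswith("step")),
    key=lambda name: int(name[len("step"):]),
)
print(len(steps))   # 154 checkpoint branches
print(steps[:5])    # ['step0', 'step1', 'step2', 'step4', 'step8']
```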
✨ Features
- Interpretability Research: The Pythia model suite is designed to promote scientific research on large language models, especially interpretability research.
- Model Variations: It includes models trained on the Pile and the globally deduplicated Pile, with different sizes ranging from 70M to 12B parameters.
- Intermediate Checkpoints: 154 intermediate checkpoints are provided per model, hosted on Hugging Face as branches.
- Performance: Despite not prioritizing downstream performance, the models match or exceed the performance of similar and same-sized models, such as those in the OPT and GPT-Neo suites.
📦 Installation
The Pythia models work with the Hugging Face Transformers library; you can install it via `pip install transformers` to use the models.
💻 Usage Examples
Basic Usage
```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m-deduped",
    revision="step3000",
    cache_dir="./pythia-70m-deduped/step3000",
)
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-70m-deduped",
    revision="step3000",
    cache_dir="./pythia-70m-deduped/step3000",
)

inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs)
tokenizer.decode(tokens[0])
```
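By default, `generate` produces a short greedy continuation. The illustrative settings below (not taken from the Pythia documentation) show how standard `transformers` generation arguments control length and sampling:

```python
# Illustrative generation settings: sample up to 50 new tokens instead of the
# default greedy continuation.
tokens = model.generate(
    **inputs,
    max_new_tokens=50,                    # cap on newly generated tokens
    do_sample=True,                       # enable sampling
    temperature=0.8,                      # soften the next-token distribution
    pad_token_id=tokenizer.eos_token_id,  # the GPT-NeoX tokenizer defines no pad token
)
print(tokenizer.decode(tokens[0]))
```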
📚 Documentation
Model Details
| Property | Details |
|---|---|
| Developed by | EleutherAI |
| Model Type | Transformer-based Language Model |
| Language | English |
| Learn more | Pythia's GitHub repository for training procedure, config files, and usage details. See the paper for more evals and implementation details. |
| Library | GPT-NeoX |
| License | Apache 2.0 |
| Contact | Join the EleutherAI Discord and post questions in #release-discussion. Read the existing Pythia documentation before asking. For general correspondence: contact@eleuther.ai. |
| Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate | Equivalent Models |
|---|---|---|---|---|---|---|---|
| 70M | 18,915,328 | 6 | 512 | 8 | 2M | 1.0 × 10⁻³ | — |
| 160M | 85,056,000 | 12 | 768 | 12 | 2M | 6.0 × 10⁻⁴ | GPT-Neo 125M, OPT-125M |
| 410M | 302,311,424 | 24 | 1024 | 16 | 2M | 3.0 × 10⁻⁴ | OPT-350M |
| 1.0B | 805,736,448 | 16 | 2048 | 8 | 2M | 3.0 × 10⁻⁴ | — |
| 1.4B | 1,208,602,624 | 24 | 2048 | 16 | 2M | 2.0 × 10⁻⁴ | GPT-Neo 1.3B, OPT-1.3B |
| 2.8B | 2,517,652,480 | 32 | 2560 | 32 | 2M | 1.6 × 10⁻⁴ | GPT-Neo 2.7B, OPT-2.7B |
| 6.9B | 6,444,163,072 | 32 | 4096 | 32 | 2M | 1.2 × 10⁻⁴ | OPT-6.7B |
| 12B | 11,327,027,200 | 36 | 5120 | 40 | 2M | 1.2 × 10⁻⁴ | — |
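As a rough sanity check on the table above (an approximation, not a formula from the Pythia paper), the non-embedding counts are closely matched by the standard 12 · layers · d_model² estimate for a Transformer stack: roughly 4·d² for the attention projections plus 8·d² for the 4×-wide MLP per layer, with biases and LayerNorm weights making up the small remainder.

```python
# Approximate non-embedding parameter count of a GPT-style Transformer stack:
# attention projections (Q, K, V, output) ~ 4*d^2 and a 4x-wide MLP ~ 8*d^2 per layer.
# Biases and LayerNorm weights are ignored, hence the slight undercount.
def approx_non_embedding_params(layers: int, d_model: int) -> int:
    return 12 * layers * d_model ** 2

print(approx_non_embedding_params(32, 2560))  # 2,516,582,400 vs. 2,517,652,480 for Pythia-2.8B
print(approx_non_embedding_params(6, 512))    # 18,874,368 vs. 18,915,328 for Pythia-70M
```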
Uses and Limitations
Intended Use
The primary purpose of Pythia is to support research on the behavior, functionality, and limitations of large language models. It provides a controlled environment for scientific experiments. You can also fine-tune and adapt Pythia-2.8B for deployment, as long as it complies with the Apache 2.0 license. If you use pre-trained Pythia-2.8B as a basis for your fine-tuned model, conduct your own risk and bias assessment.
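For reference, here is a minimal causal-language-modeling fine-tuning sketch using the Hugging Face `Trainer`. The corpus file, sequence length, and training hyperparameters are placeholders, not recommendations from the Pythia authors:

```python
# Minimal causal-LM fine-tuning sketch; the corpus file and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          GPTNeoXForCausalLM, Trainer, TrainingArguments)

model_name = "EleutherAI/pythia-2.8b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # the tokenizer ships without a pad token
model = GPTNeoXForCausalLM.from_pretrained(model_name)

# Hypothetical plain-text training corpus.
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="pythia-2.8b-finetuned",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```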
Out-of-scope use
The Pythia Suite is not intended for deployment. It is not a product and cannot be used for human-facing interactions. The model may generate harmful or offensive text. Pythia models are English-only and not suitable for translation or generating text in other languages. Pythia-2.8B has not been fine-tuned for common downstream contexts, so it will not respond like ChatGPT.
Limitations and biases
Do not rely on Pythia-2.8B to produce factually accurate output. The model was trained on the Pile, which contains offensive content. Pythia-2.8B may generate socially unacceptable text. If using the model's output, have a human curate it and inform your audience that the text was generated by Pythia-2.8B.
Training
Training data
The Pile is an 825GiB general-purpose English dataset. It contains texts from 22 diverse sources, divided into five categories. The Pile was not deduplicated before training Pythia-2.8B.
Training procedure
All models were trained on the same data in the same order. Each model saw 299,892,736,000 tokens during training. 143 checkpoints were saved, from `step1000` to `step143000` (the latter being the same as `main`), and additional early checkpoints (`step0` and `step{1,2,4...512}`) are also provided. All Pythia models trained for 143,000 steps at a batch size of 2M tokens. Pythia uses the same tokenizer as [GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
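The total token count follows directly from this schedule, since a 2M-token batch here means 2²¹ = 2,097,152 tokens:

```python
# 143,000 steps x 2**21 (= 2,097,152) tokens per step.
tokens_per_step = 2 ** 21
print(143_000 * tokens_per_step)  # 299892736000, matching the figure above
```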
Evaluations
All 16 Pythia models were evaluated using the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access the results by model and step at `results/json/*` in the GitHub repository. Expand the sections below to see plots of evaluation results for all Pythia and Pythia-deduped models compared with OPT and BLOOM.
- LAMBADA – OpenAI
- Physical Interaction: Question Answering (PIQA)
- WinoGrande
- AI2 Reasoning Challenge—Easy Set
- SciQ
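To score a Pythia checkpoint on the tasks listed above yourself, something like the following works with a recent (v0.4-style) release of the harness. Task names and the Python API have changed across harness versions, so treat this as a sketch rather than the exact setup used for the published results:

```python
# Sketch: evaluate a Pythia checkpoint on the benchmarks above with lm-evaluation-harness.
# Task names follow the v0.4 harness and may differ in other versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-2.8b,revision=step143000",
    tasks=["lambada_openai", "piqa", "winogrande", "arc_easy", "sciq"],
    batch_size=8,
)
print(results["results"])
```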
Changelog
This section lists the differences between the previously released Pythia v0 suite and the current models. Retraining Pythia had no impact on benchmark performance.
- All model sizes are now trained with a uniform batch size of 2M tokens. Previously, some models were trained with a batch size of 4M tokens.
- Additional checkpoints were added at initialization (step 0) and steps {1,2,4,8,16,32,64,128,256,512}.
- Flash Attention was used in the new retrained suite.
- An LR schedule inconsistency was rectified: all models now decay to a minimum LR of 0.1× their maximum LR.
Naming convention and parameter count
Pythia models were renamed in January 2023. The current naming convention (70M, 160M, etc.) is based on total parameter count.
| current Pythia suffix | old suffix | total params | non-embedding params |
|---|---|---|---|
| 70M | 19M | 70,426,624 | 18,915,328 |
| 160M | 125M | 162,322,944 | 85,056,000 |
| 410M | 350M | 405,334,016 | 302,311,424 |
| 1B | 800M | 1,011,781,632 | 805,736,448 |
| 1.4B | 1.3B | 1,414,647,808 | 1,208,602,624 |
| 2.8B | 2.7B | 2,775,208,960 | 2,517,652,480 |
| 6.9B | 6.7B | 6,857,302,016 | 6,444,163,072 |
| 12B | 13B | 11,846,072,320 | 11,327,027,200 |
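For Pythia-2.8B, the gap between the two columns is exactly the two untied embedding matrices (input embedding and output projection): 2,775,208,960 − 2,517,652,480 = 257,556,480 = 2 × 50,304 × 2,560. The 50,304 figure is the padded embedding size implied by the table, slightly larger than the tokenizer's nominal vocabulary, and the padding differs a little for the largest models.

```python
# For Pythia-2.8B, total minus non-embedding parameters equals the two untied
# embedding matrices, each of shape (padded_vocab, d_model).
total, non_embedding, d_model = 2_775_208_960, 2_517_652_480, 2560
embedding_params = total - non_embedding          # 257,556,480
padded_vocab = embedding_params // (2 * d_model)  # 50,304 rows per embedding matrix
print(embedding_params, padded_vocab)
```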
🔧 Technical Details
The Pythia model suite was developed to facilitate interpretability research on large language models. It uses a Transformer-based architecture and is trained on the Pile, a large-scale English dataset. All models are trained on the same data in the same order, with specific hyperparameters for each model size. The training process includes saving 154 intermediate checkpoints per model.
📄 License
This project is licensed under the Apache 2.0 license.