Pythia is a suite of causal language models developed by EleutherAI specifically for interpretability research. It spans 8 model sizes ranging from 70 million to 12 billion parameters and provides 154 training checkpoints per model.
Pythia-410M is a Transformer-based English language model built on the GPT-NeoX architecture and trained on the Pile dataset. It is primarily intended for studying the behavior and functionality of large language models.
Model Features
Complete Training Checkpoints
Provides 154 intermediate training checkpoints to facilitate the study of model evolution.
Scientific Experimental Design
All model sizes are trained on the same data, seen in the same order, to ensure experimental comparability.
Performance Benchmarking
Achieves or surpasses the performance of similar-scale models (e.g., OPT, GPT-Neo).
Deduplication Comparison
Each model size offers two versions: one trained on original data and another on globally deduplicated data.
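Both variants of each size are published as separate repositories on the Hugging Face Hub (for example, EleutherAI/pythia-410m and EleutherAI/pythia-410m-deduped). A minimal sketch of loading the two variants side by side for such a comparison, assuming the transformers library is installed:

```python
from transformers import GPTNeoXForCausalLM

# 410M model trained on the original (non-deduplicated) Pile
model_standard = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-410m")

# 410M model trained on the globally deduplicated Pile
model_deduped = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-410m-deduped")
```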
Model Capabilities
English Text Generation
Language Model Behavior Research
Model Interpretability Analysis
Use Cases
Academic Research
Language Model Behavior Analysis
Study how model parameters change across different training stages.
Track the development of model capabilities through the 154 checkpoints (a minimal sketch appears at the end of this section).
Deduplicated Data Impact Study
Compare performance differences between models trained on original and deduplicated data.
Technical Validation
Medium-Scale Model Benchmarking
Serves as a reference model at the roughly 400M-parameter scale for technical comparisons.
Matches or exceeds the performance of comparable models such as OPT-350M.
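The checkpoint-tracking use case can be sketched as follows: load a few revisions of the same model and compare their language-modeling loss on a fixed piece of text. The model ID and revision names are real branches on the Hugging Face Hub; the particular steps and prompt are illustrative choices, not a prescribed protocol.

```python
import torch
from transformers import AutoTokenizer, GPTNeoXForCausalLM

model_id = "EleutherAI/pythia-410m-deduped"
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("The capital of France is Paris.", return_tensors="pt")

# Compare language-modeling loss at a few points during training
for step in ["step1000", "step16000", "step143000"]:
    model = GPTNeoXForCausalLM.from_pretrained(model_id, revision=step)
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    print(f"{step}: loss = {loss.item():.3f}")
```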
🚀 Pythia-410M
The Pythia Scaling Suite is a collection of models developed to facilitate interpretability research (see paper). It contains two sets of eight models of sizes 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. The suite is designed to promote scientific research on large language models, especially interpretability research, and its models match or exceed the performance of similar models of the same size.
🚀 Quick Start
Pythia models can be loaded and used via the following code, demonstrated here for the third pythia-70m-deduped checkpoint:
```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m-deduped",
    revision="step3000",
    cache_dir="./pythia-70m-deduped/step3000",
)

tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-70m-deduped",
    revision="step3000",
    cache_dir="./pythia-70m-deduped/step3000",
)

inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs)
print(tokenizer.decode(tokens[0]))
```
Revision/branch step143000 corresponds exactly to the model checkpoint on the main branch of each model. For more information on how to use all Pythia models, see documentation on GitHub.
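Omitting the revision argument loads this final, main-branch checkpoint, so the following is equivalent to passing revision="step143000":

```python
from transformers import GPTNeoXForCausalLM

# The main branch holds the final checkpoint, identical to revision="step143000"
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-70m-deduped")
```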
✨ Features
Research-Oriented: The Pythia suite is developed to facilitate interpretability research on large language models.
Multiple Sizes and Checkpoints: It contains models of various sizes (70M to 12B) and provides 154 intermediate checkpoints per model, hosted on Hugging Face as branches.
Performance: The models match or exceed the performance of similarly sized models such as those in the OPT and GPT-Neo suites.
📦 Installation
Pythia models are loaded through the Hugging Face transformers library (pip install transformers); no installation specific to Pythia is required.
💻 Usage Examples
Basic Usage
The basic usage is identical to the Quick Start example above: load GPTNeoXForCausalLM and AutoTokenizer for a chosen checkpoint, tokenize a prompt, and call model.generate.
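Beyond greedy decoding, the standard transformers generation parameters can be used with any Pythia checkpoint. The sampling settings below are illustrative choices rather than recommendations from the Pythia authors:

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-410m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-410m")

inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(
    **inputs,
    max_new_tokens=50,                    # length of the continuation
    do_sample=True,                       # sample instead of greedy decoding
    temperature=0.8,                      # illustrative sampling settings
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,  # silences the missing-pad-token warning
)
print(tokenizer.decode(tokens[0]))
```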
To ask questions about this model, join the EleutherAI Discord and post them in #release-discussion. Please read the existing Pythia documentation before asking about it in the EleutherAI Discord. For general correspondence: contact@eleuther.ai.
| Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate | Equivalent Models |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 70M | 18,915,328 | 6 | 512 | 8 | 2M | 1.0 × 10⁻³ | — |
| 160M | 85,056,000 | 12 | 768 | 12 | 2M | 6.0 × 10⁻⁴ | GPT-Neo 125M, OPT-125M |
| 410M | 302,311,424 | 24 | 1024 | 16 | 2M | 3.0 × 10⁻⁴ | OPT-350M |
| 1.0B | 805,736,448 | 16 | 2048 | 8 | 2M | 3.0 × 10⁻⁴ | — |
| 1.4B | 1,208,602,624 | 24 | 2048 | 16 | 2M | 2.0 × 10⁻⁴ | GPT-Neo 1.3B, OPT-1.3B |
| 2.8B | 2,517,652,480 | 32 | 2560 | 32 | 2M | 1.6 × 10⁻⁴ | GPT-Neo 2.7B, OPT-2.7B |
| 6.9B | 6,444,163,072 | 32 | 4096 | 32 | 2M | 1.2 × 10⁻⁴ | OPT-6.7B |
| 12B | 11,327,027,200 | 36 | 5120 | 40 | 2M | 1.2 × 10⁻⁴ | — |

Engineering details for the Pythia Suite. Deduped and non-deduped models of a given size have the same hyperparameters. "Equivalent" models have exactly the same architecture and the same number of non-embedding parameters.
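As a back-of-the-envelope check, the non-embedding parameter counts in the table can be reproduced from each model's layer count and model dimension: in the GPT-NeoX architecture, every layer contributes 12·d² weights (the QKV projection, the attention output projection, and the two 4×-expansion MLP projections) plus 13·d bias and LayerNorm parameters, and a final LayerNorm adds 2·d. This decomposition is an interpretation of the architecture offered here for illustration, not something stated in the table:

```python
# Reproduce the non-embedding parameter counts from layer count and model dimension.
# Per layer: 12*d*d weights (QKV, attention output, 4x MLP up/down projections)
#            plus 13*d biases and LayerNorm parameters; a final LayerNorm adds 2*d.
def non_embedding_params(layers: int, dim: int) -> int:
    return layers * (12 * dim * dim + 13 * dim) + 2 * dim

assert non_embedding_params(6, 512) == 18_915_328        # Pythia-70M
assert non_embedding_params(24, 1024) == 302_311_424     # Pythia-410M
assert non_embedding_params(36, 5120) == 11_327_027_200  # Pythia-12B
```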
Uses and Limitations
Intended Use
The primary intended use of Pythia is research on the behavior, functionality, and limitations of large language models. You may also further fine-tune and adapt Pythia-410M for deployment, as long as your use is in accordance with the Apache 2.0 license.
Out-of-scope use
The Pythia Suite is not intended for deployment. It is English-language only and not suitable for translation or generating text in other languages. It has also not been fine-tuned for common downstream contexts.
Limitations and biases
Never rely on Pythia-410M to produce factually accurate output. This model was trained on the Pile, which may contain offensive text.
Training
Training data
The Pile is an 825 GiB general-purpose dataset in English. It was not deduplicated before being used to train Pythia-410M.
Training procedure
All models were trained on the exact same data, in the exact same order. Each model saw 299,892,736,000 tokens during training. See GitHub for more details.
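This token count is consistent with the batch size and checkpoint schedule reported elsewhere in this card: a 2M-token batch (2 × 2²⁰ tokens) for the 143,000 steps up to the main-branch checkpoint.

```python
# 2M tokens per batch (2 * 2**20) times 143,000 steps (the main-branch checkpoint)
assert 143_000 * 2 * 2**20 == 299_892_736_000
```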
Evaluations
All 16 Pythia models were evaluated using the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access the results by model and step at results/json/* in the GitHub repository.
Changelog
This section summarizes the differences between the previously released Pythia v0 suite and the current models.
All model sizes are now trained with a uniform batch size of 2M tokens.
Added checkpoints at initialization (step 0) and at steps {1, 2, 4, 8, 16, 32, 64, 128, 256, 512}, in addition to a checkpoint every 1,000 training steps (see the sketch below).
Flash Attention was used in the new retrained suite.
Rectified a minor inconsistency in the original suite regarding the learning rate schedule.
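Taken together, this schedule yields the 154 checkpoints per model mentioned above: step 0, ten log-spaced early steps, and 143 evenly spaced steps, each hosted as a branch whose name can be passed as the revision argument. A minimal sketch of enumerating the branch names:

```python
# Branch names for the 154 checkpoints of each Pythia model:
# step0, log-spaced early steps, then every 1,000 steps up to step143000.
revisions = (
    ["step0"]
    + [f"step{2**i}" for i in range(10)]          # step1 ... step512
    + [f"step{i * 1000}" for i in range(1, 144)]  # step1000 ... step143000
)
assert len(revisions) == 154
```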
Naming convention and parameter count
Pythia models were renamed in January 2023. The current naming convention (70M, 160M, etc.) is based on total parameter count.
| Current Pythia suffix | Old suffix | Total params | Non-embedding params |
| --- | --- | --- | --- |
| 70M | 19M | 70,426,624 | 18,915,328 |
🔧 Technical Details
The Pythia models are designed with specific hyperparameters and training procedures to ensure consistent and comparable results across different model sizes. All models in the suite are trained on the same data in the same order, allowing for controlled experiments in interpretability research. The use of specific checkpoints and a uniform batch size during training also contributes to the reproducibility and reliability of the models.
📄 License
This project is licensed under the Apache 2.0 license.