🚀 Pythia-2.8B
The Pythia Scaling Suite is a collection of models designed to facilitate interpretability research on large language models.
🚀 Quick Start
The Pythia Scaling Suite consists of two sets of eight models, each with sizes of 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two models: one trained on the Pile, and one trained on the globally deduplicated Pile. All models are trained on the same data in the same order. We also offer 154 intermediate checkpoints per model, hosted on Hugging Face as branches.
You can load and use the Pythia models with the following code, demonstrated here for the third `pythia-70m-deduped` checkpoint:
```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Load the weights saved at training step 3000 (one of the 154 checkpoint branches).
model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m-deduped",
    revision="step3000",
    cache_dir="./pythia-70m-deduped/step3000",
)

# The tokenizer is the same for every checkpoint and model size.
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-70m-deduped",
    revision="step3000",
    cache_dir="./pythia-70m-deduped/step3000",
)

inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs)
tokenizer.decode(tokens[0])
```
Revision/branch `step143000` corresponds exactly to the model checkpoint on the `main` branch of each model. For more information on how to use all Pythia models, see the documentation on GitHub.
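Because every checkpoint is published as a branch of the model repository, you can enumerate them programmatically. The sketch below uses the `huggingface_hub` client and assumes a reasonably recent version of that library:

```python
# List the checkpoint branches (one per saved training step) of a Pythia repository.
from huggingface_hub import list_repo_refs

refs = list_repo_refs("EleutherAI/pythia-70m-deduped")
steps = sorted(
    (branch.name for branch in refs.branches if branch.name.startswith("step")),
    key=lambda name: int(name[len("step"):]),
)
print(len(steps))   # 154 checkpoint branches
print(steps[:5])    # ['step0', 'step1', 'step2', 'step4', 'step8']
```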
✨ Features
- Interpretability Research: The Pythia model suite is designed to promote scientific research on large language models, especially interpretability research.
- Model Variations: It includes models trained on the Pile and the globally deduplicated Pile, with different sizes ranging from 70M to 12B parameters.
- Intermediate Checkpoints: 154 intermediate checkpoints are provided per model, hosted on Hugging Face as branches.
- Performance: Despite not prioritizing downstream performance, the models match or exceed the performance of similar and same-sized models, such as those in the OPT and GPT-Neo suites.
📦 Installation
The Pythia models work with the Hugging Face Transformers library; you can install it via `pip install transformers` to use the models.
💻 Usage Examples
Basic Usage
```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m-deduped",
    revision="step3000",
    cache_dir="./pythia-70m-deduped/step3000",
)
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-70m-deduped",
    revision="step3000",
    cache_dir="./pythia-70m-deduped/step3000",
)

inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs)
tokenizer.decode(tokens[0])
```
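By default, `generate` produces a short greedy continuation. The illustrative settings below (not taken from the Pythia documentation) show how standard `transformers` generation arguments control length and sampling:

```python
# Illustrative generation settings: sample up to 50 new tokens instead of the
# default greedy continuation.
tokens = model.generate(
    **inputs,
    max_new_tokens=50,                    # cap on newly generated tokens
    do_sample=True,                       # enable sampling
    temperature=0.8,                      # soften the next-token distribution
    pad_token_id=tokenizer.eos_token_id,  # the GPT-NeoX tokenizer defines no pad token
)
print(tokenizer.decode(tokens[0]))
```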
📚 Documentation
Model Details
| Property | Details |
|---|---|
| Developed by | EleutherAI |
| Model Type | Transformer-based Language Model |
| Language | English |
| Learn more | Pythia's GitHub repository for training procedure, config files, and usage details. See the paper for more evals and implementation details. |
| Library | GPT-NeoX |
| License | Apache 2.0 |
| Contact | Join the EleutherAI Discord and post questions in #release-discussion. Read the existing Pythia documentation before asking. For general correspondence: contact@eleuther.ai. |
| Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate | Equivalent Models |
|---|---|---|---|---|---|---|---|
| 70M | 18,915,328 | 6 | 512 | 8 | 2M | 1.0 × 10⁻³ | — |
| 160M | 85,056,000 | 12 | 768 | 12 | 2M | 6.0 × 10⁻⁴ | GPT-Neo 125M, OPT-125M |
| 410M | 302,311,424 | 24 | 1024 | 16 | 2M | 3.0 × 10⁻⁴ | OPT-350M |
| 1.0B | 805,736,448 | 16 | 2048 | 8 | 2M | 3.0 × 10⁻⁴ | — |
| 1.4B | 1,208,602,624 | 24 | 2048 | 16 | 2M | 2.0 × 10⁻⁴ | GPT-Neo 1.3B, OPT-1.3B |
| 2.8B | 2,517,652,480 | 32 | 2560 | 32 | 2M | 1.6 × 10⁻⁴ | GPT-Neo 2.7B, OPT-2.7B |
| 6.9B | 6,444,163,072 | 32 | 4096 | 32 | 2M | 1.2 × 10⁻⁴ | OPT-6.7B |
| 12B | 11,327,027,200 | 36 | 5120 | 40 | 2M | 1.2 × 10⁻⁴ | — |
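As a rough sanity check on the table above (an approximation, not a formula from the Pythia paper), the non-embedding counts are closely matched by the standard 12 · layers · d_model² estimate for a Transformer stack: roughly 4·d² for the attention projections plus 8·d² for the 4×-wide MLP per layer, with biases and LayerNorm weights making up the small remainder.

```python
# Approximate non-embedding parameter count of a GPT-style Transformer stack:
# attention projections (Q, K, V, output) ~ 4*d^2 and a 4x-wide MLP ~ 8*d^2 per layer.
# Biases and LayerNorm weights are ignored, hence the slight undercount.
def approx_non_embedding_params(layers: int, d_model: int) -> int:
    return 12 * layers * d_model ** 2

print(approx_non_embedding_params(32, 2560))  # 2,516,582,400 vs. 2,517,652,480 for Pythia-2.8B
print(approx_non_embedding_params(6, 512))    # 18,874,368 vs. 18,915,328 for Pythia-70M
```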
Uses and Limitations
Intended Use
The primary purpose of Pythia is to support research on the behavior, functionality, and limitations of large language models. It provides a controlled environment for scientific experiments. You can also fine-tune and adapt Pythia-2.8B for deployment, as long as it complies with the Apache 2.0 license. If you use pre-trained Pythia-2.8B as a basis for your fine-tuned model, conduct your own risk and bias assessment.
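For reference, here is a minimal causal-language-modeling fine-tuning sketch using the Hugging Face `Trainer`. The corpus file, sequence length, and training hyperparameters are placeholders, not recommendations from the Pythia authors:

```python
# Minimal causal-LM fine-tuning sketch; the corpus file and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          GPTNeoXForCausalLM, Trainer, TrainingArguments)

model_name = "EleutherAI/pythia-2.8b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # the tokenizer ships without a pad token
model = GPTNeoXForCausalLM.from_pretrained(model_name)

# Hypothetical plain-text training corpus.
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="pythia-2.8b-finetuned",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```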
Out-of-scope use
The Pythia Suite is not intended for deployment. It is not a product and cannot be used for human-facing interactions. The model may generate harmful or offensive text. Pythia models are English-only and not suitable for translation or generating text in other languages. Pythia-2.8B has not been fine-tuned for common downstream contexts, so it will not respond like ChatGPT.
Limitations and biases
Do not rely on Pythia-2.8B to produce factually accurate output. The model was trained on the Pile, which contains offensive content. Pythia-2.8B may generate socially unacceptable text. If using the model's output, have a human curate it and inform your audience that the text was generated by Pythia-2.8B.
Training
Training data
The Pile is an 825GiB general-purpose English dataset. It contains texts from 22 diverse sources, divided into five categories. The Pile was not deduplicated before training Pythia-2.8B.
Training procedure
All models were trained on the same data in the same order. Each model saw 299,892,736,000 tokens during training. 143 checkpoints were saved, from `step1000` to `step143000` (the latter being the same as `main`), and additional early checkpoints (`step0` and `step{1,2,4...512}`) are also provided. All Pythia models trained for 143,000 steps at a batch size of 2M tokens. Pythia uses the same tokenizer as [GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
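The total token count follows directly from this schedule, since a 2M-token batch here means 2²¹ = 2,097,152 tokens:

```python
# 143,000 steps x 2**21 (= 2,097,152) tokens per step.
tokens_per_step = 2 ** 21
print(143_000 * tokens_per_step)  # 299892736000, matching the figure above
```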
Evaluations
All 16 Pythia models were evaluated using the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access the results by model and step at `results/json/*` in the GitHub repository. Expand the sections below to see plots of evaluation results for all Pythia and Pythia-deduped models compared with OPT and BLOOM.
- LAMBADA – OpenAI
- Physical Interaction: Question Answering (PIQA)
- WinoGrande
- AI2 Reasoning Challenge—Easy Set
- SciQ
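To score a Pythia checkpoint on the tasks listed above yourself, something like the following works with a recent (v0.4-style) release of the harness. Task names and the Python API have changed across harness versions, so treat this as a sketch rather than the exact setup used for the published results:

```python
# Sketch: evaluate a Pythia checkpoint on the benchmarks above with lm-evaluation-harness.
# Task names follow the v0.4 harness and may differ in other versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-2.8b,revision=step143000",
    tasks=["lambada_openai", "piqa", "winogrande", "arc_easy", "sciq"],
    batch_size=8,
)
print(results["results"])
```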
Changelog
This section lists the differences between the previously released Pythia v0 suite and the current models. Retraining Pythia had no impact on benchmark performance.
- All model sizes are now trained with a uniform batch size of 2M tokens. Previously, some models were trained with a batch size of 4M tokens.
- Additional checkpoints were added at initialization (step 0) and steps {1,2,4,8,16,32,64,128,256,512}.
- Flash Attention was used in the new retrained suite.
- An LR schedule inconsistency was rectified: all models now decay to a minimum LR of 0.1× their maximum LR.
Naming convention and parameter count
Pythia models were renamed in January 2023. The current naming convention (70M, 160M, etc.) is based on total parameter count.
| current Pythia suffix | old suffix | total params | non-embedding params |
|---|---|---|---|
| 70M | 19M | 70,426,624 | 18,915,328 |
| 160M | 125M | 162,322,944 | 85,056,000 |
| 410M | 350M | 405,334,016 | 302,311,424 |
| 1B | 800M | 1,011,781,632 | 805,736,448 |
| 1.4B | 1.3B | 1,414,647,808 | 1,208,602,624 |
| 2.8B | 2.7B | 2,775,208,960 | 2,517,652,480 |
| 6.9B | 6.7B | 6,857,302,016 | 6,444,163,072 |
| 12B | 13B | 11,846,072,320 | 11,327,027,200 |
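For Pythia-2.8B, the gap between the two columns is exactly the two untied embedding matrices (input embedding and output projection): 2,775,208,960 − 2,517,652,480 = 257,556,480 = 2 × 50,304 × 2,560. The 50,304 figure is the padded embedding size implied by the table, slightly larger than the tokenizer's nominal vocabulary, and the padding differs a little for the largest models.

```python
# For Pythia-2.8B, total minus non-embedding parameters equals the two untied
# embedding matrices, each of shape (padded_vocab, d_model).
total, non_embedding, d_model = 2_775_208_960, 2_517_652_480, 2560
embedding_params = total - non_embedding          # 257,556,480
padded_vocab = embedding_params // (2 * d_model)  # 50,304 rows per embedding matrix
print(embedding_params, padded_vocab)
```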
🔧 Technical Details
The Pythia model suite was developed to facilitate interpretability research on large language models. It uses a Transformer-based architecture and is trained on the Pile, a large-scale English dataset. All models are trained on the same data in the same order, with specific hyperparameters for each model size. The training process includes saving 154 intermediate checkpoints per model.
📄 License
This project is licensed under the Apache 2.0 license.