Cerebras-GPT-13B Open-Source Large Language Model - Freely available, demonstrating the simplicity and scalability of training on the Cerebras stack

Cerebras GPT 13B

Developed by Cerebras

Cerebras-GPT 13B is a large language model trained with an open architecture and an open dataset. It belongs to the Cerebras-GPT series, which aims to study the scaling laws of large language models and to demonstrate the simplicity and scalability of training on the Cerebras software and hardware stack.

Large Language Model

Transformers

English | Open Source License: Apache-2.0 | #Wafer-level training #Chinchilla scaling #English generation

Release Time: 3/20/2023

Model Overview

Cerebras-GPT 13B is a large language model based on the Transformer architecture, mainly used for natural language processing tasks such as text generation and understanding. It is trained following the Chinchilla scaling law (20 tokens per model parameter), which makes its training compute-optimal.

Model Features

Rich model family

The Cerebras-GPT family includes models of various scales from 111M to 13B to meet different computational needs.

Follows the Chinchilla scaling law

All models are trained according to the Chinchilla scaling law (20 tokens per model parameter), which is compute-optimal.

Efficient training architecture

Trained on the Andromeda AI supercomputer, using Cerebras' weight streaming technology to scale training efficiently through simple data parallelism.

Model Capabilities

Text generation

Natural language understanding

Zero-shot learning

Five-shot learning

Use Cases

Research

Research on the scaling laws of large language models

Used to study the scaling laws of large language models and to validate compute-optimal training.

Natural language processing

Text generation

Used to generate coherent text content, such as articles and conversations.

🚀 Cerebras-GPT 13B

The Cerebras-GPT family is released to facilitate research on LLM scaling laws using open architectures and datasets. It also demonstrates the simplicity and scalability of training LLMs on the Cerebras software and hardware stack.

Check out our Blog Post and arXiv paper!

🚀 Quick Start

This model can be easily loaded using the AutoModelForCausalLM functionality:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Download (or load from the local cache) the tokenizer and the model weights.
tokenizer = AutoTokenizer.from_pretrained("cerebras/Cerebras-GPT-13B")
model = AutoModelForCausalLM.from_pretrained("cerebras/Cerebras-GPT-13B")

# Prompt for the model to continue.
text = "Generative AI is "

It can also be used with Hugging Face Pipelines:

from transformers import pipeline

# Greedy decoding (do_sample=False) that never repeats a 2-gram.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
generated_text = pipe(text, max_length=50, do_sample=False, no_repeat_ngram_size=2)[0]
print(generated_text['generated_text'])

Or with model.generate():

# Beam search with 5 beams, generating at most 50 new tokens.
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5,
                         max_new_tokens=50, early_stopping=True,
                         no_repeat_ngram_size=2)
# Decode the generated token IDs back into text.
text_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(text_output[0])

✨ Features

The Cerebras-GPT family includes 111M, 256M, 590M, 1.3B, 2.7B, 6.7B, and 13B models.
All models are trained in accordance with Chinchilla scaling laws (20 tokens per model parameter), which is compute-optimal.
These models were trained on the Andromeda AI supercomputer.
Cerebras' weight streaming technology simplifies LLM training.

📚 Documentation

Model Details

Property	Details
Developed by	Cerebras Systems
License	Apache 2.0
Model Type	Transformer-based Language Model
Architecture	GPT-3 style architecture
Data set	The Pile
Tokenizer	Byte Pair Encoding
Vocabulary Size	50257
Sequence Length	2048
Optimizer	AdamW, (β1, β2) = (0.9, 0.95), adam_eps = 1e−8 (1e−9 for larger models)
Positional Encoding	Learned
Language	English
Learn more	Dense Scaling Laws Paper for training procedure, config files, and details on how to use.

Contact: To ask questions about Cerebras-GPT models, join the Cerebras Discord.

This is the standard parameterization version of Cerebras-GPT with 13B parameters.

Related models: Cerebras-GPT Models

Model	Parameters	Layers	d_model	Heads	d_head	d_ffn	LR	BS (seq)	BS (tokens)
Cerebras-GPT	111M	10	768	12	64	3072	6.0E-04	120	246K
Cerebras-GPT	256M	14	1088	17	64	4352	6.0E-04	264	541K
Cerebras-GPT	590M	18	1536	12	128	6144	2.0E-04	264	541K
Cerebras-GPT	1.3B	24	2048	16	128	8192	2.0E-04	528	1.08M
Cerebras-GPT	2.7B	32	2560	32	80	10240	2.0E-04	528	1.08M
Cerebras-GPT	6.7B	32	4096	32	128	16384	1.2E-04	1040	2.13M
Cerebras-GPT	13B	40	5120	40	128	20480	1.2E-04	720 → 1080	1.47M → 2.21M
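As a rough cross-check of the table above, the nominal parameter counts can be approximated from the listed shapes. The sketch below assumes the common estimate for a GPT-style decoder: 12 · layers · d_model² for the attention and feed-forward weights (which holds when d_ffn = 4 · d_model, as in every row), plus token and learned position embeddings, with the output projection assumed to share the token embedding. It ignores biases and layer norms, so it is an approximation and not the authors' exact count. The name estimate_params is illustrative.

```python
VOCAB_SIZE = 50257  # "Vocabulary Size" in the Model Details table
SEQ_LEN = 2048      # "Sequence Length" in the Model Details table

# (label, nominal parameter count, layers, d_model), copied from the table above.
SHAPES = [
    ("111M", 111e6, 10, 768),
    ("256M", 256e6, 14, 1088),
    ("590M", 590e6, 18, 1536),
    ("1.3B", 1.3e9, 24, 2048),
    ("2.7B", 2.7e9, 32, 2560),
    ("6.7B", 6.7e9, 32, 4096),
    ("13B", 13e9, 40, 5120),
]


def estimate_params(layers: int, d_model: int) -> int:
    """Approximate parameter count, ignoring biases and layer norms."""
    # Per layer: 4*d^2 for the Q, K, V, and output projections,
    # plus 8*d^2 for a feed-forward block with d_ffn = 4*d.
    blocks = 12 * layers * d_model ** 2
    # Token embeddings plus learned position embeddings.
    embeddings = (VOCAB_SIZE + SEQ_LEN) * d_model
    return blocks + embeddings


for label, nominal, layers, d_model in SHAPES:
    est = estimate_params(layers, d_model)
    print(f"{label:>5}: ~{est / 1e9:.3f}B parameters (nominal {nominal / 1e9:.3f}B)")
```

Under these assumptions each estimate lands within a few percent of the nominal size in the Parameters column.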

Training data

Cerebras-GPT is trained using the Pile dataset from EleutherAI. See the Pile paper for a more detailed breakdown of data sources and methodology. The Pile was cleaned using the ftfy library to normalize the text, then filtered using scripts provided by Eleuther.

We tokenized the data using byte-pair encoding with the GPT-2 vocabulary. Our tokenized version of the Pile has 371B tokens. We include more details about the training dataset preprocessing in Appendix A.1 of our paper.

Recent works find significant duplicate data present in the Pile. Eleuther’s Pythia applies a deduplication process to reduce replicated data, decreasing the Pile dataset size. Pythia was trained on both the standard dataset and deduplicated dataset to characterize the impact. Our models are trained on the standard Pile without deduplication, which may present an opportunity for further improvement with the deduplicated data set.

Training procedure

We use a GPT-3 style model architecture. All of our layers use full attention, as opposed to the sparse banded attention of GPT-3. The model shapes were selected either to follow an aspect ratio of 80 or to match the shape of GPT-3 models. The learning rate was warmed up for 375M tokens (1500 steps for the 111M and 256M models) and then cosine-decayed by a factor of 10. No dropout was used, and weight decay was set to 0.1. All models were trained with a maximum sequence length (MSL) of 2048.
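The learning rate schedule described above can be sketched as follows. This is a minimal illustration, not the authors' training code: it assumes the warmup is linear (the text does not state its shape) and that "10x" means the cosine phase ends at one tenth of the peak rate. The function name learning_rate is hypothetical; the example values (peak 6.0E-04, 1500 warmup steps, 9037 total steps) are the 111M model's figures from this page.

```python
import math


def learning_rate(step: int, peak_lr: float, warmup_steps: int,
                  total_steps: int, decay_factor: float = 10.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to peak_lr / decay_factor."""
    if step < warmup_steps:
        # Assumed linear ramp from 0 up to the peak rate.
        return peak_lr * step / warmup_steps
    min_lr = peak_lr / decay_factor
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Schedule for the 111M model: peak 6.0E-04, 1500 warmup steps, 9037 steps total.
for step in (0, 750, 1500, 5000, 9037):
    print(f"step {step:>5}: lr = {learning_rate(step, 6.0e-4, 1500, 9037):.2e}")
```

With these settings the rate reaches its peak at step 1500 and ends at one tenth of the peak on the final step.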

All models were trained to the Chinchilla point: 20 tokens per model parameter. The number of steps was chosen based on the optimal batch size (which varied by model) and the fixed sequence length (2048).

Model Params	Sequence Length	Batch Size	Number of Steps	Tokens	Tokens per Parameter	Flops
111M	2048	120	9037	2.22E+09	20	2.6E+18
256M	2048	264	9468	5.12E+09	20	1.3E+19
590M	2048	264	21836	1.18E+10	20	6.1E+19
1.3B	2048	528	24334	2.63E+10	20	2.8E+20
2.7B	2048	528	49041	5.30E+10	20	1.1E+21
6.7B	2048	1040	62522	1.33E+11	20	6.3E+21
13B	2048	720	174335	2.57E+11	20	2.3E+22
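The Tokens column can be reproduced from the other columns as batch size × sequence length × number of steps. The short sketch below redoes that arithmetic for every row and also prints the parameter count implied by 20 tokens per parameter. The names RUNS and tokens_seen are illustrative. For the 13B row it uses the batch size of 720 listed in this table, and that product matches the reported token count.

```python
SEQ_LEN = 2048  # fixed sequence length for all models

# (label, batch size in sequences, number of steps, tokens reported in the table)
RUNS = [
    ("111M", 120, 9037, 2.22e9),
    ("256M", 264, 9468, 5.12e9),
    ("590M", 264, 21836, 1.18e10),
    ("1.3B", 528, 24334, 2.63e10),
    ("2.7B", 528, 49041, 5.30e10),
    ("6.7B", 1040, 62522, 1.33e11),
    ("13B", 720, 174335, 2.57e11),
]


def tokens_seen(batch_size: int, steps: int, seq_len: int = SEQ_LEN) -> int:
    """Total training tokens: sequences per step * tokens per sequence * steps."""
    return batch_size * seq_len * steps


for label, batch_size, steps, reported in RUNS:
    total = tokens_seen(batch_size, steps)
    implied_params = total / 20  # Chinchilla point: 20 tokens per parameter
    print(f"{label:>5}: {total:.3e} tokens (table: {reported:.2e}), "
          f"implied parameters ~{implied_params:.3e}")
```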

Evaluations

We trained models from smallest to largest and fit a power law as we went along. The power law was helpful for extrapolating the validation loss of the next largest model we trained and provided confidence about whether the training run was going well.
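To illustrate the kind of extrapolation described above, the sketch below fits a pure power law, loss = a · FLOPs^b, to the six smaller models and extrapolates to the 13B compute budget. The data points are the training FLOPs and Pile test cross-entropy values from the 0-shot evaluation table on this page. The functional form and the plain least-squares fit in log-log space are assumptions made for illustration; the paper's exact fitting procedure is not given here. The names fit_power_law and predict_loss are hypothetical.

```python
import math

# (training FLOPs, Pile test cross-entropy) for the 111M to 6.7B models.
SMALLER_MODELS = [
    (2.6e18, 2.566),
    (1.3e19, 2.299),
    (6.1e19, 2.184),
    (2.8e20, 1.996),
    (1.1e21, 1.834),
    (6.3e21, 1.704),
]
TARGET_FLOPS = 2.3e22    # 13B training FLOPs
REPORTED_LOSS = 1.575    # 13B Pile test cross-entropy


def fit_power_law(points):
    """Least-squares fit of log10(loss) = intercept + slope * log10(flops)."""
    xs = [math.log10(flops) for flops, _ in points]
    ys = [math.log10(loss) for _, loss in points]
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_loss(flops: float, slope: float, intercept: float) -> float:
    """Evaluate the fitted power law at a given compute budget."""
    return 10 ** (intercept + slope * math.log10(flops))


slope, intercept = fit_power_law(SMALLER_MODELS)
predicted = predict_loss(TARGET_FLOPS, slope, intercept)
print(f"exponent b = {slope:.4f}")
print(f"predicted 13B loss = {predicted:.3f} (reported: {REPORTED_LOSS})")
```

A prediction close to the reported value is the kind of signal that suggests a run is on track; a large gap would be a reason to investigate.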

We performed upstream (pre-training) evaluations of text prediction cross-entropy using the Pile validation and test splits. We performed downstream evaluations of text generation accuracy on standardized tasks using the Eleuther lm-evaluation-harness. Results are compared against many publicly available large language models in Section 3 of the paper.

0-shot Evaluation

Model	Params	Training FLOPs	PILE test xent	Hella-Swag	PIQA	Wino-Grande	Lambada	ARC-e	ARC-c	OpenBookQA	Downstream Average
Cerebras-GPT	111M	2.6E+18	2.566	0.268	0.594	0.488	0.194	0.380	0.166	0.118	0.315
Cerebras-GPT	256M	1.3E+19	2.299	0.274	0.613	0.511	0.293	0.410	0.170	0.158	0.347
Cerebras-GPT	590M	6.1E+19	2.184	0.291	0.627	0.498	0.366	0.464	0.190	0.158	0.370
Cerebras-GPT	1.3B	2.8E+20	1.996	0.325	0.664	0.521	0.462	0.508	0.224	0.166	0.410
Cerebras-GPT	2.7B	1.1E+21	1.834	0.386	0.701	0.559	0.567	0.571	0.246	0.206	0.462
Cerebras-GPT	6.7B	6.3E+21	1.704	0.447	0.739	0.602	0.636	0.643	0.282	0.238	0.512
Cerebras-GPT	13B	2.3E+22	1.575	0.513	0.766	0.646	0.696	0.714	0.367	0.286	0.570
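The Downstream Average column appears to be the unweighted mean of the seven task accuracies in each row. That reading is an inference from the numbers, not something stated on this page, and the sketch below checks it against the table. The names ZERO_SHOT and downstream_average are illustrative.

```python
# 0-shot accuracies copied from the table above, in column order:
# HellaSwag, PIQA, WinoGrande, Lambada, ARC-e, ARC-c, OpenBookQA,
# followed by the reported Downstream Average.
ZERO_SHOT = {
    "111M": ([0.268, 0.594, 0.488, 0.194, 0.380, 0.166, 0.118], 0.315),
    "256M": ([0.274, 0.613, 0.511, 0.293, 0.410, 0.170, 0.158], 0.347),
    "590M": ([0.291, 0.627, 0.498, 0.366, 0.464, 0.190, 0.158], 0.370),
    "1.3B": ([0.325, 0.664, 0.521, 0.462, 0.508, 0.224, 0.166], 0.410),
    "2.7B": ([0.386, 0.701, 0.559, 0.567, 0.571, 0.246, 0.206], 0.462),
    "6.7B": ([0.447, 0.739, 0.602, 0.636, 0.643, 0.282, 0.238], 0.512),
    "13B": ([0.513, 0.766, 0.646, 0.696, 0.714, 0.367, 0.286], 0.570),
}


def downstream_average(scores):
    """Unweighted mean of the per-task accuracies."""
    return sum(scores) / len(scores)


for label, (scores, reported) in ZERO_SHOT.items():
    mean = downstream_average(scores)
    print(f"{label:>5}: mean = {mean:.4f} (table: {reported:.3f})")
```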

5-shot Evaluation

Model	Params	Hella-Swag	PIQA	Wino-Grande	Lambada	ARC-e	ARC-c	OpenBookQA
Cerebras-GPT	111M	0.267	0.588	0.475	0.158	0.356	0.166	0.136
Cerebras-GPT	256M	0.278	0.606	0.522	0.225	0.422	0.183	0.164
Cerebras-GPT	590M	0.291	0.634	0.479	0.281	0.475	0.206	0.152
Cerebras-GPT	1.3B	0.326	0.668	0.536	0.395	0.529	0.241	0.174
Cerebras-GPT	2.7B	0.382	0.697	0.543	0.487	0.590	0.267	0.224
Cerebras-GPT	6.7B	0.444	0.736	0.590	0.591	0.667	0.314	0.270
Cerebras-GPT	13B	0.514	0.768	0.674	0.655	0.743	0.398	0.318

Uses and Limitations

Intended Use

The primary intended use is to further research into large language models. These models can be used as a foundation model for NLP, applications, ethics, and alignment research. Our primary intended users are researchers who are working to improve LLMs and practitioners seeking reference implementations, training setups, hyperparameters, or pre-trained models. We release these models with a fully permissive Apache license for the community to use freely.

You may fine-tune and adapt Cerebras-GPT models for deployment via either Cerebras Model Studio or third-party libraries. Further safety-related testing and mitigations should be applied before using the Cerebras-GPT model family in production downstream applications.

Due to financial and compute budgets, Cerebras-GPT models were only trained and evaluated following the approaches described in the paper.

Out of Scope Use

Cerebras-GPT models are trained on the Pile, which is English-only, and are not suitable for machine translation tasks.

Cerebras-GPT models have not been tuned for human-facing dialog applications like chatbots and will not respond to prompts in a similar way to models that have received instruction tuning or reinforcement learning from human feedback (RLHF) like Flan-T5 or ChatGPT. Cerebras-GPT models can be tuned using those methods.

Risk, Bias, Ethical Considerations

⚠️ Important Note

Data: The Pile dataset has been thoroughly analyzed from various ethical standpoints such as toxicity analysis, gender bias, pejorative content, racially sensitive content etc. Please refer to Pile dataset references.

Human life: The outputs from this model may or may not align with human values. The risk needs to be thoroughly investigated before deploying this model in a production environment where it can directly impact human life.

Risks and harms: There can be distributional bias in the Pile dataset that can manifest in various forms in the downstream model deployment. There are other risks associated with large language models such as amplifying stereotypes, memorizing training data, or revealing private or secure information.

Mitigations: Only mitigations in standard Pile dataset pre-processing were employed when pre-training Cerebras-GPT.

🔧 Technical Details

Model Architecture

The Cerebras-GPT models use a GPT-3 style architecture. All layers use full attention instead of the GPT-3 style sparse banded attention. The model shapes are selected to follow an aspect ratio of 80 or match the shape of GPT-3 models.

Training

Learning Rate: The learning rate was warmed up for 375M tokens (1500 steps for 111M and 256M models) and then decayed by a factor of 10 using cosine decay.
Dropout and Weight Decay: No dropout was used, and weight decay was set to 0.1.
Sequence Length: All models were trained with a maximum sequence length (MSL) of 2048.

Data Preprocessing

The Pile dataset was cleaned using the ftfy library to normalize the text and then filtered using scripts provided by EleutherAI. The data was tokenized using byte-pair encoding with the GPT-2 vocabulary.

📄 License

This project is licensed under the Apache 2.0 license.

Acknowledgements

We are thankful to all Cerebras engineers, past and present, who made this work possible.
