Quantized EleutherAI/gpt-j-6b with 8-bit weights
This is a modified version of EleutherAI's GPT-J with 6 billion parameters, enabling generation and fine-tuning on colab or equivalent desktop GPUs.
Quick Start
Note: this model was superseded by the load_in_8bit=True feature in transformers by Younes Belkada and Tim Dettmers. Please see [this usage example](https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4#scrollTo=W8tQtyjp75O). This legacy model was built for transformers v4.15.0 and pytorch 1.11. Newer versions could work, but are not supported.
Here's how to run it: 
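Below is a minimal sketch following the recommended load_in_8bit path from the note above (it assumes a recent transformers release with bitsandbytes and accelerate installed; loading this legacy checkpoint itself relies on the custom quantization code from the conversion notebook linked under Documentation):

```python
# Minimal sketch of the built-in 8-bit loading path in recent transformers
# (assumes bitsandbytes and accelerate are installed).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    device_map="auto",   # place layers on the available GPU(s)
    load_in_8bit=True,   # store weights in 8-bit at load time
)

prompt = "A quantized 6B-parameter model can"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```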
Features
- Memory-efficient: The [original GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B/tree/main) takes 22+ GB memory for float32 parameters alone, and that's before accounting for gradients & optimizer. Even with 16-bit casting, it won't fit on most single-GPU setups except A6000 and A100. This quantized version can run on a single GPU with ~11 GB memory.
- Quantization techniques:
  - Large weight tensors are quantized using dynamic 8-bit quantization and de-quantized just-in-time for multiplication.
  - Gradient checkpoints are used to store only one activation per layer, reducing memory usage at the cost of 30% slower training.
  - Scalable fine-tuning is achieved with LoRA and 8-bit Adam.
- Negligible quality impact: Technically, 8-bit quantization affects model quality, but the effect is negligible in practice. [This notebook measures wikitext test perplexity](https://nbviewer.org/urls/huggingface.co/hivemind/gpt-j-6B-8bit/raw/main/check_perplexity.ipynb) and shows it's nearly indistinguishable from the original GPT-J.
- Unique computation approach: Our code uses 8-bit only for storage, and all computations are performed in float16 or float32, allowing for a much smaller error (see the sketch after this list).
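As a toy illustration of that storage-only idea (not the exact code from this repo), here is how a weight matrix might be kept as int8 with per-row scales and de-quantized just-in-time for a multiplication:

```python
# Toy illustration (not the exact code from this repo) of storing weights in
# int8 and de-quantizing them just-in-time, so the matmul itself still runs
# in float precision.
import torch

def quantize_rowwise(weight: torch.Tensor):
    """Quantize a float weight matrix to int8 with per-row absmax scales."""
    scale = weight.abs().max(dim=1, keepdim=True).values.clamp(min=1e-8) / 127.0
    q = torch.clamp((weight / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantized_matmul(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor):
    """De-quantize just-in-time and run the multiplication in float."""
    w = q.to(x.dtype) * scale.to(x.dtype)  # materialized only for this call
    return x @ w.t()

weight = torch.randn(4096, 4096)     # a "large weight tensor"
q, scale = quantize_rowwise(weight)  # stored: int8 values + one scale per row
x = torch.randn(2, 4096)
y = dequantized_matmul(x, q, scale)
print(y.shape, q.dtype)              # torch.Size([2, 4096]) torch.int8
```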
Documentation
How should I fine-tune the model?
We recommend starting with the original hyperparameters from the LoRA paper. On top of that, there is one more trick to consider: the overhead from de-quantizing weights does not depend on batch size. As a result, the larger the batch size you can fit, the more efficiently you will train.
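For orientation, here is a minimal sketch of LoRA plus 8-bit Adam fine-tuning using the peft and bitsandbytes libraries; this repo's own notebooks define their own adapter classes on top of the frozen 8-bit weights, so treat the module names and parameter values below as illustrative:

```python
# Minimal sketch of LoRA + 8-bit Adam fine-tuning with peft and bitsandbytes
# (illustrative only; not the adapter code used by this repo's notebooks).
import bitsandbytes as bnb
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", torch_dtype=torch.float16, device_map="auto"
)

lora_config = LoraConfig(
    r=8,                                  # low-rank dimension from the LoRA paper
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # GPT-J attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the LoRA adapters are trainable

# 8-bit Adam keeps optimizer state small; since de-quantization overhead does
# not grow with batch size, use the largest batch size that fits in memory.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=2e-4)
```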
Where can I train for free?
You can train fine in colab, but if you get a K80, it's probably best to switch to other free GPU providers: [kaggle](https://towardsdatascience.com/amazon-sagemaker-studio-lab-a-great-alternative-to-google-colab-7194de6ef69a), [aws sagemaker](https://towardsdatascience.com/amazon-sagemaker-studio-lab-a-great-alternative-to-google-colab-7194de6ef69a) or [paperspace](https://docs.paperspace.com/gradient/more/instance-types/free-instances). For instance, this is the same notebook [running in kaggle](https://www.kaggle.com/justheuristic/dmazur-converted) using a more powerful P100 instance.
Can I use this technique with other models?
The model was converted using [this notebook](https://nbviewer.org/urls/huggingface.co/hivemind/gpt-j-6B-8bit/raw/main/convert-gpt-j.ipynb). It can be adapted to work with other model types. However, please bear in mind that some models replace Linear and Embedding with custom alternatives that require their own BNBWhateverWithAdapters.
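For intuition, here is a hypothetical sketch of what such a replacement might look like: a Linear layer whose frozen weight lives in 8-bit storage plus a small trainable adapter. The class and attribute names here are illustrative, not the ones used in the conversion notebook:

```python
# Hypothetical sketch of a Linear replacement with frozen 8-bit weights and a
# trainable low-rank adapter (names are illustrative, not from this repo).
import torch
import torch.nn as nn

class FrozenLinear8bitWithAdapter(nn.Module):
    def __init__(self, weight: torch.Tensor, bias: torch.Tensor, adapter_dim: int = 8):
        super().__init__()
        # Store the frozen weight as int8 with per-row absmax scales.
        scale = weight.abs().max(dim=1, keepdim=True).values.clamp(min=1e-8) / 127.0
        self.register_buffer("qweight", (weight / scale).round().to(torch.int8))
        self.register_buffer("scale", scale)
        self.register_buffer("bias", bias)
        out_features, in_features = weight.shape
        # Trainable low-rank adapter, initialized so it starts as a no-op.
        self.adapter_down = nn.Linear(in_features, adapter_dim, bias=False)
        self.adapter_up = nn.Linear(adapter_dim, out_features, bias=False)
        nn.init.zeros_(self.adapter_up.weight)

    def forward(self, x):
        # De-quantize just-in-time; the matmul itself runs in float precision.
        w = self.qweight.to(x.dtype) * self.scale.to(x.dtype)
        base = nn.functional.linear(x, w, self.bias.to(x.dtype))
        return base + self.adapter_up(self.adapter_down(x))
```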
License
This project is licensed under the Apache-2.0 license.
| Property | Details |
|----------|---------|
| Model Type | Quantized EleutherAI/gpt-j-6b with 8-bit weights |
| Training Data | The Pile |
| Tags | pytorch, causal-lm |