🚀 Storytime 13B - GPTQ
This repository contains GPTQ model files for Charles Goddard's Storytime 13B, offering multiple quantisation parameter options to suit different hardware and requirements.

TheBloke's LLM work is generously supported by a grant from Andreessen Horowitz (a16z).
✨ Features
- Multiple Quantisation Options: Different GPTQ parameter permutations are provided to meet various hardware and performance needs.
- Diverse Repositories: Available in AWQ, GPTQ, GGUF formats, as well as the original unquantised fp16 model.
- Alpaca Prompt Template: Uses the Alpaca prompt template for easy interaction.
📦 Installation
In text-generation-webui
To download from the main branch, enter TheBloke/storytime-13B-GPTQ in the "Download model" box.
To download from another branch, add :branchname to the end of the download name, e.g., TheBloke/storytime-13B-GPTQ:gptq-4-32g-actorder_True
From the command line
I recommend using the huggingface-hub Python library:
pip3 install huggingface-hub
To download the main branch to a folder called storytime-13B-GPTQ:
mkdir storytime-13B-GPTQ
huggingface-cli download TheBloke/storytime-13B-GPTQ --local-dir storytime-13B-GPTQ --local-dir-use-symlinks False
To download from a different branch, add the --revision parameter:
mkdir storytime-13B-GPTQ
huggingface-cli download TheBloke/storytime-13B-GPTQ --revision gptq-4-32g-actorder_True --local-dir storytime-13B-GPTQ --local-dir-use-symlinks False
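If you prefer to script the download rather than use the CLI, the same huggingface-hub library exposes snapshot_download, which takes the same revision and local-directory options. A minimal sketch (the folder name is just an illustration):
from huggingface_hub import snapshot_download

# Download the gptq-4-32g-actorder_True branch into a local folder
snapshot_download(
    repo_id="TheBloke/storytime-13B-GPTQ",
    revision="gptq-4-32g-actorder_True",
    local_dir="storytime-13B-GPTQ",
    local_dir_use_symlinks=False,
)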
With git (not recommended)
To clone a specific branch with git, use a command like this:
git clone --single-branch --branch gptq-4-32g-actorder_True https://huggingface.co/TheBloke/storytime-13B-GPTQ
💻 Usage Examples
How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
- Click the Model tab.
- Under Download custom model or LoRA, enter TheBloke/storytime-13B-GPTQ.
- To download from a specific branch, enter for example TheBloke/storytime-13B-GPTQ:gptq-4-32g-actorder_True - see Provided Files below for the list of branches for each option.
- Click Download.
- The model will start downloading. Once it's finished it will say "Done".
- In the top left, click the refresh icon next to Model.
- In the Model dropdown, choose the model you just downloaded: storytime-13B-GPTQ
- The model will automatically load, and is now ready for use!
- If you want any custom settings, set them and then click Save settings for this model followed by Reload the Model in the top right.
- Note that you do not need to and should not set manual GPTQ parameters any more. These are set automatically from the file quantize_config.json.
- Once you're ready, click the Text Generation tab and enter a prompt to get started!
How to use this GPTQ model from Python code
Install the necessary packages
Requires: Transformers 4.33.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.
pip3 install transformers optimum
pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  # Use cu117 if on CUDA 11.7
If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead:
pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
git checkout v0.4.2
pip3 install .
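Because the wheel you get depends on your CUDA version, it can be worth confirming which AutoGPTQ version pip actually resolved. A quick check:
pip3 show auto-gptq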
You can then use the following code:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/storytime-13B-GPTQ"
# To use a different branch, change revision
# For example: revision="gptq-4-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# Create a text generation pipeline
generate_text = pipeline('text-generation',
                         model=model,
                         tokenizer=tokenizer,
                         device_map="auto")

# Generate text
prompt = "Once upon a time"
output = generate_text(prompt, max_length=200, num_return_sequences=1)
print(output[0]['generated_text'])
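The example above sends a bare prompt. Since this model expects the Alpaca prompt template (see the Prompt template section under Documentation), wrapping your request in that template generally gives better results. A minimal sketch reusing the generate_text pipeline from above (the instruction text is just an example):
# Wrap the request in the model's Alpaca prompt template
instruction = "Write the opening paragraph of a story about a lighthouse keeper."
prompt_template = f'''Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
'''
output = generate_text(prompt_template, max_new_tokens=200, num_return_sequences=1)
print(output[0]['generated_text'])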
📚 Documentation
Model Information
- Model creator: Charles Goddard
- Original model: Storytime 13B
Repositories available
- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference
- Charles Goddard's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions
Prompt template: Alpaca
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{prompt}
### Response:
Provided files, and GPTQ parameters
Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.
Each separate quant is in a different branch. See below for instructions on fetching from different branches.
All recent GPTQ files are made with AutoGPTQ, and all files in non-main branches are made with AutoGPTQ. Files in the main branch which were uploaded before August 2023 were made with GPTQ-for-LLaMa.
Explanation of GPTQ parameters
- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- Act Order: True or False. Also known as desc_act. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The calibration dataset used during quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ calibration dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
- ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama models in 4-bit.
Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
---|---|---|---|---|---|---|---|---|---|
main | 4 | 128 | Yes | 0.1 | wikitext | 4096 | 7.26 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
gptq-4-32g-actorder_True | 4 | 32 | Yes | 0.1 | wikitext | 4096 | 8.00 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. |
gptq-8--1g-actorder_True | 8 | None | Yes | 0.1 | wikitext | 4096 | 13.36 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements. |
gptq-8-128g-actorder_True | 8 | 128 | Yes | 0.1 | wikitext | 4096 | 13.65 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. |
gptq-8-32g-actorder_True | 8 | 32 | Yes | 0.1 | wikitext | 4096 | 14.54 GB | No | 8-bit, with group size 32g and Act Order for maximum inference quality. |
gptq-4-64g-actorder_True | 4 | 64 | Yes | 0.1 | wikitext | 4096 | 7.51 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. |
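If you want to confirm which of these parameters a downloaded branch actually uses, the quantize_config.json shipped with each branch records them. A quick check, assuming the local folder name used in the download examples above:
import json

# Inspect the GPTQ parameters recorded in the downloaded branch
with open("storytime-13B-GPTQ/quantize_config.json") as f:
    config = json.load(f)

# Typical keys include bits, group_size, desc_act and damp_percent
print(config)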
📄 License
The model is under the Llama 2 license.

