Yarn Mistral 7B 128K - AWQ
This repository provides AWQ model files for NousResearch's Yarn Mistral 7B 128K, enabling efficient and accurate low-bit weight quantization for various inference scenarios.
Quick Start
This README offers detailed guidance on downloading, installing, and using the AWQ model of Yarn Mistral 7B 128K in different environments. Whether you're using text-generation-webui, vLLM, Hugging Face Text Generation Inference (TGI), or Python code, you can find the corresponding steps here.
Features
- AWQ Quantization: AWQ is an efficient, accurate, and fast low-bit weight quantization method, currently supporting 4-bit quantization. It provides faster inference for Transformer-based models compared to GPTQ, with equivalent or better quality.
- Multiple Inference Environments: Supported by text-generation-webui, vLLM, Hugging Face Text Generation Inference (TGI), and AutoAWQ, offering flexibility for different usage scenarios.
Installation
Install via text-generation-webui
- Ensure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui). It's recommended to use the one-click installers.
- Click the Model tab.
- Under Download custom model or LoRA, enter `TheBloke/Yarn-Mistral-7B-128k-AWQ`.
- Click Download.
- Wait for the download to complete (it will show "Done").
- In the top left, click the refresh icon next to Model.
- In the Model dropdown, choose `Yarn-Mistral-7B-128k-AWQ`.
- Select Loader: AutoAWQ.
- Click Load.
- Optionally, set custom settings, click Save settings for this model, and then Reload the Model.
Install AutoAWQ for Python Inference
Requires [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) 0.1.1 or later.
```shell
pip3 install autoawq
```
If installation with pre-built wheels fails, install from source:
```shell
pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .
```
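To confirm that a suitable AutoAWQ build is installed (0.1.1 or later, as noted above), you can check the installed package version. A small sketch using only the standard library:

```python
# Print the installed AutoAWQ version; it should be 0.1.1 or later.
from importlib.metadata import version

print(version("autoawq"))
```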
Usage Examples
Use in text-generation-webui
After installation, click the Text Generation tab and enter a prompt to start generating text.
Use with vLLM
As a Server
```shell
python3 -m vllm.entrypoints.api_server --model TheBloke/Yarn-Mistral-7B-128k-AWQ --quantization awq
```
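Once the server is up you can send it plain HTTP requests. The snippet below is a minimal sketch, assuming the server's default port (8000) and the /generate endpoint exposed by vllm.entrypoints.api_server; the field names such as max_tokens follow vLLM's sampling parameters. Adjust the host and port if you start the server with different options.

```python
# Query the vLLM API server started above (assumed to be on localhost:8000).
import requests  # pip3 install requests

payload = {
    "prompt": "Tell me about AI",
    "max_tokens": 128,
    "temperature": 0.8,
    "top_p": 0.95,
}
response = requests.post("http://localhost:8000/generate", json=payload)
print(response.json()["text"])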
From Python Code
```python
from vllm import LLM, SamplingParams

prompts = [
    "Tell me about AI",
    "Write a story about llamas",
    "What is 291 - 150?",
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
]
prompt_template = '''{prompt}
'''

prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TheBloke/Yarn-Mistral-7B-128k-AWQ", quantization="awq", dtype="auto")

outputs = llm.generate(prompts, sampling_params)

# Print the prompt and generated text for each output
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
Use with Hugging Face Text Generation Inference (TGI)
Docker Example
Example Docker parameters:
```shell
--model-id TheBloke/Yarn-Mistral-7B-128k-AWQ --port 3000 --quantize awq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
```
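These parameters are passed to the TGI container on top of the usual docker run arguments. A full invocation might look like the sketch below; it assumes the official ghcr.io/huggingface/text-generation-inference image (pick a current tag) and maps the container's --port 3000 to the host, so adjust the image tag, volume path, and ports for your setup.

```shell
# Sketch of a full TGI invocation; image tag, volume, and ports are illustrative.
docker run --gpus all --shm-size 1g -p 3000:3000 -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:1.1.0 \
    --model-id TheBloke/Yarn-Mistral-7B-128k-AWQ --port 3000 --quantize awq \
    --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
```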
Python Example
```python
from huggingface_hub import InferenceClient  # pip3 install huggingface_hub

endpoint_url = "https://your-endpoint-url-here"

prompt = "Tell me about AI"
prompt_template = f'''{prompt}
'''

client = InferenceClient(endpoint_url)
response = client.text_generation(prompt_template,
                                  max_new_tokens=128,
                                  do_sample=True,
                                  temperature=0.7,
                                  top_p=0.95,
                                  top_k=40,
                                  repetition_penalty=1.1)

print("Model output: ", response)
```
Use from Python Code with AutoAWQ
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "TheBloke/Yarn-Mistral-7B-128k-AWQ"

# Load the tokenizer and the quantized model
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True,
                                          trust_remote_code=True, safetensors=True)

prompt = "Tell me about AI"
prompt_template = f'''{prompt}
'''

print("*** Running model.generate:")

token_input = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    token_input,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512
)

# Decode and print the output tokens
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("LLM output: ", text_output)
```
Documentation
Prompt Template
```
{prompt}
```
Provided Files and AWQ Parameters
For the first release of AWQ models, only 128g models are released. 32g models may be added in the future.
| Branch | Bits | GS | AWQ Dataset | Seq Len | Size |
| ------ | ---- | -- | ----------- | ------- | ---- |
| [main](https://huggingface.co/TheBloke/Yarn-Mistral-7B-128k-AWQ/tree/main) | 4 | 128 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 4.15 GB |
Repositories Available
- [AWQ model(s) for GPU inference.](https://huggingface.co/TheBloke/Yarn-Mistral-7B-128k-AWQ)
- [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Yarn-Mistral-7B-128k-GPTQ)
- [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/Yarn-Mistral-7B-128k-GGUF)
- [NousResearch's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k)
Technical Details
About AWQ
AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference with equivalent or better quality than the most commonly used GPTQ settings. It is supported by multiple inference frameworks such as [Text Generation Webui](https://github.com/oobabooga/text-generation-webui), [vLLM](https://github.com/vllm-project/vllm), [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference), and [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
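For reference, the 4-bit, group-size-128 settings shown in the table above correspond to an AutoAWQ quantisation config along the following lines. This is a minimal sketch of how such a quantised model can be produced with AutoAWQ, not the exact recipe used for this repository; the output directory is illustrative, and the calibration dataset can be overridden if you want to match the wikitext set listed above.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Source model from the "Repositories Available" list; output path is illustrative.
model_path = "NousResearch/Yarn-Mistral-7b-128k"
quant_path = "yarn-mistral-7b-128k-awq"

# 4-bit weights with group size 128, matching the AWQ parameters table above.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ calibration and quantisation (uses AutoAWQ's default calibration data
# unless a custom dataset is passed).
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```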
License
This project is licensed under the Apache 2.0 license.