Yarn Mistral 7B 128K - AWQ
This repository provides AWQ model files for NousResearch's Yarn Mistral 7B 128K, enabling efficient and accurate low-bit weight quantization for various inference scenarios.
Quick Start
This README offers detailed guidance on downloading, installing, and using the AWQ model of Yarn Mistral 7B 128K in different environments. Whether you're using text-generation-webui, vLLM, Hugging Face Text Generation Inference (TGI), or Python code, you can find the corresponding steps here.
Features
- AWQ Quantization: AWQ is an efficient, accurate, and fast low-bit weight quantization method, currently supporting 4-bit quantization. It provides faster inference for Transformer-based models compared to GPTQ, with equivalent or better quality.
- Multiple Inference Environments: Supported by text-generation-webui, vLLM, Hugging Face Text Generation Inference (TGI), and AutoAWQ, offering flexibility for different usage scenarios.
Installation
Install via text-generation-webui
- Ensure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui). It's recommended to use the one-click installers.
- Click the Model tab.
- Under Download custom model or LoRA, enter `TheBloke/Yarn-Mistral-7B-128k-AWQ`.
- Click Download.
- Wait for the download to complete (it will show "Done").
- In the top left, click the refresh icon next to Model.
- In the Model dropdown, choose `Yarn-Mistral-7B-128k-AWQ`.
- Select Loader: AutoAWQ.
- Click Load.
- Optionally, set custom settings, click Save settings for this model, and then Reload the Model.
Install AutoAWQ for Python Inference
Requires [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) 0.1.1 or later.
```shell
pip3 install autoawq
```
If installation with pre-built wheels fails, install from source:
```shell
pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .
```
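To confirm that a suitable AutoAWQ build is installed (0.1.1 or later, as noted above), you can check the installed package version. A small sketch using only the standard library:

```python
# Print the installed AutoAWQ version; it should be 0.1.1 or later.
from importlib.metadata import version

print(version("autoawq"))
```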
Usage Examples
Use in text-generation-webui
After installation, click the Text Generation tab and enter a prompt to start generating text.
Use with vLLM
As a Server
```shell
python3 -m vllm.entrypoints.api_server --model TheBloke/Yarn-Mistral-7B-128k-AWQ --quantization awq
```
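Once the server is up you can send it plain HTTP requests. The snippet below is a minimal sketch, assuming the server's default port (8000) and the /generate endpoint exposed by vllm.entrypoints.api_server; the field names such as max_tokens follow vLLM's sampling parameters. Adjust the host and port if you start the server with different options.

```python
# Query the vLLM API server started above (assumed to be on localhost:8000).
import requests  # pip3 install requests

payload = {
    "prompt": "Tell me about AI",
    "max_tokens": 128,
    "temperature": 0.8,
    "top_p": 0.95,
}
response = requests.post("http://localhost:8000/generate", json=payload)
print(response.json()["text"])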
From Python Code
```python
from vllm import LLM, SamplingParams

prompts = [
    "Tell me about AI",
    "Write a story about llamas",
    "What is 291 - 150?",
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
]
prompt_template = '''{prompt}
'''

prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TheBloke/Yarn-Mistral-7B-128k-AWQ", quantization="awq", dtype="auto")

outputs = llm.generate(prompts, sampling_params)

# Print the prompt and generated text for each output
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
Use with Hugging Face Text Generation Inference (TGI)
Docker Example
Example Docker parameters:
```shell
--model-id TheBloke/Yarn-Mistral-7B-128k-AWQ --port 3000 --quantize awq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
```
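These parameters are passed to the TGI container on top of the usual docker run arguments. A full invocation might look like the sketch below; it assumes the official ghcr.io/huggingface/text-generation-inference image (pick a current tag) and maps the container's --port 3000 to the host, so adjust the image tag, volume path, and ports for your setup.

```shell
# Sketch of a full TGI invocation; image tag, volume, and ports are illustrative.
docker run --gpus all --shm-size 1g -p 3000:3000 -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:1.1.0 \
    --model-id TheBloke/Yarn-Mistral-7B-128k-AWQ --port 3000 --quantize awq \
    --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
```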
Python Example
```python
from huggingface_hub import InferenceClient  # pip3 install huggingface_hub

endpoint_url = "https://your-endpoint-url-here"

prompt = "Tell me about AI"
prompt_template = f'''{prompt}
'''

client = InferenceClient(endpoint_url)
response = client.text_generation(prompt_template,
                                  max_new_tokens=128,
                                  do_sample=True,
                                  temperature=0.7,
                                  top_p=0.95,
                                  top_k=40,
                                  repetition_penalty=1.1)

print("Model output: ", response)
```
Use from Python Code with AutoAWQ
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "TheBloke/Yarn-Mistral-7B-128k-AWQ"

# Load the tokenizer and the quantized model
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True,
                                          trust_remote_code=True, safetensors=True)

prompt = "Tell me about AI"
prompt_template = f'''{prompt}
'''

print("*** Running model.generate:")

token_input = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    token_input,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512
)

# Decode and print the output tokens
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("LLM output: ", text_output)
```
Documentation
Prompt Template
```
{prompt}
```
Provided Files and AWQ Parameters
For the first release of AWQ models, only 128g models are released. 32g models may be added in the future.
| Branch | Bits | GS | AWQ Dataset | Seq Len | Size |
| ------ | ---- | -- | ----------- | ------- | ---- |
| [main](https://huggingface.co/TheBloke/Yarn-Mistral-7B-128k-AWQ/tree/main) | 4 | 128 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 4.15 GB |
Repositories Available
- [AWQ model(s) for GPU inference.](https://huggingface.co/TheBloke/Yarn-Mistral-7B-128k-AWQ)
- [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Yarn-Mistral-7B-128k-GPTQ)
- [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/Yarn-Mistral-7B-128k-GGUF)
- [NousResearch's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k)
Technical Details
About AWQ
AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference with equivalent or better quality than the most commonly used GPTQ settings. It is supported by multiple inference frameworks such as [Text Generation Webui](https://github.com/oobabooga/text-generation-webui), [vLLM](https://github.com/vllm-project/vllm), [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference), and [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
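For reference, the 4-bit, group-size-128 settings shown in the table above correspond to an AutoAWQ quantisation config along the following lines. This is a minimal sketch of how such a quantised model can be produced with AutoAWQ, not the exact recipe used for this repository; the output directory is illustrative, and the calibration dataset can be overridden if you want to match the wikitext set listed above.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Source model from the "Repositories Available" list; output path is illustrative.
model_path = "NousResearch/Yarn-Mistral-7b-128k"
quant_path = "yarn-mistral-7b-128k-awq"

# 4-bit weights with group size 128, matching the AWQ parameters table above.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ calibration and quantisation (uses AutoAWQ's default calibration data
# unless a custom dataset is passed).
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```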
License
This project is licensed under the Apache 2.0 license.