Llava v1.5 13B - AWQ
This repository provides AWQ model files for Llava v1.5 13B, offering efficient, accurate, and fast low-bit weight quantization for inference.
🚀 Quick Start
This README offers detailed instructions on using the AWQ model of Llava v1.5 13B, including installation, inference, and compatibility information.
✨ Features
- AWQ Quantization: AWQ is an efficient, accurate, and fast low-bit weight quantization method, currently supporting 4-bit quantization. It enables faster inference compared to GPTQ and is supported by vLLM and Huggingface Text Generation Inference (TGI).
- Multiple Repositories: There are multiple repositories available for different quantization types and formats, including AWQ, GPTQ, and the original unquantized model.
- Compatibility: The model is compatible with AutoAWQ, vLLM, and TGI, facilitating high-throughput concurrent inference in multi-user server scenarios.
📦 Installation
Install with pip
Requires AutoAWQ 0.1.1 or later.
pip3 install autoawq
If you have problems installing AutoAWQ using the pre-built wheels, install it from source instead:
pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .
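To verify the installation, a quick import check like the following can be used (a minimal sketch; the AutoAWQForCausalLM import matches the usage example later in this README):
# Optional sanity check: confirm that AutoAWQ imports and exposes the
# quantized-model loader class used in the examples below.
from awq import AutoAWQForCausalLM
print("AutoAWQ import OK:", AutoAWQForCausalLM.__name__)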
💻 Usage Examples
Serving from vLLM
When using vLLM as a server, pass the --quantization awq parameter:
python3 -m vllm.entrypoints.api_server --model TheBloke/llava-v1.5-13B-AWQ --quantization awq --dtype half
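Once the server is running, it can be queried over HTTP. The snippet below is a minimal sketch that assumes the demo /generate endpoint and default port 8000 of vllm.entrypoints.api_server; the sampling fields mirror vLLM's SamplingParams:
import requests

# Query the API server started above (adjust host/port to your deployment).
payload = {
    "prompt": "Tell me about AI",
    "max_tokens": 128,
    "temperature": 0.7,
    "top_p": 0.95,
}
response = requests.post("http://localhost:8000/generate", json=payload)
print(response.json()["text"])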
When using vLLM from Python code, pass the quantization="awq" parameter:
from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="TheBloke/llava-v1.5-13B-AWQ", quantization="awq", dtype="half")
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Serving from Text Generation Inference (TGI)
Use TGI version 1.1.0 or later. The official Docker container is: ghcr.io/huggingface/text-generation-inference:1.1.0
Example Docker parameters:
--model-id TheBloke/llava-v1.5-13B-AWQ --port 3000 --quantize awq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
Example Python code for interfacing with TGI (requires huggingface-hub 0.17.0 or later):
pip3 install huggingface-hub
from huggingface_hub import InferenceClient
endpoint_url = "https://your-endpoint-url-here"
prompt = "Tell me about AI"
prompt_template=f'''{prompt}
'''
client = InferenceClient(endpoint_url)
response = client.text_generation(
    prompt_template,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1,
)
print(f"Model output: {response}")
Using the model from Python code (AutoAWQ)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_name_or_path = "TheBloke/llava-v1.5-13B-AWQ"
# Load model
model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True,
                                          trust_remote_code=False, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)
prompt = "Tell me about AI"
prompt_template=f'''{prompt}
'''
print("\n\n*** Generate:")
tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()
# Generate output
generation_output = model.generate(
    tokens,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512
)
print("Output: ", tokenizer.decode(generation_output[0]))
"""
# Inference should be possible with transformers pipeline as well in future
# But currently this is not yet supported by AutoAWQ (correct as of September 25th 2023)
from transformers import pipeline
print("*** Pipeline:")
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
max_new_tokens=512,
do_sample=True,
temperature=0.7,
top_p=0.95,
top_k=40,
repetition_penalty=1.1
)
print(pipe(prompt_template)[0]['generated_text'])
"""
📚 Documentation
Prompt template: llava 1.5
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: <image>{prompt}
ASSISTANT:
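For reference, a small helper along these lines (hypothetical, not part of the original card) can assemble the template around a user question:
# Hypothetical helper that wraps a question in the llava 1.5 template shown
# above; the <image> token marks where the image embedding is inserted.
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)

def build_llava_prompt(user_question: str) -> str:
    return f"{SYSTEM_PROMPT}\nUSER: <image>{user_question}\nASSISTANT:"

print(build_llava_prompt("What is shown in this image?"))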
Provided files, and AWQ parameters
For the first release of AWQ models, only 128g (group size 128) models are provided. 32g models may be added later if there is interest. Models are released as sharded safetensors files.
| Branch | Bits | GS | AWQ Dataset | Seq Len | Size |
| --- | --- | --- | --- | --- | --- |
| main | 4 | 128 | wikitext | 4096 | 7.25 GB |
Repositories available
- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantization parameter options.
- Haotian Liu's original unquantized fp16 model in PyTorch format, for GPU inference and for further conversions.
🔧 Technical Details
About AWQ
AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference.
It is also now supported by the continuous batching server vLLM, allowing use of Llama AWQ models for high-throughput concurrent inference in multi-user server scenarios.
As of September 25th 2023, preliminary Llama-only AWQ support has also been added to Huggingface Text Generation Inference (TGI).
Note that, at the time of writing, overall throughput is still lower than running vLLM or TGI with unquantized models; however, using AWQ enables much smaller GPUs, which can lead to easier deployment and overall cost savings. For example, a 70B model can be run on 1 x 48GB GPU instead of 2 x 80GB.
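As a rough illustration of that sizing claim, the back-of-the-envelope sketch below counts weight storage only (ignoring activations, KV cache, and runtime overhead):
# Back-of-the-envelope weight-memory estimate for a 70B-parameter model.
params = 70e9
fp16_gib = params * 2 / 2**30        # 2 bytes per weight -> ~130 GiB
awq_4bit_gib = params * 0.5 / 2**30  # 4 bits per weight  -> ~33 GiB
print(f"fp16 weights:  ~{fp16_gib:.0f} GiB")
print(f"4-bit weights: ~{awq_4bit_gib:.0f} GiB")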
Compatibility
The files provided are tested to work with AutoAWQ, vLLM, and Huggingface Text Generation Inference (TGI).
TGI merged AWQ support on September 25th, 2023 (TGI PR #1054). Use the :latest Docker container until the next TGI release is made.
📄 License
Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
Discord
For further support, and discussions on these models and AI in general, join us at: TheBloke AI's Discord server
Thanks, and how to contribute
Thanks to the chirper.ai team! Thanks to Clay from gpus.llm-utils.org!
If you're able and willing to contribute, it will be most gratefully received and will help to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.
- Patreon: https://patreon.com/TheBlokeAI
- Ko-Fi: https://ko-fi.com/TheBlokeAI
Special thanks to: Aemon Algiz. Patreon special mentions: Pierre Kircher, Stanislav Ovsiannikov, etc.
Thank you to all generous patrons and donaters! And thank you again to a16z for their generous grant.
Original model card: Haotian Liu's Llava v1.5 13B
Model details
| Property | Details |
| --- | --- |
| Model Type | LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture. |
| Model Date | LLaVA-v1.5-13B was trained in September 2023. |
| Paper or resources for more information | https://llava-vl.github.io/ |
Intended use
| Property | Details |
| --- | --- |
| Primary intended uses | The primary use of LLaVA is research on large multimodal models and chatbots. |
| Primary intended users | The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence. |