Llava v1.5 13B - AWQ
This repository provides AWQ model files for Llava v1.5 13B, offering efficient, accurate, and fast low-bit weight quantization for inference.
🚀 Quick Start
This README offers detailed instructions on using the AWQ model of Llava v1.5 13B, including installation, inference, and compatibility information.
✨ Features
- AWQ Quantization: AWQ is an efficient, accurate, and fast low-bit weight quantization method, currently supporting 4-bit quantization. It enables faster inference compared to GPTQ and is supported by vLLM and Huggingface Text Generation Inference (TGI).
- Multiple Repositories: There are multiple repositories available for different quantization types and formats, including AWQ, GPTQ, and the original unquantized model.
- Compatibility: The model is compatible with AutoAWQ, vLLM, and TGI, facilitating high-throughput concurrent inference in multi-user server scenarios.
📦 Installation
Install with pip
Requires AutoAWQ 0.1.1 or later.
pip3 install autoawq
If you have problems installing AutoAWQ using the pre-built wheels, install it from source instead:
pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .
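To verify the installation, a quick import check like the following can be used (a minimal sketch; the AutoAWQForCausalLM import matches the usage example later in this README):
# Optional sanity check: confirm that AutoAWQ imports and exposes the
# quantized-model loader class used in the examples below.
from awq import AutoAWQForCausalLM
print("AutoAWQ import OK:", AutoAWQForCausalLM.__name__)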
💻 Usage Examples
Serving from vLLM
When using vLLM as a server, pass the --quantization awq parameter:
python3 -m vllm.entrypoints.api_server --model TheBloke/llava-v1.5-13B-AWQ --quantization awq --dtype half
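Once the server is running, it can be queried over HTTP. The snippet below is a minimal sketch that assumes the demo /generate endpoint and default port 8000 of vllm.entrypoints.api_server; the sampling fields mirror vLLM's SamplingParams:
import requests

# Query the API server started above (adjust host/port to your deployment).
payload = {
    "prompt": "Tell me about AI",
    "max_tokens": 128,
    "temperature": 0.7,
    "top_p": 0.95,
}
response = requests.post("http://localhost:8000/generate", json=payload)
print(response.json()["text"])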
When using vLLM from Python code, pass the quantization="awq" parameter:
from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="TheBloke/llava-v1.5-13B-AWQ", quantization="awq", dtype="half")
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Serving from Text Generation Inference (TGI)
Use TGI version 1.1.0 or later. The official Docker container is: ghcr.io/huggingface/text-generation-inference:1.1.0
Example Docker parameters:
--model-id TheBloke/llava-v1.5-13B-AWQ --port 3000 --quantize awq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
Example Python code for interfacing with TGI (requires huggingface-hub 0.17.0 or later):
pip3 install huggingface-hub
from huggingface_hub import InferenceClient
endpoint_url = "https://your-endpoint-url-here"
prompt = "Tell me about AI"
prompt_template=f'''{prompt}
'''
client = InferenceClient(endpoint_url)
response = client.text_generation(
    prompt_template,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1,
)
print(f"Model output: {response}")
Using the model from Python code (AutoAWQ)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_name_or_path = "TheBloke/llava-v1.5-13B-AWQ"
# Load model
model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True,
                                          trust_remote_code=False, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)
prompt = "Tell me about AI"
prompt_template=f'''{prompt}
'''
print("\n\n*** Generate:")
tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()
# Generate output
generation_output = model.generate(
    tokens,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512
)
print("Output: ", tokenizer.decode(generation_output[0]))
"""
# Inference should be possible with transformers pipeline as well in future
# But currently this is not yet supported by AutoAWQ (correct as of September 25th 2023)
from transformers import pipeline
print("*** Pipeline:")
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
max_new_tokens=512,
do_sample=True,
temperature=0.7,
top_p=0.95,
top_k=40,
repetition_penalty=1.1
)
print(pipe(prompt_template)[0]['generated_text'])
"""
📚 Documentation
Prompt template: llava 1.5
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: <image>{prompt}
ASSISTANT:
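For reference, a small helper along these lines (hypothetical, not part of the original card) can assemble the template around a user question:
# Hypothetical helper that wraps a question in the llava 1.5 template shown
# above; the <image> token marks where the image embedding is inserted.
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)

def build_llava_prompt(user_question: str) -> str:
    return f"{SYSTEM_PROMPT}\nUSER: <image>{user_question}\nASSISTANT:"

print(build_llava_prompt("What is shown in this image?"))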
Provided files, and AWQ parameters
For the first release of AWQ models, only 128g (group size 128) models are provided. 32g models may be added later if there is interest. Models are released as sharded safetensors files.
| Branch | Bits | GS | AWQ Dataset | Seq Len | Size |
| --- | --- | --- | --- | --- | --- |
| main | 4 | 128 | wikitext | 4096 | 7.25 GB |
Repositories available
- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantization parameter options.
- Haotian Liu's original unquantized fp16 model in PyTorch format, for GPU inference and for further conversions.
🔧 Technical Details
About AWQ
AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference.
It is also now supported by the continuous batching server vLLM, allowing use of Llama AWQ models for high-throughput concurrent inference in multi-user server scenarios.
As of September 25th 2023, preliminary Llama-only AWQ support has also been added to Huggingface Text Generation Inference (TGI).
Note that, at the time of writing, overall throughput is still lower than running vLLM or TGI with unquantized models; however, using AWQ enables much smaller GPUs, which can lead to easier deployment and overall cost savings. For example, a 70B model can be run on 1 x 48GB GPU instead of 2 x 80GB.
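As a rough illustration of that sizing claim, the back-of-the-envelope sketch below counts weight storage only (ignoring activations, KV cache, and runtime overhead):
# Back-of-the-envelope weight-memory estimate for a 70B-parameter model.
params = 70e9
fp16_gib = params * 2 / 2**30        # 2 bytes per weight -> ~130 GiB
awq_4bit_gib = params * 0.5 / 2**30  # 4 bits per weight  -> ~33 GiB
print(f"fp16 weights:  ~{fp16_gib:.0f} GiB")
print(f"4-bit weights: ~{awq_4bit_gib:.0f} GiB")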
Compatibility
The files provided are tested to work with AutoAWQ, vLLM, and Huggingface Text Generation Inference (TGI).
TGI merged AWQ support on September 25th, 2023 (TGI PR #1054). Use the :latest Docker container until the next TGI release is made.
📄 License
Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
Discord
For further support, and discussions on these models and AI in general, join us at: TheBloke AI's Discord server
Thanks, and how to contribute
Thanks to the chirper.ai team! Thanks to Clay from gpus.llm-utils.org!
If you're able and willing to contribute, it will be most gratefully received and will help to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.
- Patreon: https://patreon.com/TheBlokeAI
- Ko-Fi: https://ko-fi.com/TheBlokeAI
Special thanks to: Aemon Algiz. Patreon special mentions: Pierre Kircher, Stanislav Ovsiannikov, etc.
Thank you to all generous patrons and donaters! And thank you again to a16z for their generous grant.
Original model card: Haotian Liu's Llava v1.5 13B
Model details
| Property | Details |
| --- | --- |
| Model Type | LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture. |
| Model Date | LLaVA-v1.5-13B was trained in September 2023. |
| Paper or resources for more information | https://llava-vl.github.io/ |
Intended use
| Property | Details |
| --- | --- |
| Primary intended uses | The primary use of LLaVA is research on large multimodal models and chatbots. |
| Primary intended users | The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence. |